
RE: st: Selecting part of a LARGE file


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: Selecting part of a LARGE file
Date   Sat, 7 Jun 2003 17:37:29 +0100

Glenn Hoetker wrote

>> I have two files.  File A has about 5000 unique values of the variable
>> PATENT, which is 7 characters long.  File B has 16 million observations
>> and several million unique values for PATENT.  I want to do some
>> manipulation involving File B, but only for the observations that
>> correspond to the patent values found in File A.   I am currently using
>> merge on the two files to do this (actually mmerge as a wrapper for
>> ease), but wonder if there is an easier/faster way.
>>
>> I attempted using vallist.ado in File A to generate a long local macro
>> (say, _useme) and then doing
>>
>> 	use FileB if index(patent, "`useme'")
>>
>> I get 0 observations in this case (even though I know there are some
>> matches).  From the manual, it appears that index is limited to strings
>> of 80 characters, anyway.

and I replied

>-vallist- is Patrick Joly's program.
>
>Quite apart from the 80 characters limit, what it does
>does nothing to help with your problem.
>
>Stripping down to a miniature analogue, suppose you have
>a string variable -myvar- which takes on distinct values
>"a" "b" "c".
>
>-vallist myvar- will return that set of values as a
>space-separated list, i.e.
>
>"a b c"
>
>If you then say
>
>... if index(myvar,"a b c")
>
>then this is true for _none_ of the observations;
>naturally, you report the same for your dataset.

<snip>

David Kantor commented
>
> Putting that aside, and putting aside the 80-character limitation, the
> reason that
>    use FileB if index(patent, "`useme'")
> gets no matches at all (when you do expect some) is that it
> should be...
>    use FileB if index("`useme'", patent)
>
> -- the arguments are reversed.
>
> Nick Cox replied that you should expect no matches; he didn't say why.

David adds an important detail to explaining what went
wrong. Let me fill in the gap I apparently left.

Let's recap on what -index()- does, for string
expressions s1 and s2:

-index(s1,s2)- returns the position in s1 at which s2 is
first found or 0 if s1 does not contain s2.

David's example helps underline that this is
a function in which the order of arguments
does, typically, matter.

In my toy example with values "a" "b" "c"
-vallist- would give the composite list
"a b c", and

index("a b c", "a")

index("a b c", "b")

index("a b c", "c")

as particular instances of -index("a b c", myvar)-
return 1, 3 and 5 respectively, so all are non-zero
(or treated as true). Conversely,

index("a", "a b c")

index("b", "a b c")

index("c", "a b c")

as particular instances of -index(myvar, "a b c")-
are all zero (or treated as false), as in no
case is the composite string ever contained in
any of the elements.
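
The toy example can be checked directly. What follows is only a
sketch: the four-observation dataset, including the extra value "d",
is made up for illustration and is not from Glenn's data. Note also
that later Stata releases document this function as -strpos()-, with
-index()- kept as an older name.

```stata
* index(s1, s2): s1 is the string searched, s2 the string searched for.
display index("a b c", "a")    // 1
display index("a b c", "b")    // 3
display index("a b c", "c")    // 5
display index("a", "a b c")    // 0: the list is not contained in "a"

* Hypothetical toy dataset: three listed values plus one that is not listed.
clear
input str1 myvar
"a"
"b"
"c"
"d"
end

* List first, variable second: true for a, b and c, false for d.
count if index("a b c", myvar)    // 3

* Variable first, list second: never true.
count if index(myvar, "a b c")    // 0
```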

As Glenn noted, the 80-character limit stops
-index("<value list>", varname)- being a practical
method for all but restricted problems: a list of about
5000 seven-character values, space-separated, runs to
roughly 40,000 characters.
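
For a list that long, the -merge- route Glenn mentions avoids the
string limit altogether. A minimal sketch, under stated assumptions:
the file names FileA and FileB stand in for Glenn's two files, -patent-
is stored under the same name and as a string in both, File A holds
one observation per patent, and the release in use supports the
-merge 1:m- syntax, which is newer than this thread.

```stata
* Sketch only: keep the observations of FileB whose patent occurs in FileA.
* Assumptions: patent has the same name and string type in both files,
* and FileA has exactly one observation per patent value.
use patent using FileA, clear

* keep(match) retains only observations found in both files;
* nogenerate suppresses the _merge indicator variable.
merge 1:m patent using FileB, keep(match) nogenerate
```

This still reads all of File B during the merge, so it addresses
correctness rather than speed.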

Nick
[email protected]
