RE: st: Selecting part of a LARGE file
Glenn Hoetker wrote
>> I have two files. File A has about 5000 unique values of
>> the variable
>> PATENT, which is 7 characters long. File B has 16 million
>> and several million unique values for PATENT. I want to do some
>> manipulation involving File B, but only for the observations that
>> correspond to the patent values found in File A. I am
>> currently using
>> merge on the two files to do this (actually mmerge as a wrapper for
>> ease), but wonder if there is an easier/faster way.
>> I attempted using vallist.ado in File A to generate a long
>> local macro
>> (say, _useme) and then doing
>> use FileB if index(patent, "`useme'")
>> I get 0 observations in this case (even though I know there are
>> matches). From the manual, it appears that index is
>> limited to strings
>> of 80 characters, anyway.
and I replied
>-vallist- is Patrick Joly's program.
>Quite apart from the 80 characters limit, what it does
>does nothing to help with your problem.
>Stripping down to a miniature analogue, suppose you have
>a string variable -myvar- which takes on distinct values
>"a" "b" "c".
>-vallist myvar- will return that set of values as a
>space-separated list, i.e.
>"a b c"
>If you then say
>... if index(myvar,"a b c")
>then this is true for _none_ of the observations;
>naturally, you report the same for your dataset.
David Kantor commented
> Putting that aside, and putting aside the 80-character
> limitation, the
> reason that
> use FileB if index(patent, "`useme'")
> gets no matches at all (when you do expect some) is that it
> should be...
> use FileB if index("`useme'", patent)
> -- the arguments are reversed.
> Nick Cox replied that you should expect no matches; he
> didn't say why.
David adds an important detail in explaining what went
wrong. Let me fill in the gap I apparently left.
Let's recap on what -index()- does, for string
expressions s1 and s2:
-index(s1,s2)- returns the position in s1 at which s2 is
first found or 0 if s1 does not contain s2.
David's example helps underline that this is
a function in which the order of arguments
does, typically, matter.
In my toy example with values "a" "b" "c"
-vallist- would give the composite list
"a b c", and
index("a b c", "a")
index("a b c", "b")
index("a b c", "c")
as particular instances of -index("a b c", myvar)-
are all non-zero (or treated as true). Conversely,
index("a", "a b c")
index("b", "a b c")
index("c", "a b c")
as particular instances of -index(myvar, "a b c")-
are all zero (or treated as false), as in no
case is the composite string ever contained in
any of the elements.
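To check this interactively, here is a minimal sketch using the
toy values above. The variable names in_list and reversed are
made up for illustration.

```stata
* Toy data: myvar takes the values "a", "b", "c"
clear
input str1 myvar
"a"
"b"
"c"
end

* Composite list first, variable second: true for every observation
generate byte in_list = index("a b c", myvar) > 0

* Variable first, composite list second: false for every observation
generate byte reversed = index(myvar, "a b c") > 0

list
```

With the composite list as the first argument, in_list is 1 in
all three observations; with the arguments reversed, reversed
is 0 throughout.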
As Glenn noted, the 80-character limit means that
-index("<value list>", varname)- is a practical
method only for very restricted problems.
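For comparison, the -merge- route that Glenn describes might
look roughly like this. This is a sketch only: the file names
FileA and FileB and the variable name patent are taken from his
description, the -merge m:1- syntax assumes Stata 11 or later,
and nothing here has been timed on files of that size.

```stata
* Assumed names: FileA.dta, FileB.dta, string variable patent in both
* Build a key file holding each patent value from File A once
use patent using FileA, clear
duplicates drop
tempfile keys
save `keys'

* Keep only the File B observations whose patent appears in the key file
use FileB, clear
merge m:1 patent using `keys', keep(match) nogenerate
```

The key file holds only one variable and a few thousand rows,
so the expensive step remains reading File B itself.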