Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Finding matching strings across vars

From	Steve Nakoneshny <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Finding matching strings across vars
Date	Fri, 14 Jun 2013 13:33:57 -0600

To close out the thread:

The code provided by Sergiy worked beautifully. I had to make two very small modifications: v2 and v3 did not need to be -split-, so I tidied them up with -replace v2 = trim(itrim(v2))- instead, and my parsing string for v1 was "///" rather than "/". In the end, I have a list of 1,434 genes matched to either v2 or v3.
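
For anyone following along, here is a minimal sketch of those two changes on made-up data. The gene names and the in_v2/in_v3 indicator names are hypothetical, and I have swapped in the newer -merge 1:1- syntax (Stata 11 or later) in place of the old-style -merge ..., nokeep- from the original code, so treat it as an illustration rather than the exact code that was run.

```stata
* Toy data (hypothetical names); v1 uses "///" between alternate gene names
clear
input str20 v1 str10 v2 str10 v3
"A///B"  " C "  "D"
"A///B"  "B"    "E"
"F"      "G"    " A"
"H"      "K"    "M"
end

tempfile full f2 f3
save "`full'"

* v2 and v3 hold one name per observation: tidy whitespace instead of -split-
foreach k in 2 3 {
    use "`full'", clear
    keep v`k'
    replace v`k' = trim(itrim(v`k'))
    drop if missing(v`k')
    rename v`k' gen
    duplicates drop gen, force
    save "`f`k''"
}

* v1: drop repeats, split on "///", reshape long, keep unique names
use "`full'", clear
keep v1
duplicates drop v1, force
split v1, generate(gen) parse("///")
drop v1
gen v = _n
reshape long gen, i(v)
drop if missing(gen)
drop v _j
duplicates drop gen, force

* Flag v1 names that also occur in v2 or v3 (3 = found in both files)
merge 1:1 gen using "`f2'", keep(master match) generate(in_v2)
merge 1:1 gen using "`f3'", keep(master match) generate(in_v3)

list gen if in_v2 == 3 | in_v3 == 3, clean noobs
```

With these toy values the final listing should contain A (found in v3) and B (found in v2).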

Steve

On 2013-06-13, at 10:56 PM, [email protected] wrote:

> Thanks Sergiy. I'll give it a go when I get to the office in the morning. Your code looks like it should work perfectly. I'll report back afterwards.
> 
> Steve
> 
> Sent via carrier pigeon
> 
> On 2013-06-13, at 8:39 PM, "Sergiy Radyakin" <[email protected]> wrote:
> 
>> clear
>> input str10 v1 str10 v2 str10 v3
>>   A    B    C
>>   A    "B/C"    C
>>   B    D    F
>>   A    E    K
>>   A    "M/N"    "N/K"
>>   G    M    "B/G"
>> end
>> 
>> quietly compress
>> list
>> 
>> tempfile full f2 f3
>> 
>>   save "`full'"
>> 
>>   keep v2
>>   split v2, generate(gen) parse("/")
>>   drop v2
>>   gen v=_n
>>   reshape long gen, i(v)
>>   drop if missing(gen)
>>   drop v _j
>>   duplicates drop gen, force
>>   sort gen
>>   save "`f2'"
>> 
>>   use "`full'"
>>   keep v3
>>   split v3, generate(gen) parse("/")
>>   drop v3
>>   gen v=_n
>>   reshape long gen, i(v)
>>   drop if missing(gen)
>>   drop v _j
>>   duplicates drop gen, force
>>   sort gen
>>   save "`f3'"
>> 
>>   use "`full'"
>>   keep v1
>>   split v1, generate(gen) parse("/")
>>   drop v1
>>   gen v=_n
>>   reshape long gen, i(v)
>>   drop if missing(gen)
>>   drop v _j
>>   duplicates drop gen, force
>> 
>>   sort gen
>>   merge gen using `f2', nokeep
>>   tab _merge
>>   rename _merge v2
>> 
>>   sort gen
>>   merge gen using `f3', nokeep
>>   tab _merge
>>   rename _merge v3
>> 
>>   display "Genes from variable v1 that are also mentioned in variables v2 or v3:"
>>   list gen if (v2==3) | (v3==3), clean noobs
>> 
>> On Thu, Jun 13, 2013 at 7:05 PM, Sergiy Radyakin <[email protected]> wrote:
>>> Does this capture the essence of the problem?
>>> 
>>> clear
>>> input str10 v1 str10 v2 str10 v3
>>>   A    B    C
>>>   A    "B/C"    C
>>>   B    D    F
>>>   A    E    K
>>>   A    "M/N"    "N/K"
>>>   G    M    "B/G"
>>> end
>>> 
>>> quietly compress
>>> list
>>> 
>>> On Thu, Jun 13, 2013 at 7:05 PM, Sergiy Radyakin <[email protected]> wrote:
>>>> The estimate is something like:
>>>> 25 minutes to create a good illustrative test dataset with
>>>> 
>>>> clear
>>>> input ...
>>>> end
>>>> 
>>>> then another 10 minutes for the solution.
>>>> 
>>>> On Thu, Jun 13, 2013 at 6:54 PM, Steve Nakoneshny <[email protected]> wrote:
>>>>> Dear Statalist,
>>>>> 
>>>>> A colleague has provided me with an Excel file of 3 vars and 43,510 obs of gene names (all strings, all uppercase). Each var represents a different list of genes and he has asked me if there is an "easy" way in Stata to find out if any of the genes listed in var1 also appear in var2 and/or var3. To further complicate matters, the obs in var1 are non-unique and many have multiple alternate gene names like "ANKRD20A13P///ANKRD20A4///ANKRD20A2///ANKRD20A3///ANKRD20A11P///ANKRD20A9P///ANKRD20A1" embedded into the same obs.
>>>>> 
>>>>> In visualising a plan of attack, I'm thinking I need to read in var1, drop duplicates, split the longer obs parsing on "///", reshape long and drop duplicates once again to arrive at a single-var list of unique gene names. This next step is where my plan starts to break down. I'm leaning towards appending the Excel file again to read in var2 and var3, but then I'm not 100% sure how to search for matches across each var or how to readily identify them once I do.
>>>>> 
>>>>> Any comments or suggestions would be greatly appreciated.
>>>>> 
>>>>> Steve
>>>>> 
>>>>> *
>>>>> *   For searches and help try:
>>>>> *   http://www.stata.com/help.cgi?search
>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>> *   http://www.ats.ucla.edu/stat/stata/



