RE: st: Extracting data from mixed string


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Extracting data from mixed string
Date   Sun, 8 Feb 2004 17:41:33 -0000

Philip Ryan responding to Paul O'Brien f

> If it is possible for more than one to exist, then things get a bit
> trickier, more so if we don't know how many drugs (or items) might be
> listed for each subject.  Here is a more general solution. It involves
> separating the drug names with spaces, not commas; making new variables
> to accommodate each item in the original list; -reshape-ing the data to
> long form.  (This is Stata8, I _think_ it's OK in Stata7).
> 
> 
> gen byte id=_n   // make an identifier for each observation
> 
> replace  drug = subinstr(drug, "," , " " , .)  // substitute spaces for commas
> 
> gen byte wc=wordcount(drug)  // how many drugs does each subject get?
> 
> qui summ wc   // r(max) will hold the most drugs used (in these data, 5)
> 
> forvalues i = 1/`r(max)' {    // manufacture r(max) = 5 new variables
>   gen drug`i' = ""            // initialize variables to missing for each subj
>   }
> 
> order id   // cosmetic, I just like ids to be first!
> 
> drop wc    // don't need it anymore
> 
> forvalues i = 1/`r(max)' {             // peek names from original variable
>   replace drug`i' = word(drug,`i')     // and poke them into new variables
>   }
> 
> drop drug  // don't really need it anymore; if you do, then make a copy
> 
> reshape long drug, i(id) j(drug_order) string
> 
> li in 1/10
> 
>      +-------------------------+
>      | id   drug_o~r      drug |
>      |-------------------------|
>   1. |  1          1      NONE |
>   2. |  1          2           |
>   3. |  1          3           |
>   4. |  1          4           |
>   5. |  1          5           |
>      |-------------------------|
>   6. |  2          1      NONE |
>   7. |  2          2      NONE |
>   8. |  2          3      NONE |
>   9. |  2          4    MATES6 |
>  10. |  2          5   NATURAL |
>      +-------------------------+
> 
> You can now edit this long data set in any way you see fit, for example,
> -drop-ping observations that don't meet your criteria for OC.  Of course,
> with 41,000 original observations, you end up with 41000 x 5 ~ 200000
> observations, which might be a problem depending on your system.....

For Stata 8, an alternative is available using -split-. 
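I'd not use a byte identifier with your dataset size: 
a byte holds integers only up to 100, so identifiers 
beyond observation 100 would be stored as missing. 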

gen id = _n  
split drug, p(,) 
keep id `r(varlist)'  
reshape long drug, i(id) j(drug_order)
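
The point about the identifier type can be checked on a toy dataset. 
This is only a sketch: the variable names id_byte and id_long and the 
choice of 150 observations are made up for illustration and are not 
from the thread. 

```stata
* Toy data: 150 empty observations (hypothetical size)
clear
set obs 150

* A byte stores integers from -127 to 100, so _n > 100 does not fit
gen byte id_byte = _n

* A long comfortably holds one identifier per observation
gen long id_long = _n

* Count the identifiers that were lost to missing
count if id_byte >= .
```

With a byte, observations 101 onwards get a missing identifier, which 
would make -reshape- fail or mix subjects up; -gen id = _n- (a float by 
default) or -gen long id = _n- avoids that at this dataset size. 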

For Stata 7, you could use the unofficial -split- 
from SSC: 

ssc inst split 
gen id = _n 
split drug, p(,) 
drop drug 
keep id drug* 
reshape long drug, i(id) j(drug_order) 
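
Either version leaves one row per subject for every possible slot, so 
most rows will hold an empty string. A possible follow-up, sketched 
here on made-up data (the three example strings and the use of -long- 
are assumptions, not part of the original question), is to trim stray 
spaces and drop the empty rows: 

```stata
* Hypothetical toy data standing in for the real drug variable
clear
input str30 drug
"NONE"
"MATES6, NATURAL"
"PILL,IUD,NATURAL"
end

gen long id = _n

* Official -split- (Stata 8): pieces go into drug1, drug2, ...
split drug, p(,)
keep id `r(varlist)'

reshape long drug, i(id) j(drug_order)

* Pieces keep any space that followed a comma, so trim them
replace drug = trim(drug)

* Remove the padding rows created for subjects with fewer items
drop if drug == ""

list, sepby(id)
```

That keeps only rows with an actual item, which should ease the worry 
about ending up with roughly 200000 observations. 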

For wider discussions see 
http://www.stata.com/support/faqs/data/splitstr.html
http://www.stata.com/support/faqs/data/multresp.html

Nick 
n.j.cox@durham.ac.uk 
