Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Extracting data from mixed string


From   Philip Ryan <philip.ryan@adelaide.edu.au>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Extracting data from mixed string
Date   Sun, 8 Feb 2004 17:22:05 +1030

James Hesford answered the question posed by Paul O'Brien for the simple case
where only one OC occurred for each observation.  If it is possible for more
than one to exist, then things get a bit trickier, more so if we don't know how
many drugs (or items) might be listed for each subject.  Here is a more general
solution. It involves separating the drug names with spaces, not commas; making
new variables to accommodate each item in the original list; -reshape- ing the
data to long form.  (This is Stata8, I _think_ it's OK in Stata7).


gen byte id=_n   // make an identifier for each observation

replace  drug = subinstr(drug, "," , " " , .)  // substitute spaces for commas

gen byte wc=wordcount(drug)  // how many drugs does each subject get?

qui summ wc   //    r(max) will hold the most drugs used (in these data, 5)

forvalues i = 1/`r(max)' {    // manufacture r(max) = 5 new variables
  gen drug`i' = ""            // initialize variables to missing for each subj
  }

order id   //cosmetic, I just like ids to be first!

drop wc    // don't need it anymore

forvalues i = 1/`r(max)' {             // peek names from original variable
  replace drug`i' = word(drug,`i')     // and poke them into new variables
  }

drop drug  // don't really need it anymore; if you do, then make a copy

reshape long drug, i(id) j(drug_order) string

li in 1/10

     +-------------------------+
     | id   drug_o~r      drug |
     |-------------------------|
  1. |  1          1      NONE |
  2. |  1          2           |
  3. |  1          3           |
  4. |  1          4           |
  5. |  1          5           |
     |-------------------------|
  6. |  2          1      NONE |
  7. |  2          2      NONE |
  8. |  2          3      NONE |
  9. |  2          4    MATES6 |
 10. |  2          5   NATURAL |
     +-------------------------+

You can now edit this long data set in any way you see fit, for example, -drop-
ping observations that don't meet your criteria for OC.  Of course, with 41,000
original observations, you end up with 41000 x 5 ~ 200000 observations, which
might be a problem depending on your system.....


Phil


Quoting Paul O'Brien <paul.obrien@afe2.org.uk>:

> Database of 41,000 clinic visits with 26 variables for each visit. One 
> variable  is a string (stk) of contraceptives prescribed, containing 1 
> to 4 types, separated by commas.
> 
> stk
> NONE,OVYS
> NONE,NONE,NONE,MATES6,NATURAL
> NONE,MICROG
> NONE,NONE
> NONE,FEMODE,NATURAL
> NONE,NONE
> LOGED
> NONE,NONE,MICROG,MATES6,NATURAL
> NONE
> NONE,LEVONEL,MICROG
> MICROG
> NONE,NONE,LOG
> NONE,NONE,NONE
> NONE,MATES6,NATURAL
> NONE,NONE
> MICROG,ULTRA
> NATURAL
> NONE,MICROG,NATURAL
> 
> Some are oral contraceptives:OVYS   MICROG  FEMODE, LOGED
> 
> The position of the oral contraceptives in the string varies.
> 
> I want all the oral contraceptives in one column:
> 
> OVYS
> blank
> MATES6
> MICROG
> blank
> FEMO
> blank
> LOGED
> 
> How?
> 
> Help gratefull received.
> 
> Paul
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 


-- 
Philip Ryan
Associate Professor
Department of Public Health
University of Adelaide
5005 South Australia
AUSTRALIA
CRICOS Provider Number 00123M
----------------------------------------------------------- 
This email message is intended only for the addressee(s) 
and contains information that may be confidential and/or 
copyright.  If you are not the intended recipient please 
notify the sender by reply email and immediately delete 
this email. Use, disclosure or reproduction of this email 
by anyone other than the intended recipient(s) is strictly 
prohibited. No representation is made that this email or 
any attachments are free of viruses. Virus scanning is 
recommended and is the responsibility of the recipient.
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index