Statalist



Re: st: substringing long, varying length text variables into individual variables


From   "Tom Trikalinos" <ttrikalin@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: substringing long, varying length text variables into individual variables
Date   Wed, 2 Apr 2008 17:47:18 -0400

Hi Todd,

I use Stata 8 and I cannot touch strings with more than 80 chars. If
I'm not mistaken this is not the case in newer releases, so I use a
dummy example with fewer than 80 chars below. I presume this would
work in Stata 9/10; if not, don't shoot me.

e.g., use these three records (record number, then the string, held
in a variable called originalString):

1   Str A|Str B|Str C|Str D
2   Str A|Str B|Str D
3   Str D

A. change the spaces to underscores in a clone variable, and then the
"|" to spaces. The -wordcount- and -word- functions of Stata use
spaces to parse (if someone knows how to use a different separator in
these functions, this step is superfluous.)

. gen newString = subinstr(originalString," ", "_",.)
. replace newString = subinstr(newString,"|", " ",.)
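
On the separator question in A: a minimal sketch, assuming your release
ships the -split- command and that the variable is called originalString
as above. -split- takes the separator as an option, so the underscore
detour is not needed; the stub des_ is only chosen to match the names
used in step B.

```stata
* Sketch: split on "|" directly; creates des_1, des_2, ... as needed.
* Assumes the string variable is named originalString.
split originalString, parse("|") generate(des_)
```

Embedded spaces are left untouched this way, so step C would not be
needed either.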

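B. get the maximum number of "words" per record and loop up to it
(-summarize- leaves the maximum in r(max); howMany itself is a
variable, not a local macro, so it cannot be used as the loop limit)

. gen howMany = wordcount(newString)
. summ howMany
. local maxWords = r(max)
. forval i = 1/`maxWords' {
.       gen des_`i' = word(newString,`i')
. }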


This gives you

des_1   des_2   des_3   des_4

Str_A   Str_B   Str_C   Str_D
Str_A   Str_B   Str_D
Str_D

C.  You can easily restore the spaces in the strings in des_1 to des_4
by changing the underscores back to spaces.
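
A minimal sketch of step C, assuming the new variables are named des_1,
des_2, ... as above:

```stata
* Sketch: turn underscores back into spaces in every des_* variable.
foreach v of varlist des_* {
    replace `v' = subinstr(`v', "_", " ", .)
}
```

Note that any underscore that was already in the original text would
also become a space.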

D. However, are you sure you want this as a final step? If you want to
have e.g. 4 dummies (one for Str_A, one for Str_B etc.):

str_a   str_b   str_c   str_d
1       1       1       1
1       1       0       1
0       0       0       1

you would have to continue with reshaping long per record and then
reshaping wide again per content of the string variable... Some
-encode-ing will probably be necessary as well along the way...
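
If the list of terms is short and known in advance, a different and
simpler technique than the reshape route is to test for each term
directly. This is only a sketch for the toy example: the variable names
str_a to str_d and the helper variable padded are made up for
illustration, and the terms are hard-coded.

```stata
* Sketch: one 0/1 indicator per term, without reshaping.
* Padding with "|" on both ends makes each term match only as a whole item.
* strpos() is the Stata 9+ name; on Stata 8 the same function is index().
gen padded = "|" + originalString + "|"
gen byte str_a = strpos(padded, "|Str A|") > 0
gen byte str_b = strpos(padded, "|Str B|") > 0
gen byte str_c = strpos(padded, "|Str C|") > 0
gen byte str_d = strpos(padded, "|Str D|") > 0
drop padded
```

With many distinct terms the hard-coded lines get tedious, and the
reshape route described above scales better.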


That being said I'd do the "silly way" (using python or vim or sed) to
manipulate the strings outside Stata...


hth

tom





Todd Wagner <twagner@stanford.edu> wrote:
> Hi,
>
>  I have data from a publicly available database
>  (clinicaltrials.gov).  This database has a number of text variables
>  that I want to break into individual variables and I could use some help.
>
>  For example, one of the variables is called study designs.  Here are
>  some data from the study designs variable
>
>  Treatment|Randomized|Double-Blind|Placebo Control|Parallel Assignment|Safety/Efficacy Study
>  Prevention|Randomized|Open Label|Active Control|Parallel Assignment|Bio-equivalence Study
>  Prevention|Randomized|Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor)|Crossover Assignment
>  Randomized|Single Blind|Active Control|Parallel Assignment
>  Natural History|Cross-Sectional|Case Control|Prospective Study
>  Treatment|Randomized|Open Label|Active Control|Parallel Assignment|Efficacy Study
>  Treatment|Randomized|Double-Blind|Placebo Control|Single Group Assignment|Safety/Efficacy Study
>  Treatment|Randomized|Open Label|Placebo Control|Parallel Assignment|Safety/Efficacy Study
>  Treatment|Randomized|Double-Blind|Active Control|Parallel Assignment|Safety/Efficacy Study
>  Prevention|Randomized|Double-Blind|Placebo Control|Parallel Assignment|Safety/Efficacy Study
>  Treatment|Randomized|Single Blind (Investigator)|Placebo Control|Parallel Assignment
>  Treatment|Randomized|Open Label|Active Control|Parallel Assignment|Efficacy Study
>
>  What I want to do is parse this text using the "|" into individual variables
>
>  So the first case would be
>  des1       des2        des3          des4             des5                 des6
>  Treatment  Randomized  Double-Blind  Placebo Control  Parallel Assignment  Safety/Efficacy Study
>
>  I can think of a brute force way where I save this variable and my id
>  variable, change | to a comma, output as text, read the text into
>  stata as a comma separated file, and then merge it back into my
>  data.  Sounds silly, but perhaps it is the easiest.  Any other ideas?
>
>  Thanks,
>
>  Todd
>
>  *
>  *   For searches and help try:
>  *   http://www.stata.com/support/faqs/res/findit.html
>  *   http://www.stata.com/support/statalist/faq
>  *   http://www.ats.ucla.edu/stat/stata/
>


