
From: Todd Wagner <twagner@stanford.edu>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: substringing long, varying length text variables into individual variables
Date: Wed, 02 Apr 2008 16:26:45 -0700

Thanks, Tom.

Todd

At 02:47 PM 4/2/2008, you wrote:

Hi Todd,

I use Stata 8 and I cannot touch strings with more than 80 chars. If
I'm not mistaken this is not the case for newer releases... so I use a
dummy example with fewer than 80 chars below... I presume this would
work in Stata 9/10 - if not, don't shoot me.

e.g., use

1 Str A|Str B|Str C|Str D

2 Str A|Str B|Str D

3 Str D

A. change the spaces to underscores in a clone variable, and then the

"|" to spaces. The -wordcount- and -word- functions of Stata use

spaces to parse (if someone knows how to use a different separator in

these functions, this step is superfluous.)

. gen newString = subinstr(originalString," ", "_",.)

. replace newString = subinstr(newString,"|", " ",.)

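B. get the maximum number of "words" per record, store it in a local
macro (howMany is a variable, so it cannot be used directly as the loop
limit), and loop up to that maximum

. gen howMany = wordcount(newString)
. summ howMany
. local maxWords = r(max)
. forval i = 1/`maxWords' {
.     gen des_`i' = word(newString, `i')
. }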

This gives you

des_1 des_2 des_3 des_4

Str_A Str_B Str_C Str_D

Str_A Str_B Str_D

Str_D
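On the separator question raised in step A: newer releases ship a -split- command that takes its own parse string, so the underscore detour may not be needed at all. The following is only a sketch on the toy data above, written as do-file lines (no "." prompt); the variable name originalString and the stub des_ are taken from the steps above, and whether -split- is available in a Stata 8 installation is something to check locally.

```stata
* Toy data mirroring the three example records above
clear
input str40 originalString
"Str A|Str B|Str C|Str D"
"Str A|Str B|Str D"
"Str D"
end

* -split- parses on "|" directly, so spaces inside each piece survive
* and no underscore substitution is required
split originalString, parse("|") generate(des_)

* New variables are named des_1, des_2, ... up to the longest record
list des_*
```

Because the pieces keep their spaces, step C below would become unnecessary with this route.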

C. You can easily restore the spaces in the strings in des_1 to des_4

by changing the underscores back to spaces.
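Step C could look something like the loop below. It is a minimal sketch on made-up values, assuming the new variables all share the des_ prefix used above. One caveat: any underscore that was already present in the original text would also be turned into a space.

```stata
* Toy stand-in for the des_ variables produced in step B
clear
input str10 des_1 str10 des_2
"Str_A" "Str_B"
"Str_D" ""
end

* Change every underscore back to a space in each des_ variable
foreach v of varlist des_* {
    replace `v' = subinstr(`v', "_", " ", .)
}

list
```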

D. However, are you sure you want this as a final step? If you want to

have e.g. 4 dummies (one for Str_A, one for Str_B etc.):

str_a str_b str_c str_d

1 1 1 1

1 1 0 1

0 0 0 1

you would have to continue by reshaping long per record and then
reshaping wide again per content of the string variable... Some
-encode-ing will probably be necessary along the way too...
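If the set of categories is small and known in advance, a more direct route than the reshape round trip is to test for each category in the original string. This is a sketch under that assumption, not the reshape approach described above; it uses strpos(), which I believe was called index() in older releases such as Stata 8, and the helper variable padded is a name made up for the example.

```stata
* Toy data mirroring the three example records above
clear
input str40 originalString
"Str A|Str B|Str C|Str D"
"Str A|Str B|Str D"
"Str D"
end

* Wrap the string in delimiters so that a category only matches a
* complete piece, never part of a longer one
gen padded = "|" + originalString + "|"

* One 0/1 indicator per category: str_a, str_b, str_c, str_d
foreach L in A B C D {
    local l = lower("`L'")
    gen byte str_`l' = strpos(padded, "|Str `L'|") > 0
}

drop padded
list str_a str_b str_c str_d
```

On the toy data this should reproduce the 1/0 pattern shown in step D. With many or unknown categories, the reshape route is likely the better fit.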

That being said I'd do the "silly way" (using python or vim or sed) to

manipulate the strings outside Stata...

hth

tom

Todd Wagner <twagner@stanford.edu> wrote:

> Hi,

>

> I have data from a publicly available database

> (clinicaltrials.gov). This database has a number of text variables

> that I want to break into individual variables and I could use some help.

>

> For example, one of the variables is called study designs. Here are

> some data from the study designs variable

>

> Treatment|Randomized|Double-Blind|Placebo Control|Parallel Assignment|Safety/Efficacy Study
> Prevention|Randomized|Open Label|Active Control|Parallel Assignment|Bio-equivalence Study
> Prevention|Randomized|Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor)|Crossover Assignment
> Randomized|Single Blind|Active Control|Parallel Assignment
> Natural History|Cross-Sectional|Case Control|Prospective Study
> Treatment|Randomized|Open Label|Active Control|Parallel Assignment|Efficacy Study
> Treatment|Randomized|Double-Blind|Placebo Control|Single Group Assignment|Safety/Efficacy Study
> Treatment|Randomized|Open Label|Placebo Control|Parallel Assignment|Safety/Efficacy Study
> Treatment|Randomized|Double-Blind|Active Control|Parallel Assignment|Safety/Efficacy Study
> Prevention|Randomized|Double-Blind|Placebo Control|Parallel Assignment|Safety/Efficacy Study
> Treatment|Randomized|Single Blind (Investigator)|Placebo Control|Parallel Assignment
> Treatment|Randomized|Open Label|Active Control|Parallel Assignment|Efficacy Study

>

> What I want to do is parse this text using the "|" into individual variables

>

> So the first case would be

> des1       des2        des3          des4             des5                 des6
> Treatment  Randomized  Double-Blind  Placebo Control  Parallel Assignment  Safety/Efficacy Study

>

> I can think of a brute force way where I save this variable and my id

> variable, change | to a comma, output as text, read the text into

> stata as a comma separated file, and then merge it back into my

> data. Sounds silly, but perhaps it is the easiest. Any other ideas?

>

> Thanks,

>

> Todd

>

> *

> * For searches and help try:

> * http://www.stata.com/support/faqs/res/findit.html

> * http://www.stata.com/support/statalist/faq

> * http://www.ats.ucla.edu/stat/stata/

>


References:
st: substringing long, varying length text variables into individual variables
    From: Todd Wagner <twagner@stanford.edu>
Re: st: substringing long, varying length text variables into individual variables
    From: "Tom Trikalinos" <ttrikalin@gmail.com>
