[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
"Nick Cox" <n.j.cox@durham.ac.uk> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
RE: st: -word()- with non space separator |

Date |
Thu, 24 Sep 2009 17:13:49 +0100 |

Yesterday Jeph Herrin asked a question that he, Martin Weiss and I found an interesting challenge. Subtracting various minor hiccups and histrionics, and adding a few glosses, a "rational reconstruction" of the thread follows that is to be considered a joint product from us all. We hope that it may prove interesting or useful to others. Nick n.j.cox@durham.ac.uk Problem ======= I have a dataset in which many variables are structured in a way that is difficult to handle. If a question has multiple checkboxes as possible answers, the response is stored as a string, with a number indicating each box checked and numbers separated by colons. Thus in one simple example myvar 1:2:3:5:6:7:8:9 1:2:3:6 1:2:3:4:5:7:8:9 1:2:3:5:7:9 1:2:3:5:7:8:9 2:3:4:6:9 1:2:3:5:6:7:8:9 1:2:7:8:9 7:9 This variable takes 9 values. I want to map the information to 9 different indicator variables, myvar_1-myvar_9, each indicating whether that number was selected. -split- does not help much, because of the differing number of values per string. That is, it produces myvar_1 which equals "7" for the last observation. So I am looking for a way to check whether a given string contains a given integer. A loop using -strpos()- would work with this example forv i=1/9 { gen byte myvar_`i'= strpos(myvar, "`i'") > 0 } but the code above would fall over with strings with two or more digits, as, for example, searching for "1" returns true even if there is only "11". Solutions ========= 1. Use regular expressions. forv i=1/9 { gen byte myvar_`i'= regexm(myvar,"^`i'$|^`i':|:`i':|:`i'$") } In words, this looks for a specified string on its own, OR at the start of the value followed by a colon, OR such a string between colons, OR such a string at the end of the value preceded by a colon. 9 here could be 29 or 39 or whatever is the highest indicator variable you will need. 2. You could use -strpos()-, but you need to work backwards and ensure That, once seen, longer strings are zapped to avoid being found later and acting as false positives. clonevar work = myvar qui forval i = 29(-1)1 { gen myvar_`i' = strpos(work, "`i'") > 0 replace work = subinstr(work, "`i'", "", .) } We work with a copy of -myvar- because we are going to change the data. A test of this method is that at the end -work- should contain only colons. As above, 29 is just an example, as whatever highest indicator variable is needed. 3. You could use -split-, but you need some processing afterwards. split myvar, generate(comp) parse(:) destring egen rowmax = rowmax(comp*) su rowmax, meanonly forv i=1/`r(max)'{ egen byte my`i' = anymatch(comp*), values(`i') } A merit of this solution is that it automatically finds the highest number specified (as the column maximum of the row maxima). Naturally, if there were 32 boxes but no respondent picked 30, 31, 32, that would lie beyond the reach of this code. But if we knew in advance that there were 32 boxes that respondents could possibly check, we could cut the code down to split myvar, generate(comp) parse(:) destring forv i=1/32 { egen byte my`i' = anymatch(comp*), values(`i') } Morals ====== There's more than one way to do it. Know loops, functions, -egen- functions and data management commands such as -split- and -clonevar-. -regexm()- and friends are there when all else fails. * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: -word()- with non space separator***From:*Jeph Herrin <junk@spandrel.net>

**Re: st: -word()- with non space separator***From:*Jeph Herrin <junk@spandrel.net>

**RE: st: -word()- with non space separator***From:*"Nick Cox" <n.j.cox@durham.ac.uk>

**Re: st: -word()- with non space separator***From:*Jeph Herrin <junk@spandrel.net>

**RE: st: -word()- with non space separator***From:*"Nick Cox" <n.j.cox@durham.ac.uk>

**RE: st: -word()- with non space separator***From:*"Martin Weiss" <martin.weiss1@gmx.de>

**RE: st: -word()- with non space separator***From:*"Nick Cox" <n.j.cox@durham.ac.uk>

**RE: st: -word()- with non space separator***From:*"Martin Weiss" <martin.weiss1@gmx.de>

**RE: st: -word()- with non space separator***From:*"Nick Cox" <n.j.cox@durham.ac.uk>

**RE: st: -word()- with non space separator***From:*"Martin Weiss" <martin.weiss1@gmx.de>

**RE: st: -word()- with non space separator***From:*"Nick Cox" <n.j.cox@durham.ac.uk>

- Prev by Date:
**RE: st: testing across different sub-samples** - Next by Date:
**Re: st: testing across different sub-samples** - Previous by thread:
**RE: st: -word()- with non space separator** - Next by thread:
**Re: st: -word()- with non space separator** - Index(es):

© Copyright 1996–2015 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |