Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <n.j.cox@durham.ac.uk> |
To | "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu> |
Subject | st: RE: Splitting string variables "advanced" |
Date | Wed, 18 Jan 2012 15:09:54 +0000 |
This is a bit of a kludge, but the technique may help. (I tried regex approaches, including -moss- (SSC), without success, but there may well be a better solution that way.)

    gen copy = itrim(myvar)
    gen isnum = .
    local todo 1
    quietly while `todo' {
        * is the 4th character after the first remaining ";" a digit?
        replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
        * replace that ";" by "@" (split here) or "," (do not split here)
        replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
        * loop until no observation contains a ";"
        count if strpos(copy, ";")
        local todo = r(N)
    }

The logic of this is

1. -itrim()- first. It shouldn't make anything more difficult, and it might help.

2. "Number" for you evidently means something beginning with something like "US2" or "EP1". So I look for a numeric character in a fixed position: the fourth character after the semicolon, which assumes a single space followed by a two-letter code.

3. Depending on what is found, I replace ";" by "@" or ",".

4. Later I would -split- on "@". Clearly you should use a character not otherwise present, which you can check with -count if strpos(myvariable, "@")-.

Nick
n.j.cox@durham.ac.uk

Seliger Florian

I want to split string variables with values such as:

EP1763200-A1 -- EP1530342-A2 ; US2004199663-A1 HORVITZ E J (HORV-Individual); APACIBLE J T (APAC-Individual) HORVITZ E J, APACIBLE J T; US2004254998-A1 MICROSOFT CORP (MICT) HORVITZ E J

At the end, there should be several variables and their values should look as follows:

Var1 EP1763200-A1 -- EP1530342-A2
Var2 US2004199663-A1 HORVITZ E J (HORV-Individual); APACIBLE J T (APAC-Individual) HORVITZ E J, APACIBLE J T
Var3 US2004254998-A1 MICROSOFT CORP (MICT) HORVITZ E J

My problem is the following: I used

    split cp, p(" ; " "; ")

but in this case, Stata will also split Var2 because of the semicolon. I'm searching for a way to tell Stata to keep the value of Var2 in one variable if the semicolon comes before a name. Stata should split the variable only if there is a number after the semicolon. Alternatively, I would like to delete the confusing semicolon in a first step and then ask Stata to split the variable with split cp, p(" ; " "; ").
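For readers who want to see steps 1 to 4 of the reply run end to end, here is a minimal sketch rather than a definitive solution. It assumes the variable is called cp, as in the question, uses the single example value quoted there as test data, and assumes "@" does not occur in the data. The stub name Var passed to -split- and the final -trim()- pass are illustrative choices that are not in the original reply.

```stata
* Sketch only: variable name cp and the example value are taken from the question
clear
input str200 cp
"EP1763200-A1 -- EP1530342-A2 ; US2004199663-A1 HORVITZ E J (HORV-Individual); APACIBLE J T (APAC-Individual) HORVITZ E J, APACIBLE J T; US2004254998-A1 MICROSOFT CORP (MICT) HORVITZ E J"
end

* The placeholder "@" must not already be present (assumption to verify)
count if strpos(cp, "@")
assert r(N) == 0

* The loop from the reply, applied to cp
gen copy = itrim(cp)
gen byte isnum = .
local todo 1
quietly while `todo' {
    replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
    replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
    count if strpos(copy, ";")
    local todo = r(N)
}

* Step 4: split on the placeholder; Var is an illustrative stub name
split copy, parse("@") generate(Var)

* Remove the blanks left around each "@" (an addition, not in the reply)
foreach v of varlist Var* {
    replace `v' = trim(`v')
}

list Var*
```

With this approach the semicolon between the two names in the second piece ends up as a comma, as step 3 implies. If the original semicolon has to be kept, a different placeholder that is swapped back after -split- would be needed.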