st: AW: RE: Splitting string variables "advanced"

From	"Seliger Florian" <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	st: AW: RE: Splitting string variables "advanced"
Date	Thu, 19 Jan 2012 09:12:10 +0000

Thank you, Nick. That helped a lot.


Best,
__________
 
Florian Seliger

ETH Zurich
KOF Swiss Economic Institute
WEH C 5
Weinbergstrasse 35
8092 Zurich, Switzerland
 
[email protected]
www.kof.ethz.ch
 


-----Original Message-----
From: [email protected] [mailto:[email protected]] On behalf of Nick Cox
Sent: Wednesday, 18 January 2012 16:10
To: '[email protected]'
Subject: st: RE: Splitting string variables "advanced"

This is a bit of a kludge but the technique may help. (I tried regex approaches including -moss- (SSC) without success, but there may well be a better solution that way.) 

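* collapse runs of internal blanks so the digit sits at a fixed offset
gen copy = itrim(myvar)

gen isnum = .

local todo 1

* each pass handles the first remaining ";" in every observation
quietly while `todo' {
    * is the 4th character after the ";" a digit (as in "; US2")?
    replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
    * ";" before a number becomes "@"; ";" before a name becomes ","
    replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
    * repeat until no observation contains a ";"
    count if strpos(copy, ";")
    local todo = r(N)
}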

The logic of this is 

1. -itrim()- first. It shouldn't make anything more difficult, and it might help. 

2. "Number" for you evidently means something beginning with a prefix like "US2" or "EP1". So I look for a numeric character at a fixed position after the semicolon.

3. Depending on what is found, I replace ";" by "@" or ",". 

4. Later I would -split- on "@". Clearly you should use a character that is not otherwise present, which you can check beforehand with -count if strpos(myvar, "@")-.
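
The four steps can be put together as in the sketch below, which is only an illustration and not part of the original reply. It assumes a string variable named myvar holding the sample value from the question. The stub Var and the local macro newvars are names chosen here for the example, and the final trim() pass is an extra step to drop the blank left on either side of each "@".

```stata
clear
input str244 myvar
"EP1763200-A1 -- EP1530342-A2   ;  US2004199663-A1   HORVITZ E J (HORV-Individual);  APACIBLE J T (APAC-Individual)   HORVITZ E J,  APACIBLE J T;  US2004254998-A1   MICROSOFT CORP (MICT)   HORVITZ E J"
end

* step 4 check: the placeholder "@" must not already occur in the data
assert strpos(myvar, "@") == 0

* step 1: collapse runs of internal blanks
gen copy = itrim(myvar)
gen isnum = .
local todo 1

* steps 2 and 3: replace each ";" by "@" (number follows) or "," (name follows)
quietly while `todo' {
    replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
    replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
    count if strpos(copy, ";")
    local todo = r(N)
}

* step 4: split on the placeholder; "Var" is an assumed stub name
split copy, parse("@") generate(Var)
local newvars `r(varlist)'

* extra step (an assumption, not in the original): trim each piece
foreach v of local newvars {
    replace `v' = trim(`v')
}

list `newvars', noobs
```

On the sample value this should produce Var1, Var2 and Var3. The semicolon before "APACIBLE" in Var2 ends up as a comma, and multiple blanks are reduced to one because of -itrim()-.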

Nick 
[email protected] 

Seliger Florian

I want to split string variables with values such as:

EP1763200-A1 -- EP1530342-A2   ;  US2004199663-A1   HORVITZ E J (HORV-Individual);  APACIBLE J T (APAC-Individual)   HORVITZ E J,  APACIBLE J T;  US2004254998-A1   MICROSOFT CORP (MICT)   HORVITZ E J

 

At the end, there should be several variables and their values should look as follows:

 

Var1
EP1763200-A1 -- EP1530342-A2   

Var2
US2004199663-A1   HORVITZ E J (HORV-Individual);  APACIBLE J T (APAC-Individual)   HORVITZ E J,  APACIBLE J T

Var3
US2004254998-A1   MICROSOFT CORP (MICT)   HORVITZ E J

 

My problem is the following: I used 

split cp, p(" ; " "; ")

but in this case Stata also splits Var2, because of the semicolon inside it.

I am looking for a way to tell Stata to keep the value of Var2 in one variable when the semicolon comes before a name.

Stata should split the variable only if a number follows the semicolon.

Alternatively, I would like to delete the confusing semicolon in a first step and then ask Stata to split the variable with split cp, p(" ; " "; ").


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/