st: AW: RE: Splitting string variables "advanced"

From   "Seliger Florian" <>
To   "''" <>
Subject   st: AW: RE: Splitting string variables "advanced"
Date   Thu, 19 Jan 2012 09:12:10 +0000

Thank you, Nick. That helped a lot.

Florian Seliger

ETH Zurich
KOF Swiss Economic Institute
Weinbergstrasse 35
8092 Zurich, Switzerland

-----Original Message-----
From: [] On Behalf Of Nick Cox
Sent: Wednesday, 18 January 2012 16:10
To: ''
Subject: st: RE: Splitting string variables "advanced"

This is a bit of a kludge but the technique may help. (I tried regex approaches including -moss- (SSC) without success, but there may well be a better solution that way.) 

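gen copy = itrim(myvar)

gen isnum = .

local todo 1
quietly while `todo' {
	* is the character 4 places after the first remaining ";" a digit?
	replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
	* ";" before a number becomes "@"; ";" before a name becomes ","
	replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
	count if strpos(copy, ";")
	local todo = r(N)
}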

The logic of this is 

1. -itrim()- first. It shouldn't make anything more difficult, and it might help. 

2. "Number" for you evidently means something beginning with a string such as "US2" or "EP1". So I look for a numeric character in a certain position.

3. Depending on what is found, I replace ";" by "@" or ",". 

4. Later I would -split- on "@". Clearly you should use a character not otherwise present, which you can check beforehand with -count if strpos(myvar, "@")-.
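
Putting the four steps together, here is a minimal sketch run on the example value from the original question. The names myvar, copy and isnum follow the code above; the stub part and the one-row toy dataset are assumptions made only for illustration.

```stata
clear
input str244 myvar
"EP1763200-A1 -- EP1530342-A2   ;  US2004199663-A1   HORVITZ E J (HORV-Individual);  APACIBLE J T (APAC-Individual)   HORVITZ E J,  APACIBLE J T;  US2004254998-A1   MICROSOFT CORP (MICT)   HORVITZ E J"
end

* step 4 check: the placeholder "@" must not already occur in the data
count if strpos(myvar, "@")
assert r(N) == 0

* step 1: collapse runs of internal blanks to a single blank
gen copy = itrim(myvar)
gen byte isnum = .

* steps 2 and 3: handle one ";" per pass until none is left
local todo 1
quietly while `todo' {
    replace isnum = !missing(real(substr(copy, strpos(copy, ";") + 4, 1)))
    replace copy = subinstr(copy, ";", cond(isnum, "@", ","), 1)
    count if strpos(copy, ";")
    local todo = r(N)
}

* step 4: split on the placeholder, then strip blanks left around each piece
split copy, parse("@") generate(part)
foreach v of varlist part* {
    replace `v' = trim(`v')
}
list part*
```

On this value the sketch should produce three variables, with the whole name list kept together in part2. Note that the semicolon between the two names is turned into a comma; if it has to survive as a semicolon, a second placeholder that is swapped back after -split- would be one way to do it.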


Seliger Florian

I want to split string variables with values such as:

EP1763200-A1 -- EP1530342-A2   ;  US2004199663-A1   HORVITZ E J (HORV-Individual);  APACIBLE J T (APAC-Individual)   HORVITZ E J,  APACIBLE J T;  US2004254998-A1   MICROSOFT CORP (MICT)   HORVITZ E J


At the end, there should be several variables and their values should look as follows:


EP1763200-A1 -- EP1530342-A2   

US2004199663-A1   HORVITZ E J (HORV-Individual);  APACIBLE J T (APAC-Individual)   HORVITZ E J,  APACIBLE J T



My problem is the following: I used 

split cp, p(" ; " "; ")

but in this case, Stata also splits the second value (Var2) at the semicolon between the two names.
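
The over-splitting can be reproduced on the example value. This is only a sketch: the one-row dataset is made up for illustration, with cp as the variable name used in the command above.

```stata
clear
input str244 cp
"EP1763200-A1 -- EP1530342-A2   ;  US2004199663-A1   HORVITZ E J (HORV-Individual);  APACIBLE J T (APAC-Individual)   HORVITZ E J,  APACIBLE J T;  US2004254998-A1   MICROSOFT CORP (MICT)   HORVITZ E J"
end

* every semicolon followed by a blank is treated as a separator
split cp, p(" ; " "; ")

* number of new variables created by -split-
display r(nvars)
list cp1-cp4
```

Here -split- should cut at all three semicolons, so the name list ends up spread over cp2 and cp3 instead of staying in one variable.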

I'm looking for a way to tell Stata to keep the value of Var2 in one variable when a semicolon comes before a name.

Stata should split the variable only if a number follows the semicolon.

Alternatively, I would like to delete the confusing semicolon in a first step and then ask Stata to split the variable with split cp, p(" ; " "; ").
