



Re: st: extracting substrings from string, with irregular patterns


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: extracting substrings from string, with irregular patterns
Date   Thu, 16 Aug 2012 19:40:23 +0100

Here is a sketch of an approach (look, no regex). None of this code has
been tested, either by computer or by anybody reading it.

The -city- comes after the last comma, so reverse the string to make
that comma the first one, which is easier to find:

gen city = reverse(station)
replace city = substr(city, 1, strpos(city, ","))
replace city = reverse(city)

Now blank out -city- (with its leading comma) in -station-

replace station = subinstr(station, city, "", .)

Now zap the initial comma in -city-

replace city = substr(city, 2, .)
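As a variant of the steps above (not the code as posted), the last comma can be located by position arithmetic and the string cut there. This is only a sketch: the sample rows are copied from the question, -lastcomma- is a helper variable introduced here for illustration, and rows without a comma are assumed to have no city and are left alone.

```stata
* Sample rows copied from the original question
clear
input str60 station
"COPEC AV. 11 DE SEPTIEMBRE 000,Tocopilla"
"PETROBRAS Av. Antonio Rendic 6850,Antofagasta"
"TERPEL Basilio Urrutia esq. Janequeo 312,Lautaro"
"Sin Bandera carrera 348,Lautaro"
"Sin Bandera Isabel Riquielme 403,Villarrica"
end

* The first comma in the reversed string is the last comma in the original;
* lastcomma (hypothetical helper) stays missing when there is no comma
gen lastcomma = length(station) - strpos(reverse(station), ",") + 1 if strpos(station, ",")

* City is everything after the last comma; station keeps everything before it
gen city = trim(substr(station, lastcomma + 1, .)) if lastcomma < .
replace station = trim(substr(station, 1, lastcomma - 1)) if lastcomma < .
drop lastcomma

list station city, clean
```

Cutting by position avoids -subinstr()-, so a city name that also happens to appear inside the address is not touched.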

Now let's try the name.

gen name = "Petrobras" if substr(lower(station), 1, 9) == "petrobras"
replace name = "Copec" if substr(lower(station), 1, 5) == "copec"

You will need to add similar statements for the other names (Terpel,
Sin Bandera and whatever else occurs in the full dataset).
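With many names, one statement per name gets tedious. A loop could do the same job. The sketch below assumes the list of names is known in advance (only the four from the question are shown) and that -station- has already had the city removed. A bare prefix match would also catch a longer name that merely starts with one of these, so check the results.

```stata
* Sample rows from the question, with the city already stripped off
clear
input str60 station
"COPEC AV. 11 DE SEPTIEMBRE 000"
"PETROBRAS Av. Antonio Rendic 6850"
"TERPEL Basilio Urrutia esq. Janequeo 312"
"Sin Bandera carrera 348"
"Sin Bandera Isabel Riquielme 403"
end

gen name = ""

* Compare the start of station with each name, ignoring case
foreach n in "Copec" "Petrobras" "Terpel" "Sin Bandera" {
    replace name = "`n'" if substr(lower(station), 1, length("`n'")) == lower("`n'")
}

* Anything listed here still needs a name added to the loop
list station if name == ""
```

The final -list- shows which observations are still unmatched, so the list of names can be extended until nothing is left.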

Once you have non-empty -name- on all observations, you can remove it
from your main variable to leave the address as the residue.
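One way to take that last step is sketched below, assuming -station- no longer contains the city and -name- is filled in as above; -address- is a new variable introduced here. Because -name- may differ in case from the text in -station- (Copec versus COPEC), the name is skipped by length rather than removed with -subinstr()-.

```stata
* Three sample rows with city removed and name already assigned
clear
input str45 station str12 name
"COPEC AV. 11 DE SEPTIEMBRE 000" "Copec"
"PETROBRAS Av. Antonio Rendic 6850" "Petrobras"
"Sin Bandera carrera 348" "Sin Bandera"
end

* The name fills the first length(name) characters; the address is the rest
gen address = trim(substr(station, length(name) + 1, .)) if name != ""

list name address, clean
```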

Nick

On Thu, Aug 16, 2012 at 7:27 PM, Fernando Luco
<flucoestatalist@gmail.com> wrote:

> I have a dataset with one variable that contains the name of a gas
> station, the address and the city in which the station is located. I
> would like to separate all these in three different variables, name,
> address and city. I have tried to use the regex machinery but I
> haven't been successful. The data looks as follows:
>
> COPEC AV. 11 DE SEPTIEMBRE 000,Tocopilla
> PETROBRAS Av. Antonio Rendic 6850,Antofagasta
> TERPEL Basilio Urrutia esq. Janequeo 312,Lautaro
> Sin Bandera carrera 348,Lautaro
> Sin Bandera Isabel Riquielme 403,Villarrica
>
> In the example the names are COPEC, PETROBRAS, TERPEL and Sin Bandera,
> so some names are all uppercase and others are mixed case. The
> addresses are written as: AV. 11 DE SEPTIEMBRE 000, Av. Antonio Rendic
> 6850, Basilio Urrutia esq. Janequeo 312, carrera 348 and Isabel
> Riquielme 403. Finally, the city is what follows the comma, so
> Tocopilla, Antofagasta, Lautaro and Villarrica.
>
> What I would like to do, even if it requires several steps, is to have
> the name, address and city each as a different variable. I have tried
> to split everything into substrings at the spaces but it didn't work. I
> also tried first recovering names in uppercase letters but it also
> didn't work.
>
> Finally, I have 1,600 stations so I would like to avoid doing this one
> by one. Any suggestions?

