Re: st: extracting substrings from string, with irregular patterns


From   Fernando Luco <flucoestatalist@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: extracting substrings from string, with irregular patterns
Date   Thu, 16 Aug 2012 18:26:56 -0500

Thanks Nick,

Fernando

On Thu, Aug 16, 2012 at 1:40 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> Here is a sketch of an approach (look, no regex). No code has been
> tested by computer or anybody reading it.
>
> The -city- comes after the last comma, so reverse the string to make it
> easier to extract.
>
> gen city = reverse(station)
> replace city = substr(city, 1, strpos(city, ","))
> replace city = reverse(city)
>
> Now blank out -city-
>
> replace station = subinstr(station, city, "", .)
>
> Now zap the initial comma in -city-
>
> replace city = substr(city, 2, .)
>
> Now let's try the name.
>
> gen name = "Petrobras" if substr(lower(station), 1, 9) == "petrobras"
> replace name = "Copec" if substr(lower(station), 1, 5) == "copec"
>
> You are going to need to add similar statements.
>
> Once you have non-empty -name- on all observations, you can remove it
> from your main variable to leave the address as the residue.
>
> Nick
>
> On Thu, Aug 16, 2012 at 7:27 PM, Fernando Luco
> <flucoestatalist@gmail.com> wrote:
>
>> I have a dataset with one variable that contains the name of a gas
>> station, the address and the city in which the station is located. I
>> would like to separate these into three different variables: name,
>> address and city. I have tried to use the regex machinery but I
>> haven't been successful. The data look as follows:
>>
>> COPEC AV. 11 DE SEPTIEMBRE 000,Tocopilla
>> PETROBRAS Av. Antonio Rendic 6850,Antofagasta
>> TERPEL Basilio Urrutia esq. Janequeo 312,Lautaro
>> Sin Bandera carrera 348,Lautaro
>> Sin Bandera Isabel Riquielme 403,Villarrica
>>
>> In the example the names are COPEC, PETROBRAS, TERPEL and Sin Bandera,
>> so some names are all uppercase and others are mixed case. The
>> addresses are written as: AV. 11 DE SEPTIEMBRE 000, Av. Antonio Rendic
>> 6850, Basilio Urrutia esq. Janequeo 312, carrera 348 and Isabel
>> Riquielme 403. Finally, the city is what follows the comma, so
>> Tocopilla, Antofagasta, Lautaro and Villarrica.
>>
>> What I would like to do, even if it requires several steps, is to have
>> the name, address and city each as a different variable. I have tried
>> splitting everything into substrings at the spaces, but it didn't work. I
>> also tried first recovering the names in uppercase letters, but that
>> didn't work either.
>>
>> Finally, I have 1,600 stations so I would like to avoid doing this one
>> by one. Any suggestions?
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
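The steps Nick describes can be collected into one do-file sketch, run here on the five example rows from the original question. This is a hedged illustration, not Nick's code: the helper variables rev and rest and the loop over brand names are assumptions added for the example, and the brand list covers only the four names shown in the thread, so the full set of 1,600 stations will need more entries.

```stata
* Sketch only: the five example rows from the thread.
clear
input str60 station
"COPEC AV. 11 DE SEPTIEMBRE 000,Tocopilla"
"PETROBRAS Av. Antonio Rendic 6850,Antofagasta"
"TERPEL Basilio Urrutia esq. Janequeo 312,Lautaro"
"Sin Bandera carrera 348,Lautaro"
"Sin Bandera Isabel Riquielme 403,Villarrica"
end

* City: everything after the last comma.
* Reversing the string turns the last comma into the first one.
gen rev = reverse(station)
gen city = reverse(substr(rev, 1, strpos(rev, ",") - 1))

* Everything before the last comma is name plus address (rest is a helper).
gen rest = reverse(substr(rev, strpos(rev, ",") + 1, .))
drop rev

* Name: compare the start of the string with each known brand,
* ignoring case. The brand list is an assumption based on the examples.
gen name = ""
foreach b in "COPEC" "PETROBRAS" "TERPEL" "Sin Bandera" {
    replace name = "`b'" if name == "" & ///
        lower(substr(rest, 1, length("`b'"))) == lower("`b'")
}

* Address: the residue once the name is removed from the front.
gen address = trim(substr(rest, length(name) + 1, .))
drop rest

list name address city, noobs
```

Observations whose name is still empty after the loop point to brands missing from the list; count if name == "" is a quick check before trusting the address variable.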