Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Extracting Data
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: Extracting Data
Date: Fri, 22 Nov 2013 09:24:47 +0000

This sounds like a two-stage process. For example, you might use
-split- to split a variable containing the one or two names. ", Jr."
needs special treatment. I'd edit ", Jr." to "_Jr." and then edit back.
For "Principle" read "Principal" throughout.
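
A minimal sketch of that two-stage approach, assuming the contents of the parentheses are already in a string variable named principal (as in Robert's example quoted below). The names name1, name2, princ and asst are illustrative only, and only the ", Jr." suffix is protected; other suffixes would need the same treatment. Note that -split- creates a second variable only if at least one observation contains a comma after the suffix is protected, so the -rename- line assumes that is the case.

```stata
* Sketch only: sample values stand in for the extracted -principal- variable
clear
input str60 principal
"Susan Sims, John Riley"
"Judith  Taper"
"Jason Stevenson, Jill Harris "
" Robert Williams, Jr."
end

* Protect the suffix so that its comma is not treated as a separator
replace principal = subinstr(principal, ", Jr.", "_Jr.", .)

* Stage 1: split on commas; -split- creates name1, name2, ...
split principal, parse(",") generate(name)

* Stage 2: restore the suffix, then remove leading, trailing
* and repeated internal spaces
foreach v of varlist name* {
    replace `v' = subinstr(`v', "_Jr.", ", Jr.", .)
    replace `v' = itrim(trim(`v'))
}

* Illustrative names for the two resulting variables
rename (name1 name2) (princ asst)
list princ asst
```

Observations with a single name are left with an empty string in asst.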
Nick
[email protected]
On 22 November 2013 04:58, Becker Stein <[email protected]> wrote:
> Hi,
>
> I asked this question yesterday. I needed help creating a regex to
> extract data from a single string variable. Robert's solution was
> really helpful. I was able to generate the School District,
> School Name, and School Type variables. However, I run into problems
> trying to create the Principle and Assist. Principles variables.
> The command gen principal = regexs(1) if regexm(s,"\((.+)\)") returns
> all of the contents in the parentheses, but I need the contents before
> the comma to generate the
> principle name variable and the contents after the comma to generate
> the assist. principle name (if any). It gets a little complicated
> because sometimes the names themselves have commas in them (as in the
> case of Robert Williams, Jr.) I've pasted some sample data below.
>
>
> [School District] School Name (Principle, Asst. Principal) School Type
>
> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>
>
> Thanks,
> Becker
>
> -----Original Message-----
> From: Robert Picard <[email protected]>
> To: statalist <[email protected]>
> Sent: Wed, Nov 20, 2013 10:51 pm
> Subject: Re: st: Extracting Data
>
> Becker
>
> Here's one way to parse each variable using regex functions.
>
> Robert
>
> * ---------------- begin example -------------------------
> clear
> input str244 s
> "[Meadowfield] Park Sq (Susan Sims) Middle School"
> "[Somerset] Upton & Pride School (Judith Taper) El School"
> "[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
> end
>
> gen district = regexs(1) if regexm(s,"\[(.+)\]")
> gen sname = regexs(1) if regexm(s,"\](.+)\(")
> gen principal = regexs(1) if regexm(s,"\((.+)\)")
> gen stype = regexs(1) if regexm(s,"\)(.+)")
>
> list district sname principal stype
> * ---------------- end example ---------------------------
>
> On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <[email protected]>
> wrote:
>>
>>
>> -----Original Message-----
>> From: Becker Stein <[email protected]>
>> To: statalist <[email protected].>
>> Sent: Wed, Nov 20, 2013 9:23 pm
>> Subject: Help Extracting Data
>>
>> Hi,
>>
>> I'm trying to extract data from a single string variable, and I was
>> wondering how to create a regular expression that I can
>> use to do so. I've tried to create one just to extract the school
>> name, but to no avail. My data is set up as: [school district] name of
>> school (name of principle, name of assistant principle (*if any))
>> school type. Below are some examples.
>>
>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>
>> I would like to extract the school name, principle name and asst.
>> principle name as separate variables. Sometimes the names have special
>> characters such as an "&" (as in the case of Upton & Pride) or a ".",
>> and the administrators section may have only 1 name or 2 names
>> (separated by a comma). Also, some of the data in the brackets and
>> parentheses have extra spaces. I initially used the itrim function on
>> the variable, and it removed the extra spaces for the content outside
>> of the brackets and parentheses (i.e., school name and school type),
>> but it didn't work for content inside of them (school district and
>> principal names).
>> Thanks in advance for any/all help.
>>
>> Best,
>> Becker
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/