Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Extracting Data
From: Steve Samuels <[email protected]>
To: [email protected]
Subject: Re: st: Extracting Data
Date: Sat, 23 Nov 2013 15:19:40 -0500
I hadn't read Nick's post when I wrote mine. Both ideas follow the same
logic. I omitted the comma, but Nick's suggestion of creating a
temporary placeholder is superior. Here's a revised version which
incorporates Nick's idea. It also adds abbreviations with periods (".")
and retains these where present.
Steve
*********************Code Begins**************************
clear
input str244 s
"[Meadowfield] Park Sq (Susan Sims) Middle School"
"[Somerset] Upton & Pride School (Judith Taper, MA PhD) El School"
"[Temperly] Lakewood (Jason Stevenson Jr., B.A., Jill Harris, BA ) K-12"
"[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
end
local suffix "Jr. Sr. BA B.A. B.S. BS MS M.Sc. Ph.D. PhD"
gen p = regexs(1) if regexm(s,"\((.+)\)") /*Robert's code */
/* Now replace a single comma preceding each suffix with the placeholder "_" */
foreach x of local suffix {
    replace p = regexs(1)+"_"+regexs(3)+regexs(4)+regexs(5) ///
        if regexm(p,"(.*)(,)(.*)(`x')(.*)")
}
split p, p(",")
/* Restore the commas; a single pass per variable suffices */
replace p1 = subinstr(p1,"_",",",.)
replace p2 = subinstr(p2,"_",",",.)
list p1 p2
*******************Code Ends******************************
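One caveat with the pattern above: the (.*) between the comma and the suffix matches any text, and the periods in the suffixes act as regex wildcards, so a comma that separates two people can be replaced by mistake when a suffix appears later in the string. What follows is a minimal sketch of a stricter variant, not a tested replacement. The third sample row, the local macro pat, and the loop variable v are illustrative assumptions. Like the code above, it replaces only one comma per suffix in each observation.

```stata
clear
input str244 s
"[Meadowfield] Park Sq (Susan Sims) Middle School"
"[Temperly] Lakewood (Jason Stevenson Jr., B.A., Jill Harris, BA ) K-12"
"[Packard] W.E.B.Bos ( Bob Williams, Jr., Ann Lee BA) Middle School"
end

local suffix "Jr. Sr. BA B.A. B.S. BS MS M.Sc. Ph.D. PhD"
gen p = regexs(1) if regexm(s, "\((.+)\)")

foreach x of local suffix {
    * treat each period in the suffix as a literal character
    local pat = subinstr("`x'", ".", "[.]", .)
    * allow only blanks between the comma and the suffix
    replace p = regexs(1) + "_ " + regexs(2) + regexs(3) ///
        if regexm(p, "(.*), *(`pat')(.*)")
}

split p, parse(",")
foreach v of varlist `r(varlist)' {
    * restore the commas, then remove leading, trailing and repeated blanks
    replace `v' = itrim(trim(subinstr(`v', "_", ",", .)))
}
list p1 p2
```

Looping over r(varlist), which -split- leaves behind, avoids an error when no observation contains a second name and p2 is never created. Recent Stata releases document trim() and itrim() under the names strtrim() and stritrim().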
On Nov 22, 2013, at 4:24 AM, Nick Cox wrote:
This sounds like a two-stage process. For example, you might use
-split- to split a variable containing the one or two names. ", Jr."
needs special treatment. I'd edit ", Jr." to "_Jr." and then edit back.
For "Principle" read "Principal" throughout.
Nick
[email protected]
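When ", Jr." is the only problematic suffix, the two-stage idea can be sketched with subinstr() alone. This is a minimal illustration, not Nick's own code: the variable name p and the loop variable v are assumptions, and the placeholder "_Jr." must not already occur in the data.

```stata
clear
input str244 s
"[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
"[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
end

gen p = regexs(1) if regexm(s, "\((.+)\)")

* stage 1: hide the comma that belongs to the name
replace p = subinstr(p, ", Jr.", "_Jr.", .)

* split on the remaining commas, which separate the two people
split p, parse(",")

* stage 2: put the hidden comma back and trim stray blanks
foreach v of varlist `r(varlist)' {
    replace `v' = trim(subinstr(`v', "_Jr.", ", Jr.", .))
}
list p1 p2
```

The literal match keeps the first stage easy to audit, at the cost of needing one subinstr() call per suffix.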
On 22 November 2013 04:58, Becker Stein <[email protected]> wrote:
> Hi,
>
> I asked this question yesterday. I needed help creating a regex to
> extract data from a single string variable. Robert's solution was
> really helpful. I was able to generate the School District,
> School Name, and School Type variables. However, I run into problems
> trying to create the Principle and Assist. Principles variables.
> The gen principal =regexs(1) if regexm(s,"\((.+)\)") returns all of the
> contents in the parentheses, but I need the contents before the comma to
> generate the principle name variable and the contents after the comma to
> generate the assist. principle name (if any). It gets a little complicated
> because sometimes the names themselves have commas in them (as in the
> case of Robert Williams, Jr.). I've pasted some sample data below.
>
>
> [School District] School Name (Principle, Asst. Principal) School Type
>
> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
> [Somerset] Upton & Pride Day School (Judith Taper) Elementary School
> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>
>
> Thanks,
> Becker
>
> -----Original Message-----
> From: Robert Picard <[email protected]>
> To: statalist <[email protected]>
> Sent: Wed, Nov 20, 2013 10:51 pm
> Subject: Re: st: Extracting Data
>
> Becker
>
> Here's one way to parse each variable using regex functions.
>
> Robert
>
> * ---------------- begin example -------------------------
> clear
> input str244 s
> "[Meadowfield] Park Sq (Susan Sims) Middle School"
> "[Somerset] Upton & Pride School (Judith Taper) El School"
> "[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
> end
>
> gen district = regexs(1) if regexm(s,"\[(.+)\]")
> gen sname = regexs(1) if regexm(s,"\](.+)\(")
> gen principal = regexs(1) if regexm(s,"\((.+)\)")
> gen stype = regexs(1) if regexm(s,"\)(.+)")
>
> list district sname principal stype
> * ---------------- end example ---------------------------
>
> On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <[email protected]>
> wrote:
>>
>>
>> -----Original Message-----
>> From: Becker Stein <[email protected]>
>> To: statalist <[email protected].>
>> Sent: Wed, Nov 20, 2013 9:23 pm
>> Subject: Help Extracting Data
>>
>> Hi,
>>
>> I'm trying to extract data from a single string variable, and I was
>> wondering how to create a regular expression that I can
>> use to do so. I've tried to create one just to extract the school
>> name, but to no avail. My data is set up as: [school district] name of
>> school (name of principle, name of assistant principle (*if any))
>> school type. Below are some examples.
>>
>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>> [Somerset] Upton & Pride Day School (Judith Taper) Elementary School
>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>
>> I would like to extract the school name, principle name and asst.
>> principle name as separate variables. Sometimes the names have special
>> characters such as an "&" (as in the case of Upton & Pride) or a ".",
>> and the administrators section may have only 1 name or 2 names
>> (separated by a comma). Also, some of the data in the brackets and
>> parentheses have extra spaces. I initially used the itrim function on
>> the variable, and it removed the extra spaces for the content outside
>> of the brackets and parentheses (i.e., school name and school type),
>> but it didn't work for content inside of them (school district and
>> principal names).
>> Thanks in advance for any/all help.
>>
>> Best,
>> Becker
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/