
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Extracting Data


From   Steve Samuels <[email protected]>
To   [email protected]
Subject   Re: st: Extracting Data
Date   Sat, 23 Nov 2013 15:19:40 -0500

I hadn't read Nick's post when I wrote mine. Both ideas follow the same
logic: my version dropped the comma, whereas Nick's suggestion of a
temporary placeholder preserves it and is superior. Here's a revised
version that incorporates Nick's idea. It also adds abbreviations with
periods (".") to the suffix list and retains the periods where present.

Steve

*********************Code Begins**************************
clear
input str244 s
"[Meadowfield] Park Sq (Susan Sims) Middle School"
"[Somerset] Upton & Pride School (Judith Taper, MA PhD) El School"
"[Temperly] Lakewood (Jason Stevenson Jr., B.A., Jill Harris, BA ) K-12"
"[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
end

local suffix "Jr. Sr. BA  B.A. B.S. BS MS M.Sc. Ph.D. PhD"
gen p = regexs(1) if regexm(s,"\((.+)\)") /*Robert's code */

/* Replace the comma before each suffix with a temporary "_" placeholder */
foreach x of local suffix {
    replace p = regexs(1)+"_"+regexs(3)+regexs(4)+regexs(5) ///
        if regexm(p,"(.*)(,)(.*)(`x')(.*)")
}

/* Split on the remaining comma, then restore the protected commas */
split p, p(",")
replace p1 = subinstr(p1,"_",",",.)
replace p2 = subinstr(p2,"_",",",.)
list p1 p2
*******************Code Ends******************************
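
The original question also mentioned extra spaces inside the parentheses. As a minimal sketch (not part of the code above), the split names could be cleaned afterwards with -trim()- and -itrim()-. The two-row dataset here is made up for illustration and simply mimics the kind of values p1 and p2 can hold.

```stata
clear
input str40 p1 str40 p2
" Bob  Williams, Jr."       ""
"Jason Stevenson Jr., B.A." " Jill Harris, BA "
end

* itrim() collapses runs of internal blanks to one blank;
* trim() removes leading and trailing blanks
foreach v of varlist p1 p2 {
    replace `v' = trim(itrim(`v'))
}
list p1 p2
```

Applying both functions after the split keeps the splitting step itself unchanged.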



On Nov 22, 2013, at 4:24 AM, Nick Cox wrote:

This sounds like a two-stage process. For example, you might use
-split- to split a variable containing the one or two names. ", Jr."
needs special treatment. I'd edit ", Jr." to "_Jr." and then edit back.
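
A minimal sketch of this two-stage idea, using -subinstr()- for the edits and two names taken from the sample data. It assumes "_" never occurs in a real name; the variable stub "name" is chosen here for illustration only.

```stata
clear
input str60 p
"Susan Sims, John Riley"
"Robert Williams, Jr."
end

* Stage 1: protect the comma that belongs to the name itself
replace p = subinstr(p, ", Jr.", "_Jr.", .)

* Stage 2: split on the remaining commas, then edit the placeholder back
split p, parse(",") generate(name)
foreach v of varlist name* {
    replace `v' = trim(subinstr(`v', "_Jr.", ", Jr.", .))
}
list name*
```

Other suffixes (for example ", Sr.") would need the same treatment before the split.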

For "Principle" read "Principal" throughout.
Nick
[email protected]


On 22 November 2013 04:58, Becker Stein <[email protected]> wrote:
> Hi,
> 
> I asked this question yesterday. I needed help creating a regex to
> extract data from a single string variable. Robert's solution was
> really helpful. I was able to generate the School District,
> School Name, and School Type variables. However, I run into problems
> trying to create the Principle and Assist. Principles variables.
> -gen principal = regexs(1) if regexm(s,"\((.+)\)")- returns all of the
> contents in the parentheses, but I need the contents before the comma
> to generate the
> principle name variable and the contents after the comma to generate
> the assist. principle name (if any). It gets a little complicated
> because sometimes the names themselves have commas in them (as in the
> case of Robert Williams, Jr.) I've pasted some sample data below.
> 
> 
> [School District] School Name (Principle, Asst. Principal) School Type
> 
> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
> 
> 
> Thanks,
> Becker
> 
> -----Original Message-----
> From: Robert Picard <[email protected]>
> To: statalist <[email protected]>
> Sent: Wed, Nov 20, 2013 10:51 pm
> Subject: Re: st: Extracting Data
> 
> Becker
> 
> Here's one way to parse each variable using regex functions.
> 
> Robert
> 
> * ---------------- begin example -------------------------
> clear
> input str244 s
> "[Meadowfield] Park Sq (Susan Sims) Middle School"
> "[Somerset] Upton & Pride School (Judith Taper) El School"
> "[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
> end
> 
> gen district = regexs(1) if regexm(s,"\[(.+)\]")
> gen sname = regexs(1) if regexm(s,"\](.+)\(")
> gen principal = regexs(1) if regexm(s,"\((.+)\)")
> gen stype = regexs(1) if regexm(s,"\)(.+)")
> 
> list district sname principal stype
> * ---------------- end example ---------------------------
> 
> On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <[email protected]>
> wrote:
>> 
>> 
>> -----Original Message-----
>> From: Becker Stein <[email protected]>
>> To: statalist <[email protected].>
>> Sent: Wed, Nov 20, 2013 9:23 pm
>> Subject: Help Extracting Data
>> 
>> Hi,
>> 
>> I'm trying to extract data from a single string variable, and I was
>> wondering how to create a regular expression that I can
>> use to do so. I've tried to create one just to extract the school
>> name, but to no avail. My data is set up as: [school district] name of
>> school (name of principle, name of assistant principle (*if any))
>> school type. Below are some examples.
>> 
>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>> 
>> I would like to extract the school name, principle name and asst.
>> principle name as separate variables. Sometimes the names have special
>> characters such as an "&" (as in the case of Upton & Pride) or a ".",
>> and the administrators section may have only 1 name or 2 names
>> (separated by a comma). Also, some of the data in the brackets and
>> parentheses have extra spaces. I initially used the itrim function on
>> the variable, and it removed the extra spaces for the content outside
>> of the brackets and
>> parentheses (i.e., school name and school type), but it didn't work for
>> content inside of them (school district and principal names).
>> Thanks in advance for any/all help.
>> 
>> Best,
>> Becker
>> 
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

