Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Robert Picard <picard@netbox.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: Extracting Data |
Date | Wed, 20 Nov 2013 17:49:55 -0500 |
Becker,

Here's one way to parse each variable using regex functions.

Robert

* ---------------- begin example -------------------------
clear
input str244 s
"[Meadowfield] Park Sq (Susan Sims) Middle School"
"[Somerset] Upton & Pride School (Judith Taper) El School"
"[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
"[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
end

* district: text between the square brackets
gen district = regexs(1) if regexm(s,"\[(.+)\]")
* school name: text between "]" and "("
gen sname = regexs(1) if regexm(s,"\](.+)\(")
* administrator(s): text between the parentheses
gen principal = regexs(1) if regexm(s,"\((.+)\)")
* school type: everything after ")"
gen stype = regexs(1) if regexm(s,"\)(.+)")

list district sname principal stype
* ---------------- end example ---------------------------

On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <becker.stein@aol.com> wrote:
>
> -----Original Message-----
> From: Becker Stein <becker.stein@aol.com>
> To: statalist <statalist@hsphsun2.harvard.edu.>
> Sent: Wed, Nov 20, 2013 9:23 pm
> Subject: Help Extracting Data
>
> Hi,
>
> I'm trying to extract data from a single string variable, and I was
> wondering how to create a regular expression that I can
> use to do so. I've tried to create one just to extract the school
> name, but to no avail. My data is set up as: [school district] name of
> school (name of principal, name of assistant principal (*if any))
> school type. Below are some examples.
>
> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
> [Somerset] Upton & Pride Day School (Judith Taper) Elementary School
> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>
> I would like to extract the school name, principal name and asst.
> principal name as separate variables. Sometimes the names have special
> characters such as an "&" (as in the case of Upton & Pride) or a ".",
> and the administrators section may have only 1 name or 2 names
> (separated by a comma). Also, some of the data in the brackets and
> parentheses have extra spaces.
> I initially used the itrim function on
> the variable, and it removed the extra spaces for the content outside of the
> brackets and
> parentheses (i.e., school name and school type), but it didn't work for
> content inside of them (school district and principal names).
> Thanks in advance for any/all help.
>
> Best,
> Becker
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
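
A possible follow-up to the regex example above, offered only as a sketch: the extracted strings still carry the stray blanks mentioned in the question, and the administrators are still in a single variable. The code below assumes the variables district, sname, principal and stype already exist, and it splits the administrators on the first comma. The names comma, pname and aname are hypothetical, chosen for illustration. One known limitation: an entry such as "Bob Williams, Jr." contains a comma inside a single name, so this rule would put "Jr." in the assistant principal variable; such cases would need separate handling.

```stata
* ---------------- begin sketch --------------------------
* Assumes district, sname, principal and stype were created
* by the regex example earlier in this message.

* strip leading/trailing blanks and collapse internal runs of blanks
foreach v of varlist district sname principal stype {
    replace `v' = itrim(trim(`v'))
}

* position of the first comma in the administrators string (0 if none)
gen comma = strpos(principal, ",")

* hypothetical names: pname = principal, aname = assistant principal
gen pname = cond(comma > 0, trim(substr(principal, 1, comma - 1)), principal)
gen aname = cond(comma > 0, trim(substr(principal, comma + 1, .)), "")
drop comma

list pname aname
* ---------------- end sketch ----------------------------
```

Splitting on the first comma keeps the sketch short; if names with suffixes are common in the data, a more specific pattern for the suffix would be a safer choice.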