


Re: st: Extracting Data

From   Robert Picard <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: Extracting Data
Date   Wed, 20 Nov 2013 17:49:55 -0500


Here's one way to parse each variable using regex functions.


* ---------------- begin example -------------------------
clear
input str244 s
"[Meadowfield] Park Sq (Susan Sims) Middle School"
"[Somerset] Upton & Pride School (Judith Taper) El School"
"[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
"[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
end

* district: text between "[" and "]"
gen district = regexs(1) if regexm(s,"\[(.+)\]")
* school name: text between "]" and "("
gen sname = regexs(1) if regexm(s,"\](.+)\(")
* principal(s): text between "(" and ")"
gen principal = regexs(1) if regexm(s,"\((.+)\)")
* school type: everything after ")"
gen stype = regexs(1) if regexm(s,"\)(.+)")

list district sname principal stype
* ---------------- end example ---------------------------
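
The pieces captured this way keep any stray spaces that sat inside the brackets or parentheses, and principal can still hold two names. Below is a minimal sketch of one way to tidy that up, not a definitive solution. The names admin1, admin2, cpos and suffix are hypothetical, chosen only for illustration. It assumes at most two administrators per school, and that a trailing "Jr." or "Sr." after a comma belongs to the first name rather than being a second person. trim() and itrim() are the function names used in the question; newer Stata releases document them as strtrim() and stritrim().

```stata
clear
input str244 s
"[Meadowfield] Park Sq (Susan Sims) Middle School"
"[Somerset] Upton & Pride School (Judith Taper) El School"
"[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
"[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
end

* capture the text inside the parentheses, then drop leading/trailing
* blanks and collapse runs of internal blanks
gen principal = trim(itrim(regexs(1))) if regexm(s,"\((.+)\)")

* position of the first comma (0 if there is none)
gen cpos = strpos(principal, ",")

* admin1 = text before the first comma, or the whole string if no comma
gen admin1 = trim(substr(principal, 1, cpos - 1)) if cpos > 0
replace admin1 = principal if cpos == 0

* admin2 = text after the first comma, if any
gen admin2 = trim(substr(principal, cpos + 1, .)) if cpos > 0

* assumption: a lone "Jr"/"Sr" after the comma is a name suffix,
* not a second administrator, so fold it back into admin1
gen byte suffix = regexm(admin2, "^(Jr|Sr)\.?$")
replace admin1 = principal if suffix
replace admin2 = "" if suffix

drop cpos suffix
list admin1 admin2
```

A string holding both a suffix and a second name (for example "A B, Jr., C D") would not be split correctly by this sketch and would need an extra rule.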

On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <[email protected]> wrote:
> -----Original Message-----
> From: Becker Stein <[email protected]>
> To: statalist <[email protected].>
> Sent: Wed, Nov 20, 2013 9:23 pm
> Subject: Help Extracting Data
> Hi,
> I'm trying to extract data from a single string variable, and I was
> wondering how to create a regular expression that I can
> use to do so. I've tried to create one just to extract the school
> name, but to no avail. My data is set up as: [school district] name of
> school (name of principal, name of assistant principal (*if any))
> school type. Below are some examples.
> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
> I would like to extract the school name, principal name and asst.
> principal name as separate variables. Sometimes the names have special
> characters such as an "&" (as in the case of Upton & Pride) or a ".",
> and the administrators section may have only 1 name or 2 names
> (separated by a comma). Also, some of the data in the brackets and
> parentheses have extra spaces. I initially used the itrim function on
> the variable, and it removed the extra spaces for the content outside of the
> brackets and parentheses (i.e., school name and school type), but it
> didn't work for content inside of them (school district and principal names).
> Thanks in advance for any/all help.
> Best,
> Becker