Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Extracting Data


From   Sergiy Radyakin <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: Extracting Data
Date   Sun, 24 Nov 2013 05:25:34 -0500

Steve, Nick, almost perfect!
I hate to spoil the fun, but here is a test case which it doesn't handle:

"[Testisland] Test (Peter Ma, PhD, John Ba, PhD) El School"
results in:
 5. |                   Peter Ma                PhD |

(note that Ma and Ba are common last names, besides being
MA=Master/Magister and BA=Bachelor)

Interestingly, changing John's degree to MS causes correct parsing:
  5. |              Peter Ma, PhD        John Ba, MS |

After some investigation, I can be relatively confident that the
problem occurs every time the second person has the same suffix
(title) as the first one, which could be quite common (in case of
titles).



Here is another case with a different error:
"[Testisland] Test (Peter Ba, Sr. BA, John Ba, Jr. BA) El School"
which results in one person:
  5. | Peter Ba, Sr. BA, John Ba, Jr. BA                    |
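
Both failures appear to come from the pattern "(.*)(,)(.*)(`x')(.*)":
the middle (.*) lets a comma match even when it is far from the suffix,
and each suffix is replaced only once per string. A minimal sketch of
one possible workaround, assuming the comma is always followed by
exactly one space and that "_" never occurs in the data, is to use
-subinstr()-, which treats the suffix literally and replaces every
occurrence:

```stata
clear
input str244 p
"Peter Ma, PhD, John Ba, PhD"
"Peter Ma, PhD, John Ba, MS"
"Peter Ba, Sr. BA, John Ba, Jr. BA"
end

local suffix "Jr. Sr. BA B.A. B.S. BS MS M.Sc. Ph.D. PhD"

* Replace ", suffix" by "_suffix" wherever it occurs.
* subinstr() matches literally, so the "." in "Jr." is not a wildcard,
* and the final "." argument means "all occurrences".
foreach x of local suffix {
    replace p = subinstr(p, ", `x'", "_`x'", .)
}

* Only commas that separate persons are left at this point
split p, parse(",")

* Put the protected commas back and trim stray blanks
foreach v of varlist p1 p2 {
    replace `v' = trim(subinstr(`v', "_", ", ", .))
}
list p1 p2
```

With these three test strings each row should split into two persons,
with the suffixes kept. It still cannot tell a suffix from a name that
happens to start with the same letters (", MS" would also catch a
hypothetical ", MSmith"), so it is not a general solution.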


The above cases seem to be fixable, but imho the most difficult part
of the assignment (the one without which I don't think one can search
for a solution) is that I don't see a rule for deciding whether:
"John Smith, Ba Ma"
is one person with two degrees, or two persons with no (or
unspecified) degrees. If you think you know the answer, meet Ba Ma:
http://in.linkedin.com/pub/ba-ma/38/4b2/838

Perhaps you could enforce case-sensitivity and check for caps in
degrees? Or rely on dots (as recommended here
http://www.slc.edu/style-guide/), or at least know how many people are
in the list?
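
The case-sensitivity idea can be sketched as follows. This is only an
illustration: the variable names (piece, alldeg) and the short degree
list are made up for the example, and it assumes degrees are always
typed in their standard capitalization. A piece is flagged as "degrees
only" when every word in it exactly matches an entry in the list:

```stata
clear
input str40 piece
"BA MA"
"Ba Ma"
"B.A."
"PhD"
"Sen"
end

* Hypothetical, deliberately short list of accepted spellings
local degrees "BA B.A. BS B.S. MA M.A. MS M.Sc. PhD Ph.D."

gen byte alldeg = 1
gen nwords = wordcount(piece)
summarize nwords, meanonly

* Check each word in turn; == on strings is case-sensitive in Stata
forvalues i = 1/`r(max)' {
    gen w = word(piece, `i')
    gen byte hit = 0
    foreach d of local degrees {
        replace hit = 1 if w == "`d'"
    }
    replace alldeg = 0 if w != "" & !hit
    drop w hit
}
list piece alldeg
```

Here "BA MA", "B.A." and "PhD" are flagged as degrees, while "Ba Ma"
and "Sen" are treated as names. A person called Ba Ma whose name was
entered in capitals would still be misclassified, so this narrows the
ambiguity rather than removing it.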

Don't forget to extend the list of abbreviations with Prof., Hon.,
Rev., Rep., Sen., Gen., Capt., Sgt., Pvt., etc. And don't
forget that some of them can be (part of) perfectly real names too!
e.g. John Capt:
http://www.linkedin.com/pub/john-capt/5/529/68b
or Peter Sen:
http://www.linkedin.com/in/petesen
or for that matter Amartya Sen:
en.wikipedia.org/wiki/Amartya_Sen
I've found numerous people with the last names Hon, Rev, Rep, Sen, Gen
and Capt, including combinations of two of them in one person, e.g.:
http://cn.linkedin.com/pub/ma-sen/16/516/255
http://ca.linkedin.com/pub/ed-ma/1/386/784


Compiling a comprehensive list of only degrees is a serious task on
its own. Start with AA, AS, AAS, BA, BS, MA, MS, PhD, EdD, MD, DDS, DSc,
LL.D., BFA, BA/MA, BMus, DPT, MFA, MPH, MPT, MSEd, MSW.
And wait till you get to Dr.sc.math, Dr.sc.agr and various foreign degrees...

School names can be fun too! You guessed it: yes, they can include
parentheses in the official school name:
http://profiles.dcps.dc.gov/Fillmore+Arts+Center+%28East%29
But that's a whole other story...


Is there anything in the data generating software that could eliminate
the need for parsing, or is it raw user-entered data? Avoid parsing if
possible; this is usually the safest way to go. Otherwise, check the
results very carefully after the program completes.
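
As one example of such a check, the sketch below splits the contents
of the parentheses on every comma and flags rows that deserve a manual
look. The variable names (piece*, suspect) and both rules are
assumptions made for illustration, not a complete validation:

```stata
clear
input str60 p
"Susan Sims"
"Peter Ma, PhD, John Ba, PhD"
"Jason Stevenson, Jill Harris"
end

* Naive split on every comma, used only as a diagnostic
split p, parse(",") generate(piece)

gen byte suspect = 0
foreach v of varlist piece* {
    replace `v' = trim(`v')
    * A one-word piece is more likely a title or degree than a full name
    replace suspect = 1 if wordcount(`v') == 1
}

* More than two pieces cannot be just "principal, assistant principal"
capture confirm variable piece3
if _rc == 0 {
    replace suspect = 1 if piece3 != ""
}

list p suspect if suspect
```

In this toy data only the second row is flagged. Rows that pass are not
thereby proven correct; the flag only narrows down where to look first.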

Best,
  Sergiy Radyakin

On Sat, Nov 23, 2013 at 3:19 PM, Steve Samuels <[email protected]> wrote:
> I hadn't read Nick's post when I wrote mine. Both ideas follow the same
> logic. I omitted the comma, but Nick's suggestion of creating a
> temporary placeholder is superior. Here's a revised version which
> incorporates Nick's idea. It also adds abbreviations with periods (".")
> and retains these where present.
>
> Steve
>
> *********************Code Begins**************************
> clear
> input str244 s
> "[Meadowfield] Park Sq (Susan Sims) Middle School"
> "[Somerset] Upton & Pride School (Judith Taper, MA PhD) El School"
> "[Temperly] Lakewood (Jason Stevenson Jr.,  B.A., Jill Harris, BA ) K-12"
> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
> end
>
> local suffix "Jr. Sr. BA  B.A. B.S. BS MS M.Sc. Ph.D. PhD"
> gen p = regexs(1) if regexm(s,"\((.+)\)") /*Robert's code */
>
> /* Now remove a single comma preceding the suffixes */
> foreach x of local suffix {
> replace p  =regexs(1)+"_"+regexs(3)+regexs(4)+regexs(5) ///
>  if regexm(p,"(.*)(,)(.*)(`x')(.*)")
> }
>
> split p, p(",")
> /* Restore the placeholder "_" to a comma (no loop is needed here) */
> replace p1 = subinstr(p1,"_",",",.)
> replace p2 = subinstr(p2,"_",",",.)
> list p1 p2
> *******************Code Ends******************************
>
>
>
> On Nov 22, 2013, at 4:24 AM, Nick Cox wrote:
>
> This sounds like a two-stage process. For example, you might use
> -split- to split a variable containing the one or two names. ", Jr."
> needs special treatment. I'd edit ", Jr" to "_Jr." and then edit back.
>
> For "Principle" read "Principal" throughout.
> Nick
> [email protected]
>
>
> On 22 November 2013 04:58, Becker Stein <[email protected]> wrote:
>> Hi,
>>
>> I asked this question yesterday. I needed help creating a regex to
>> extract data from a single string variable. Robert's solution was
>> really helpful. I was able to generate the School District,
>> School Name, and School Type variables. However, I run into problems
>> trying to create the Principle and Assist. Principles variables.
>> The command
>>   gen principal = regexs(1) if regexm(s,"\((.+)\)")
>> returns all of the contents in the parentheses, but I need the
>> contents before the comma to generate the principle name variable
>> and the contents after the comma to generate the assist. principle
>> name (if any). It gets a little complicated
>> because sometimes the names themselves have commas in them (as in the
>> case of Robert Williams, Jr.) I've pasted some sample data below.
>>
>>
>> [School District] School Name (Principle, Asst. Principal) School Type
>>
>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>
>>
>> Thanks,
>> Becker
>>
>> -----Original Message-----
>> From: Robert Picard <[email protected]>
>> To: statalist <[email protected]>
>> Sent: Wed, Nov 20, 2013 10:51 pm
>> Subject: Re: st: Extracting Data
>>
>> Becker
>>
>> Here's one way to parse each variable using regex functions.
>>
>> Robert
>>
>> * ---------------- begin example -------------------------
>> clear
>> input str244 s
>> "[Meadowfield] Park Sq (Susan Sims) Middle School"
>> "[Somerset] Upton & Pride School (Judith Taper) El School"
>> "[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
>> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
>> end
>>
>> gen district = regexs(1) if regexm(s,"\[(.+)\]")
>> gen sname = regexs(1) if regexm(s,"\](.+)\(")
>> gen principal = regexs(1) if regexm(s,"\((.+)\)")
>> gen stype = regexs(1) if regexm(s,"\)(.+)")
>>
>> list district sname principal stype
>> * ---------------- end example ---------------------------
>>
>> On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <[email protected]> wrote:
>>>
>>>
>>> -----Original Message-----
>>> From: Becker Stein <[email protected]>
>>> To: statalist <[email protected].>
>>> Sent: Wed, Nov 20, 2013 9:23 pm
>>> Subject: Help Extracting Data
>>>
>>> Hi,
>>>
>>> I'm trying to extract data from a single string variable, and I was
>>> wondering how to create a regular expression that I can
>>> use to do so. I've tried to create one just to extract the school
>>> name, but to no avail. My data is set up as: [school district] name of
>>> school (name of principle, name of assistant principle (*if any))
>>> school type. Below are some examples.
>>>
>>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>>> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
>>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>>
>>> I would like to extract the school name, principle name and asst.
>>> principle name as separate variables. Sometimes the names have special
>>> characters such as an "&" (as in the case of Upton & Pride) or a ".",
>>> and the administrators section may have only 1 name or 2 names
>>> (separated by a comma). Also, some of the data in the brackets and
>>> parentheses have extra spaces. I initially used the itrim function on
>>> the variable, and it removed the extra spaces for the content outside
>>> of the brackets and parentheses (i.e., school name and school type),
>>> but it didn't work for content inside of them (school district and
>>> principal names).
>>> Thanks in advance for any/all help.
>>>
>>> Best,
>>> Becker
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/

