Statalist



RE: st: data management - string function


From   Howard Lempel <HLempel@brookings.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: data management - string function
Date   Wed, 24 Dec 2008 11:48:12 -0500

Hi all,

Following from Sergiy's advice, I'd like to suggest that bw use regular expressions to delete only those occurrences of Mr, Dr, etc. that occur at the beginning of a name.  This should save Dr. Mroz (or someone with last name Mr) from being deleted.  Someone with a first name of MR will still be in trouble (you may want to experiment with deleting titles only where var1 is at least three words long, which would save someone with first name MR and no title in the data).

I don't have time to write out the full code, but see the regular expression FAQ here: http://www.stata.com/support/faqs/data/regex.html

Also look up -help regexm-

BW, the caret (^) tells Stata to match only at the beginning of a string, so you probably want something to the effect of:

gen var2 = regexr(var1, "^(MR\.|MR|Mr)", "")
* add the remaining titles (Mrs, Ms, . . .) inside the parentheses, separated by |

Note: That code is untested, unfinished, and written by someone w/o expertise on regular expressions (e.g. I'd need to look up exactly how the "OR" operator and parentheses work).
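
A slightly fuller sketch of the same anchored match, offered in the same untested spirit. It assumes a Stata release in which regexm(), regexr() and regexs() are available, and the title list in the local macro is an illustrative guess rather than something checked against bw's data. The names pattern, var3, title and gender are hypothetical.

```stata
* Sketch only: the title list is an assumption; extend it to suit the data.
local titles "Mrs|Miss|Mr|Ms|Dr"

* ^ anchors the match at the start of the string, \.? allows an optional
* dot, and " +" requires at least one space after the title, so a name
* such as "Mroz Thomas" is left untouched.
local pattern "^(`titles')\.? +"
generate var2 = regexr(var1, "`pattern'", "")

* Stricter variant: strip the title only when the name has three or more words.
generate var3 = var1
replace var3 = regexr(var1, "`pattern'", "") if wordcount(var1) >= 3

* The captured title can also feed the gender variable from the original post.
generate title = regexs(1) if regexm(var1, "`pattern'")
generate gender = ""
replace gender = "M" if title == "Mr"
replace gender = "F" if inlist(title, "Mrs", "Miss", "Ms")
```

The three-word variant trades one error for another: it protects a two-word name whose first word happens to be a title, but it also leaves a genuine title in place on entries such as "Mr Jack".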

Hope this helps.
Howie

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sergiy Radyakin
Sent: Wednesday, December 24, 2008 11:30 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: data management - string function

Hi,

I just hope that this program will not manage bank accounts,
otherwise someone like Dr Mroz
(http://www.unc.edu/~mroz/index_files/vita_mroz_2007_August%5B2%5D.pdf)
will lose all his savings. The program should be very careful about
replacing combinations of letters. When there is no guarantee
that "Mr." is always spelled with a dot (as in the original data
sample in the first email in this thread), spaces should be
incorporated into the match, but even then there is no way to be sure
that Mr is not a last name. E.g. the common Asian last name "Ng" (e.g.
http://www.drdavidng.com/contact_us.html) would fail many naive
validators (very short, no vowels). Perhaps in some languages "Mr" is
also a first name, last name or middle name.

Also the choice of titles should probably be wider, to allow e.g. for
Dr., Prof., Col., or any combination of these (titles can be
stacked, as in "The life and activities of Col. Prof. Dr.
Jezdimir STUDIC" here:
http://www.ncbi.nlm.nih.gov/pubmed/14447887?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DiscoveryPanel.Pubmed_Discovery_RA&linkpos=1&log$=relatedarticles&logdbfrom=pubmed)

Some of the titles are listed here:
http://ecs.victoria.ac.nz/Groups/AI/TitleGeneratorTitles but more
extensive lists can be found on the internet.
Careless replacing of "Master", "Marquis" or "Baron" might leave some
of the people in your list without a last name.
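
A minimal sketch of how stacked titles might be stripped, assuming a Stata release with regexm() and regexr() and an illustrative title list (not an exhaustive one). The macro names titles and pattern are hypothetical, and the pattern requires a space after each title, as suggested above.

```stata
* Sketch only: the title list is illustrative, not exhaustive.
local titles "Col|Prof|Dr|Mrs|Miss|Mr|Ms"
local pattern "^(`titles')\.? +"

generate var2 = var1

* Strip one leading title per pass until no name starts with a title.
quietly count if regexm(var2, "`pattern'")
while r(N) > 0 {
    quietly replace var2 = regexr(var2, "`pattern'", "")
    quietly count if regexm(var2, "`pattern'")
}
```

Each pass removes at most one leading title, so a name with three stacked titles takes three passes. The caveat above still applies: anyone whose first name is spelled like a listed title and followed by a space would lose it.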

The only way to be sure about the title is to ask for it separately
while collecting the data.

Best regards, Sergiy Radyakin


On Wed, Dec 24, 2008 at 3:37 AM, Ashim Kapoor <ashimkapoor@gmail.com> wrote:
> About your 2nd query.
>
> Step 1 : gen gender = word(var1,1)
>
> Then do
>
> replace gender="F" if gender=="Mrs"
> replace gender="F" if gender=="Ms"
> replace gender="F" if gender=="Miss"
> replace gender="M" if gender=="Mr"
>
> The trouble: what if you have Mr. (notice the dot) in place of Mr?
>
> So we also do
>
> replace gender="F" if gender=="Mrs."
> replace gender="F" if gender=="Ms."
> replace gender="F" if gender=="Miss."
> replace gender="M" if gender=="Mr."
>
> I think this should do it.
>
> Merry Xmas to you.
>
> Ashim.
>
> On Wed, Dec 24, 2008 at 2:03 PM, Ashim Kapoor <ashimkapoor@gmail.com> wrote:
>> Hello!
>>
>> I think you want to do this:
>>
>> gen j=var1
>>
>> gen j2=subinstr(j,"Mrs","",1)
>> gen j3=subinstr(j2,"Mr","",1)
>> gen j4=subinstr(j3,"Ms","",1)
>>
>> Note the order of j2 and j3: it is needed because Mr is a
>> substring of Mrs. The result would be ruined if you did it the other way round.
>>
>> I hope you liked it.
>>
>> Thank you,
>> Ashim.
>>
>>
>> On Wed, Dec 24, 2008 at 1:22 PM, b. water <barleywater@hotmail.com> wrote:
>>> dear all,
>>>
>>> stata 8.2, windows xp,
>>>
>>> i have a data management problem: have a variable (strings) of names like these:
>>>
>>> var1
>>> Mrs A Jones
>>> Mrs Anne Jones
>>> Ms Abra Ham
>>> Mr Ko Jack
>>> Jack Kroll
>>> No Probs
>>> Ms. Abra Ham
>>> Mr. Ko Jack
>>> . <- denotes missing
>>> .
>>> .
>>> Miss. Wonder Full
>>> Mrs Bond Trader
>>>
>>> i want to generate a new variable with the person's title removed, so it appears like this:
>>>
>>> var2
>>> A Jones
>>> Anne Jones
>>> Abra Ham
>>> Ko Jack
>>> Jack Kroll
>>> No Probs
>>> Abra Ham
>>> Ko Jack
>>> . <- denotes missing
>>> .
>>> .
>>> Wonder Full
>>> Bond Trader
>>>
>>> i tried (thinking that i would slowly truncate Mr, Mrs, Ms title by title):
>>>
>>> gen var2=var1
>>> replace  var2=subinstr("Mr","Mr","",.) <- just as well i generate var2 as this command wiped out all the names!
>>>
>>> i also want to generate another variable that assigns gender based on the title in var1, i.e. if Mr or Mr. then M(ale) and if Mrs, Mrs., Ms, Ms., Miss, Miss. then F(emale). i thought generate/replace or replace/if using string functions would help but i think this requires a loop of some sort.
>>>
>>> F
>>> F
>>> F
>>> M
>>> .
>>> .
>>> F
>>> M
>>> .
>>> .
>>> .
>>> F
>>> F
>>>
>>> thanks for any advice/help.
>>>
>>> season's greetings,
>>> bw
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/
>>>
>>


© Copyright 1996–2014 StataCorp LP