Re: st: data management - string function

From	Steven Samuels <[email protected]>
To	[email protected]
Subject	Re: st: data management - string function
Date	Thu, 25 Dec 2008 04:21:00 -0500
If bw, the original poster, acquires one of the title lists that
Sergiy referred to, a -regexr- statement with no more than one title,
plus its ending variants, on each line may be useful. There are two
kinds of titles: those written out ("Mister", "Doctor", "Colonel") and
their abbreviations ("Mr.", "Dr."), which may, in error, omit the
period. I wrote the following code, which puts each written-out title
on one line and its abbreviation on another. Note the alternation
indicator "|" at the beginning and end of successive lines. These
have proved necessary on lines containing multiple abbreviations
(e.g. "^mr(\.| )") and are harmless on other lines, so I include them
on all. The code also attempts to cope with some scenarios that bw
may encounter: no title, successive spaces, spaces before the title,
and no space after the title. Note that in American English, "Missy"
is a first name. Be sure to zap gremlins before using.

A small point: Many titles are gender-neutral, so an effort to  
determine gender from title will produce many missing values.
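
To illustrate the point about missing values, here is a minimal sketch
that assigns gender from the first word of the name. The variable names
-title- and -gender- and the short lists of titles are assumptions made
for the example, not a complete treatment; any title outside those
lists (or no title at all) leaves -gender- empty.

```stata
* Sketch only: the title lists below are illustrative assumptions
clear
input str40 name
"Mrs Anne Jones"
"Mr Ko Jack"
"Dr. Tom Lester"
"Jack Kroll"
end
* First word, lower-cased, with any period removed
gen title = subinstr(lower(word(name, 1)), ".", "", .)
gen str1 gender = ""
replace gender = "F" if inlist(title, "mrs", "ms", "miss", "mistress")
replace gender = "M" if inlist(title, "mr", "mister")
* "dr" and untitled names stay missing ("")
list name title gender
```

Gender-neutral titles such as "Dr." fall through both -replace-
statements, which is where the missing values come from.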

-Steve

**************************code begins**************************
** Do file to remove titles: Version 4
clear
input str40 name
"Mr John Smith"
"Mr. John Jones"
" Mr Donald Trump"
"Mrs. Felicia Mroz"
"Mrumph Caliph"
" Dr.    Tom Lester "
"drummond katz"
"John Amro"
"Mr.Tim Donner"
"Mister D.D. Smith"
"Doctor Nicholas J. Cox"
"Ms. Virginia Wolfe"
"Ms Jane Austen"
"Missy Columbine"
"Miss Sadie Thompson"
end
* Work on a trimmed, lower-case copy so the patterns need only one case
gen namex = trim(lower(name))

#delim ;
gen str40 name_only = trim(proper(regexr(namex,
"^mister|
|^mr(\.| )|
|^mistress|
|^mrs(\.| )|
|^doctor|
|^dr(\.| )|
|^miss |
|^ms(\.| )"
," ")));
#delim cr
list name name_only
***************************code ends***************************
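
A possible variation, offered as a sketch rather than a tested
replacement: the same titles can be kept in a single pattern stored in
a local macro, which avoids line breaks inside the quoted string. The
macro name -titles-, the shortened sample data, and the requirement of
a space after each written-out title (so that a surname such as
"Doctorow" is left alone) are my assumptions, not part of the code
above.

```stata
* Sketch: one-line pattern held in a local macro (name is hypothetical)
clear
input str40 name
"Mr John Smith"
"Mrs. Felicia Mroz"
"Mrumph Caliph"
"Doctor Nicholas J. Cox"
"Missy Columbine"
"Doctorow Smith"
end
gen namex = trim(lower(name))
* Written-out titles must be followed by a space;
* abbreviations by a period or a space
local titles "^(mister|mistress|doctor|miss) |^(mr|mrs|dr|ms)(\.| )"
gen str40 name_only = trim(proper(regexr(namex, "`titles'", "")))
list name name_only
```

Because both branches are anchored with "^", a title-like string later
in the name (as in "Felicia Mroz") is not touched.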


On Dec 24, 2008, at 2:53 PM, Steven Samuels wrote:
>
> I agree with everything that Sergiy wrote. A technical point: in  
> Howie's code, the "^" must be inside the quotes.  Here's some code  
> I tried for fun.
>
> -Steve
>
> **************************CODE BEGINS**************************
> clear
> drop _all
> input str40 name
> "Mr.Tim Donner"
> "Mister D.D. Smith"
> "Doctor Nicholas J. Cox"
> "Ms. Virginia Wolfe"
> end
>
> gen namex = trim(lower(name))
> #delim ;
> gen str30 name_only = proper(regexr(namex,
> "(^mr(\.| |s |s\.))|(^dr(\.| ))|(^mister)|(^doctor) |(^ms(\.| ))",""));
> #delim cr
> ***************************CODE ENDS***************************
>
>
>
> On Dec 24, 2008, at 11:48 AM, Howard Lempel wrote:
>>
>> Following from Sergiy's advice, I'd like to suggest that bw use
>> regular expressions to delete only occurrences of Mr, Dr, etc.
>> that occur at the beginning of a name.  This should save Dr. Mroz
>> (or someone with last name Mr) from being deleted.  Someone with a
>> first name of MR will still be in trouble (you may want to
>> experiment with finding a way to delete titles only from people
>> where var1 is at least three words, saving someone with first name
>> MR and no title in the data).
>>
>>
>> BW, the caret (^) tells Stata you are searching for characters at
>> the beginning of a string only, so you probably want something to
>> the effect of:
>>
>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:owner- 
>>> [email protected]] On Behalf Of Sergiy Radyakin
>>> Sent: Wednesday, December 24, 2008 11:30 AM
>>> To: [email protected]
>>> Subject: Re: st: data management - string function
>>>
>>> Hi,
>>>
>>> I just hope that this program will not manage banking accounts,
>>> otherwise someone like Dr Mroz
>>> (http://www.unc.edu/~mroz/index_files/vita_mroz_2007_August%5B2%5D.pdf)
>>> will lose all his savings. The program should be very careful about
>>> replacing combinations of letters. When there is no guarantee
>>> that "Mr." is always spelled with a dot (as in the original data
>>> sample in the first email in this thread), spaces should be
>>> incorporated, but even then there is no way you can be sure that
>>> Mr is not a last name. E.g. the common Asian last name "Ng" (e.g.
>>> http://www.drdavidng.com/contact_us.html) would fail many naive
>>> validators (very short, no vowels). Perhaps in some languages "Mr"
>>> is also a name, last name or middle name.
>>>
>>> Also the choice of titles should probably be wider, to allow e.g. for
>>> Dr., Prof., Col., or any combination of these (which can occur in
>>> multiple combinations like "The life and activities of Col. Prof. Dr.
>>> Jezdimir STUDIC" here:
>>> http://www.ncbi.nlm.nih.gov/pubmed/14447887?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DiscoveryPanel.Pubmed_Discovery_RA&linkpos=1&log$=relatedarticles&logdbfrom=pubmed)
>>>
>>> Some of the titles are listed here:
>>> http://ecs.victoria.ac.nz/Groups/AI/TitleGeneratorTitles but more
>>> extensive lists can be found on the internet.
>>> Careless replacing of "Master", "Marquis" or "Baron" might leave some
>>> of the people in your list without a last name.
>>>
>>> The only way to be sure about the title is to ask for it separately
>>> while collecting the data.
>>>
>>> Best regards, Sergiy Radyakin
>>>
>>
>> On Wed, Dec 24, 2008 at 3:37 AM, Ashim Kapoor  
>> <[email protected]> wrote:

>>
>>> About your 2nd query.
>>>
>>>>
>>>> Step 1 : gen gender = word(var1,1)
>>>>
>>>> Then do
>>>>
>>>> replace gender="F" if gender=="Mrs"
>>>> replace gender="F" if gender=="Mrs."
>>>>
>>>> On Wed, Dec 24, 2008 at 1:22 PM, b. water  
>>>> <[email protected]> wrote:
>>>>> dear all,
>>>>>
>>>>> stata 8.2, windows xp,
>>>>>
>>>>> i have a data management problem: have a variable (strings) of  
>>>>> names like these:
>>>>>
>>>>> var1
>>>>> Mrs A Jones
>>>>> Mrs Anne Jones
>>>>> Ms Abra Ham
>>>>> Mr Ko Jack
>>>>> Jack Kroll
>>>>> . <- denotes missing
>>>>> .
>>>>> .
>>>>> Miss. Wonder Full
>>>>> Mrs Bond Trader
>>>>>
>>>>> i want to generate new variable which removed the person's  
>>>>> title, so it appear like these:
>>>>>
>>>>> var2
>>>>> A Jones
>>>>> Anne Jones
>>>>> Abra Ham
>>>>> Ko Jack
>>>>> Jack Kroll
>>>>> No Probs
>>>>> Abra Ham
>>>>> Ko Jack
>>>>> . <- denotes missing
>>>>> .
>>>>> .
>>>>> Wonder Full
>>>>> Bond Trader
>>>>>
>>>>> i want to also generate another variable that will assign
>>>>> gender based on the title of the name
>>>>>
>>>>>

Steven Samuels
845-246-0774
18 Cantine's Island
Saugerties, NY 12477
EFax: 208-498-7441




