Statalist



Re: st: data management - string function


From   Steven Samuels <sjhsamuels@earthlink.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: data management - string function
Date   Wed, 24 Dec 2008 14:53:08 -0500

I agree with everything that Sergiy wrote. A technical point: in Howie's code, the "^" must be inside the quotes. Here's some code I tried for fun.

-Steve

**************************CODE BEGINS**************************
clear
drop _all
input str40 name
"Mr John Smith"
"Mr. John Jones"
" Mr Donald Trump"
"Mrs. Felicia Mroz"
"Mrumph Caliph"
" Dr.    Tom Lester "
"drummond katz"
"John Amro"
"Mr.Tim Donner"
"Mister D.D. Smith"
"Doctor Nicholas J. Cox"
"Ms. Virginia Wolfe"
end

gen namex = trim(lower(name))
#delim ;
gen str30 name_only = proper(regexr(namex,
"(^mr(\.| |s |s\.))|(^dr(\.| ))|(^mister)|(^doctor)|(^ms(\.| ))",""));
#delim cr
list name name_only
***************************CODE ENDS***************************



On Dec 24, 2008, at 11:48 AM, Howard Lempel wrote:

Hi all,

Following Sergiy's advice, I'd like to suggest that bw use regular expressions to delete only those occurrences of Mr, Dr, etc. that occur at the beginning of a name. This should save Dr. Mroz (or someone with last name Mr) from having part of the name deleted. Someone with a first name of MR will still be in trouble (you may want to experiment with deleting titles only where var1 has at least three words, which would save someone with first name MR and no title in the data).

I don't have time to write out the full code, but see the regular expression FAQ here: http://www.stata.com/support/faqs/data/regex.html

Also look up -help regexm-

BW, the caret (^) tells Stata to search for characters at the beginning of a string only, so you probably want something to the effect of:

Gen var2 = regexr(var1,^("MR" | "MR." | "Mr" | . . .),)

Note: That code is untested, unfinished, and written by someone w/o expertise on regular expressions (e.g. I'd need to look up exactly how the "OR" operator and parentheses work).
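
A minimal sketch of how the anchored pattern could be written, with the caret inside the quoted regular expression. The variable names var1 and var2 follow the original post; the four sample names and the short title list (Mrs, Mr, Ms, Miss) are assumptions chosen for illustration, not a complete list of titles.

```stata
* Illustrative data only; var1 is the name variable from the original post
clear
input str30 var1
"Mr Ko Jack"
"Mrs. Anne Jones"
"Dr Mroz"
"Jack Kroll"
end

* ^ anchors the match at the start of the string;
* (\.| ) requires a dot or a space right after the title
gen var2 = trim(regexr(var1, "^(Mrs|Mr|Ms|Miss)(\.| )", ""))
list var1 var2
```

Because the match is anchored and must be followed by a dot or a space, the "Mr" inside "Mroz" is left alone, and names without a listed title pass through unchanged.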

Hope this helps.
Howie

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sergiy Radyakin
Sent: Wednesday, December 24, 2008 11:30 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: data management - string function

Hi,

I just hope that this program will not manage bank accounts,
otherwise someone like Dr Mroz
(http://www.unc.edu/~mroz/index_files/vita_mroz_2007_August%5B2%5D.pdf)
will lose all his savings. The program should be very careful about
replacing combinations of letters. When there is no guarantee
that "Mr." is always spelled with a dot (as in the original data
sample in the first email in this thread), spaces should be
incorporated into the pattern, but even then you cannot be sure that Mr is
not a last name. E.g. the common Asian last name "Ng" (e.g.
http://www.drdavidng.com/contact_us.html) would be rejected by many naive
validators (very short, no vowels). Perhaps in some languages "Mr" is
also a first name, last name or middle name.

Also the choice of titles should probably be wider, to allow e.g. for
Dr., Prof., Col., or any combination of these (as in "The life and
activities of Col. Prof. Dr. Jezdimir STUDIC" here:
http://www.ncbi.nlm.nih.gov/pubmed/14447887?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DiscoveryPanel.Pubmed_Discovery_RA&linkpos=1&log$=relatedarticles&logdbfrom=pubmed)

Some of the titles are listed here:
http://ecs.victoria.ac.nz/Groups/AI/TitleGeneratorTitles but more
extensive lists can be found on the internet.
Careless replacing of "Master", "Marquis" or "Baron" might leave some
of the people in your list without a last name.

The only way to be sure about the title is to ask for it separately
while collecting the data.

Best regards, Sergiy Radyakin


On Wed, Dec 24, 2008 at 3:37 AM, Ashim Kapoor <ashimkapoor@gmail.com> wrote:
About your 2nd query.

Step 1 : gen gender = word(var1,1)

Then do

replace gender="F" if gender=="Mrs"
replace gender="F" if gender=="Ms"
replace gender="M" if gender=="Mr"

Trouble: what if you have Mr. (notice the dot) in place of Mr?

So we do

replace gender="F" if gender=="Mrs."
replace gender="F" if gender=="Ms."
replace gender="M" if gender=="Mr."

I think this should do it.
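
A more compact sketch of the same idea, assuming the title is always the first word of var1. It strips any dot from the first word and then checks it against a short list with inlist(). The helper variable title and the sample rows are hypothetical, and the title list covers only the titles named in the original question.

```stata
* Illustrative data only
clear
input str30 var1
"Mrs A Jones"
"Mr. Ko Jack"
"Miss. Wonder Full"
"Jack Kroll"
end

* first word of the name, with any dot removed ("Mr." becomes "Mr")
gen title = subinstr(word(var1, 1), ".", "", .)

* start from empty (missing) and fill in only recognised titles
gen gender = ""
replace gender = "F" if inlist(title, "Mrs", "Ms", "Miss")
replace gender = "M" if title == "Mr"
list var1 title gender
```

Names without a recognised title, such as "Jack Kroll", keep an empty gender instead of carrying the first name over into the gender variable.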

Merry Xmas to you.

Ashim.

On Wed, Dec 24, 2008 at 2:03 PM, Ashim Kapoor <ashimkapoor@gmail.com> wrote:
Hello!

I think you want to do this:

gen j=var1

gen j2=subinstr(j,"Mrs","",1)
gen j3=subinstr(j2,"Mr","",1)
gen j4=subinstr(j3,"Ms","",1)

Note the order of j2 and j3: it is needed because Mr is a
substring of Mrs. The result would be ruined if you did it the other way round.
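
A small sketch of why the order matters, using two made-up names. The variable names wrong and right are hypothetical. It also shows that subinstr() removes the first occurrence wherever it appears, not just at the start of the name.

```stata
* Illustrative data only
clear
input str30 var1
"Mrs Anne Jones"
"Tom Mroz"
end

* removing "Mr" first leaves the "s" of "Mrs" behind
gen wrong = subinstr(var1, "Mr", "", 1)

* removing "Mrs" before "Mr" handles the title correctly
gen right = subinstr(subinstr(var1, "Mrs", "", 1), "Mr", "", 1)
list
```

In the first row, wrong is "s Anne Jones" while right is " Anne Jones" (with a leading space that trim() would remove). In the second row both versions turn "Tom Mroz" into "Tom oz", since nothing ties the match to the beginning of the string.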

I hope you liked it.

Thank you,
Ashim.


On Wed, Dec 24, 2008 at 1:22 PM, b. water <barleywater@hotmail.com> wrote:
dear all,

stata 8.2, windows xp,

i have a data management problem: have a variable (strings) of names like these:

var1
Mrs A Jones
Mrs Anne Jones
Ms Abra Ham
Mr Ko Jack
Jack Kroll
No Probs
Ms. Abra Ham
Mr. Ko Jack
. <- denotes missing
.
.
Miss. Wonder Full
Mrs Bond Trader

i want to generate a new variable with the person's title removed, so it appears like this:

var2
A Jones
Anne Jones
Abra Ham
Ko Jack
Jack Kroll
No Probs
Abra Ham
Ko Jack
. <- denotes missing
.
.
Wonder Full
Bond Trader

i tried (thinking that i would slowly truncate Mr, Mrs, Ms title by title):

gen var2=var1
replace var2=subinstr("Mr","Mr","",.) <- just as well i generate var2 as this command wiped out all the names!
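
A likely explanation, offered as a sketch: the first argument of subinstr() is the string to search, so subinstr("Mr","Mr","",.) searches the literal "Mr" and returns an empty string for every observation. Passing the variable instead would look something like this (the sample rows are made up, and including the trailing space in "Mr " is an assumption that the title is followed by a space):

```stata
* Illustrative data only
clear
input str30 var1
"Mr Ko Jack"
"Jack Kroll"
end

gen var2 = var1
* search var2 itself, not the literal "Mr"; remove the first "Mr " only
replace var2 = subinstr(var2, "Mr ", "", 1)
list
```

This still matches "Mr " anywhere in the name, so it is only a first step toward the title-by-title approach described above.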

i also want to generate another variable that assigns gender based on the title of the name in var1, i.e. if Mr or Mr. then M(ale) and if Mrs, Mrs., Ms, Ms., Miss, Miss. then F(emale). i thought generate/replace or replace/if using string functions would help but i think this requires a loop of some sort.

F
F
F
M
.
.
F
M
.
.
.
F
F

thanks for advice/help.

season's greetings,
bw





*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/




