
Re: st: data management - string function

From	Steven Samuels <[email protected]>
To	[email protected]
Subject	Re: st: data management - string function
Date	Wed, 24 Dec 2008 14:53:08 -0500

I agree with everything that Sergiy wrote. A technical point: in Howie's code, the "^" must be inside the quotes. Here's some code I tried for fun.


-Steve

**************************CODE BEGINS**************************
clear
drop _all
input str40 name
"Mr John Smith"
"Mr. John Jones"
" Mr Donald Trump"
"Mrs. Felicia Mroz"
"Mrumph Caliph"
" Dr.    Tom Lester "
"drummond katz"
"John Amro"
"Mr.Tim Donner"
"Mister D.D. Smith"
"Doctor Nicholas J. Cox"
"Ms. Virginia Wolfe"
end

gen namex = trim(lower(name))
#delim ;
gen str30 name_only = proper(regexr(namex,
    "(^mr(\.| |s |s\.))|(^dr(\.| ))|(^mister)|(^doctor) |(^ms(\.| ))",
    ""));
#delim cr
list name name_only
***************************CODE ENDS***************************



On Dec 24, 2008, at 11:48 AM, Howard Lempel wrote:

Hi all,
Following from Sergiy's advice, I'd like to suggest that bw use regular expressions to delete only those occurrences of Mr, Dr, etc. that occur at the beginning of a name. This should save Dr. Mroz (or someone with last name Mr) from being deleted. Someone with a first name of MR will still be in trouble (you may want to experiment with finding a way to delete titles only from people where var1 is at least three words, saving someone with first name MR and no title in the data).
I don't have time to write out the full code, but see the regular expression FAQ here: http://www.stata.com/support/faqs/data/regex.html
Also look up -help regexm-
BW, the caret (^) tells Stata you are searching for characters at the beginning of a string only, so you probably want something to the effect of:
Gen var2 = regexr(var1,^("MR" | "MR." | "Mr" | . . .),)
Note: That code is untested, unfinished, and written by someone w/o expertise on regular expressions (e.g. I'd need to look up exactly how the "OR" operator and parentheses work).
Hope this helps.
Howie
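
A minimal sketch of Howie's suggestion, with the caret moved inside the quoted pattern and the "at least three words" guard added through wordcount(). The sample rows and the title list are assumptions made for illustration, not part of the original post. A title counts only when it is followed by a space, so "Mr.Tim" (no space) would be left alone, and so would a two-word entry such as "Mr Smith".

```stata
* Sketch only: assumes the names are in a string variable called var1.
clear
input str30 var1
"Mr Ko Jack"
"Mr. Ko Jack"
"Mrs Anne Jones"
"Ms. Abra Ham"
"Miss. Wonder Full"
"Mroz Thomas"
"Jack Kroll"
end

* The caret sits inside the quoted pattern. Each title must be followed by
* an optional dot and at least one space, so "Mroz" is not read as "Mr".
gen str30 var2 = var1
replace var2 = regexr(var1, "^(Mrs|Mr|Ms|Miss)\.? +", "") if wordcount(var1) >= 3
list var1 var2
```

The wordcount() condition is the trade-off Howie describes: it protects a two-word name whose first word happens to look like a title, at the cost of leaving "title plus surname" entries untouched.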

-----Original Message-----
From: [email protected] [mailto:owner-[email protected]] On Behalf Of Sergiy Radyakin
Sent: Wednesday, December 24, 2008 11:30 AM
To: [email protected]
Subject: Re: st: data management - string function

Hi,

I just hope that this program will not manage bank accounts;
otherwise someone like Dr Mroz
(http://www.unc.edu/~mroz/index_files/vita_mroz_2007_August%5B2%5D.pdf)
will lose all his savings. The program should be very careful about
replacing combinations of letters. When there is no guarantee
that "Mr." is always spelled with a dot (as in the original data
sample in the first email in this thread), spaces should be
incorporated, but even then there is no way you can be sure that Mr is
not a last name. E.g. the common Asian last name "Ng" (e.g.
http://www.drdavidng.com/contact_us.html) would fail many naive
validators (very short, no vowels). Perhaps in some languages "Mr" is
also a first name, last name or middle name.

Also the choice of titles should probably be wider, to allow e.g. for
Dr., Prof., Col., or any combination of these (which can occur in
multiple combinations like "The life and activities of Col. Prof. Dr.
Jezdimir STUDIC"  here:
http://www.ncbi.nlm.nih.gov/pubmed/14447887?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DiscoveryPanel.Pubmed_Discovery_RA&linkpos=1&log$=relatedarticles&logdbfrom=pubmed)
Some of the titles are listed here:
http://ecs.victoria.ac.nz/Groups/AI/TitleGeneratorTitles but more
extensive lists can be found on the internet.
Careless replacing of "Master", "Marquis" or "Baron" might leave some
of the people in your list without a last name.
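
One possible way to handle stacked titles such as "Col. Prof. Dr." is to apply an anchored regexr() several times, removing one leading title per pass. This is a sketch under assumptions: the variable names (name, stripped), the three-item title list and the fixed number of passes are all made up for illustration. Ambiguous words such as "Baron" are deliberately left out of the list, so they are never stripped.

```stata
* Sketch only: variable names and the title list are assumptions.
clear
input str40 name
"Col. Prof. Dr. Jezdimir Studic"
"Dr Thomas Mroz"
"Baron Cohen"
"David Ng"
end

gen str40 stripped = trim(name)
* Each pass removes at most one leading title, and only when the title is
* followed by an optional dot and at least one space.
forvalues pass = 1/3 {
    replace stripped = regexr(stripped, "^(Col|Prof|Dr)\.? +", "")
}
list name stripped
```

Three passes are enough for this sample; a longer chain of titles would need more passes, and none of this removes the underlying ambiguity described above.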

The only way to be sure about the title is to ask for it separately
while collecting the data.

Best regards, Sergiy Radyakin
On Wed, Dec 24, 2008 at 3:37 AM, Ashim Kapoor<[email protected]> wrote:
About your 2nd query.

Step 1 : gen gender = word(var1,1)

Then do

replace gender="F" if gender=="Mrs"
replace gender="F" if gender=="Ms"
replace gender="M" if gender=="Mr"

Trouble: what if you have Mr. (notice the dot) in place of Mr?

So we do

replace gender="F" if gender=="Mrs."
replace gender="F" if gender=="Ms."
replace gender="M" if gender=="Mr."

I think this should do it.

Merry Xmas to you.

Ashim.
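
The steps in Ashim's message can be condensed by stripping the dot from the first word before comparing, so that "Mr" and "Mr." are handled by one rule. This is a sketch, not the method from the thread: the helper variable title and the sample rows are assumptions. Unlike the version above, gender stays empty when the first word is not a recognised title, instead of keeping that first word.

```stata
* Sketch only: assumes the names are in var1.
clear
input str30 var1
"Mrs A Jones"
"Ms. Abra Ham"
"Mr. Ko Jack"
"Miss. Wonder Full"
"Jack Kroll"
end

* First word with any dots removed, so "Mr" and "Mr." compare equal.
gen str10 title = subinstr(word(var1, 1), ".", "", .)
gen str1 gender = ""
replace gender = "M" if title == "Mr"
replace gender = "F" if inlist(title, "Mrs", "Ms", "Miss")
list var1 title gender
```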
On Wed, Dec 24, 2008 at 2:03 PM, Ashim Kapoor<[email protected]> wrote:
Hello!

I think you want to do this :--

gen j=var1

gen j2=subinstr(j,"Mrs","",1)
gen j3=subinstr(j2,"Mr","",1)
gen j4=subinstr(j3,"Ms","",1)

Note the order of j2 and j3: it is needed because Mr is a
substring of Mrs. The result would be ruined if you did it the other way round.
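
A small illustration of the ordering point, using two made-up rows. It also shows a limit of plain subinstr(): the match is not anchored to the start of the string, so a surname containing "Mr" is damaged whichever order is used. The variable names wrong and right are hypothetical.

```stata
* Sketch only: sample rows and variable names are assumptions.
clear
input str20 var1
"Mrs Anne Jones"
"Thomas Mroz"
end

* Removing "Mr" first leaves a stray "s" behind for "Mrs".
gen str20 wrong = subinstr(var1, "Mr", "", 1)
* Removing "Mrs" first and "Mr" second avoids the stray "s",
* but "Mroz" still loses its first two letters.
gen str20 right = subinstr(subinstr(var1, "Mrs", "", 1), "Mr", "", 1)
list
```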

I hope you liked it.

Thank you,
Ashim.
On Wed, Dec 24, 2008 at 1:22 PM, b. water<[email protected]> wrote:
dear all,

stata 8.2, windows xp,
i have a data management problem: have a variable (strings) of names like these:
var1
Mrs A Jones
Mrs Anne Jones
Ms Abra Ham
Mr Ko Jack
Jack Kroll
No Probs
Ms. Abra Ham
Mr. Ko Jack
. <- denotes missing
.
.
Miss. Wonder Full
Mrs Bond Trader
i want to generate a new variable with the person's title removed, so it appears like this:
var2
A Jones
Anne Jones
Abra Ham
Ko Jack
Jack Kroll
No Probs
Abra Ham
Ko Jack
. <- denotes missing
.
.
Wonder Full
Bond Trader
i tried (thinking that i would slowly truncate Mr, Mrs, Ms title by title):
gen var2=var1
replace var2=subinstr("Mr","Mr","",.) <- just as well i generate var2 as this command wiped out all the names!
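
The names disappear because the first argument of subinstr() in the command above is the literal string "Mr" rather than the variable, so the expression evaluates to an empty string for every observation. A minimal sketch of the difference, with two made-up rows; the corrected call still has the substring problems discussed in the replies.

```stata
* Sketch only: sample rows are assumptions.
clear
input str20 var1
"Mr Ko Jack"
"Jack Kroll"
end
gen str20 var2 = var1

* Literal first argument: the result is empty whatever the observation holds.
display "[" subinstr("Mr", "Mr", "", .) "]"

* Variable as first argument: each observation's own value is edited.
replace var2 = subinstr(var2, "Mr ", "", 1)
list
```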
i also want to generate another variable that will assign gender based on the title of the name in var1, i.e. if Mr or Mr. then M(ale) and if Mrs, Mrs., Ms, Ms., Miss, Miss. then F(emale). i thought generate/replace or replace/if using string functions would help but i think this requires a loop of some sort to achieve.
F
F
F
M
.
.
F
M
.
.
.
F
F

thank for advice/help.

season's greetings,
bw





*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
