



RE:st: strpos: cleaning string variables (statalist-digest V4 #4072)


From   "Allan Reese (Cefas)" <allan.reese@cefas.co.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE:st: strpos: cleaning string variables (statalist-digest V4 #4072)
Date   Tue, 22 Feb 2011 09:54:56 -0000

Emily Farchy <emily.farchy@sant.ox.ac.uk> asked:
> I would like to run a loop to transform some messy names into cleaner
> versions, e.g.
> "Drs. H. Muslim Kasim, Ak"--> "Muslim Kasim"
> "Drs. Makmur Syahputra, Sh" -->"Makmur Syah Putra"
> "H Amru Helmy Daulay Sh" --> "Amru Helmy Daulay"
------------------------------------------------

I don't see where the "loop" comes in.  These look like one-off value
replacements without any general pattern.  The -replace- command will
make the same replacement in different observations.  That's implicitly
"looping" through obs.
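
As a minimal sketch of that point, assuming the names sit in a string variable called name (a hypothetical variable name; the extra duplicate row is only there to show one command hitting several observations):

```stata
* Illustrative data: the three messy names from the question,
* with the first one repeated in a fourth observation
clear
input str40 name
"Drs. H. Muslim Kasim, Ak"
"Drs. Makmur Syahputra, Sh"
"H Amru Helmy Daulay Sh"
"Drs. H. Muslim Kasim, Ak"
end

* Each -replace- is applied to every observation satisfying the
* -if- condition, so no explicit loop over observations is needed
replace name = "Muslim Kasim"      if name == "Drs. H. Muslim Kasim, Ak"
replace name = "Makmur Syah Putra" if name == "Drs. Makmur Syahputra, Sh"
replace name = "Amru Helmy Daulay" if name == "H Amru Helmy Daulay Sh"

list name
```

The first -replace- changes observations 1 and 4 in one pass.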

Faced with a similar problem many years ago, I found a practical
solution was to read the names (in my case job titles from 19th century
documents) as a string variable and -encode- them, so each variant was
allocated a number and the original string became its value label.  This generated a list of all
the variants and we could infer using RI (real intelligence, as opposed
to AI) which were equivalent or alternative spellings.
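
A rough sketch of that workflow, again assuming a string variable called name and a new numeric variable called name_id (both hypothetical names):

```stata
* Illustrative data: one name appears twice
clear
input str40 name
"Drs. H. Muslim Kasim, Ak"
"Drs. Makmur Syahputra, Sh"
"H Amru Helmy Daulay Sh"
"Drs. H. Muslim Kasim, Ak"
end

* -encode- assigns one integer code per distinct string and stores
* the strings as a value label (named name_id by default here)
encode name, generate(name_id)

* The value label is the list of all variants, ready for inspection
label list name_id

* Frequencies per variant help to spot rare misspellings
tabulate name_id
```

The output of -label list- is the list to read through by eye when deciding which variants belong together.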

It's easier to collapse categories that are numbers rather than strings,
and the requested transformations could be achieved by
-label define lblname # "new name", modify-.
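
A sketch of the relabelling step, using the full -label define ..., modify- syntax. The variable names name and name_id and the fourth variant "H. Muslim Kasim" are made up for illustration, and the numeric codes shown assume -encode-'s alphabetical ordering, so check them with -label list- before editing real data:

```stata
* Illustrative data: three names from the question plus one
* hypothetical variant spelling of the first
clear
input str40 name
"Drs. H. Muslim Kasim, Ak"
"Drs. Makmur Syahputra, Sh"
"H Amru Helmy Daulay Sh"
"H. Muslim Kasim"
end

encode name, generate(name_id)
label list name_id          // confirm which code belongs to which string

* Change the text attached to each code; the data values stay the same
label define name_id 1 "Muslim Kasim",      modify
label define name_id 2 "Makmur Syah Putra", modify
label define name_id 3 "Amru Helmy Daulay", modify

* Collapse the variant (code 4) into the category it duplicates (code 1)
recode name_id (4 = 1)

* Optional: turn the cleaned labels back into a string variable
decode name_id, generate(clean_name)
list name clean_name
```

Relabelling alone only renames a category; it is the -recode- that actually merges two variants into one.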

The usual advice applies: -save- to a new file so the work can be undone and repeated.
Put the commands in a .do file and you have an audit trail of changes.

People's names are messy!  Very culturally dependent and context
specific.

Regards
Allan / R. Allan REESE / Mr Reese







