
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: identifying strings that differ on one or two letters


From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: identifying strings that differ on one or two letters
Date   Mon, 22 Nov 2010 05:55:25 +0000

<>

I agree with Clyde about trying -reclink-.  I've had some success cleaning up data with this program, but it helps if you can do some cleanup up front.  Your data might be too big for this to be practical, but you may be able to reduce the variation with statements like:

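****!
* normalize case and strip leading/trailing blanks before matching
replace test = lower(trim(test))
* index() returns the position of the substring (0 if absent), so any hit is true
* caution: short keys can over-match, e.g. "ford" also matches "oxford"
replace test = "Jayanth Chemicals" if index(test, "jay")
replace test = "Ford Motor Company" if index(test, "ford")
*** ... reinstate the capitalization ***
replace test = proper(test)
****!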
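
Once the names are cleaned up, a -reclink- call might look roughly like the sketch below.  This is only an illustration under assumptions that are not from the thread: that -reclink- is installed (ssc install reclink), and that you have a separate reference file, hypothetically called reference_names.dta, holding the canonical names in a string variable also named test plus a numeric identifier refid.  The variable names rowid and matchscore and the 0.9 cutoff are likewise made up for the example.

```stata
* Sketch only: reference_names.dta, refid, rowid, and matchscore are hypothetical
gen long rowid = _n                      // unique id for the master data
reclink test using reference_names, ///
    idmaster(rowid) idusing(refid) ///
    gen(matchscore) minscore(0.6)
* reclink keeps the using file's copy of the matching variable with a U prefix
* list the weaker matches first so they can be checked by hand
sort matchscore
list test Utest matchscore if matchscore < 0.9 & !missing(matchscore)
```

The remaining problem cases would still need to be fixed by hand, as Clyde notes below.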

Others in this thread have suggested some non-Stata solutions, so here's one more: try the Google Refine API.  See the first video on this page for a demo of cleaning up data by sorting similar categories in a column.  You may need to sort and import the data in chunks if this API has size limitations (I'm not sure whether it does).

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754


On Nov 20, 2010, at 12:42 PM, Clyde Schechter wrote:

> <>
> 
> As Nick Cox has said, this is difficult to automate.  But from your
> description, I think that Michael Blasnik's reclink package would probably
> get you very close to what you want, and then you could fix by hand the
> remaining problem cases.
> 
> -findit reclink-
> 
> Hope this helps.
> 
> Clyde Schechter, MA MD
> Associate Professor of Family & Social Medicine
> 
> Please note new e-mail address: [email protected]
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/





