
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: identifying strings that differ on one or two letters


From   "Dimitriy V. Masterov" <[email protected]>
To   [email protected]
Subject   Re: st: identifying strings that differ on one or two letters
Date   Sat, 20 Nov 2010 14:04:24 -0500

Dahlia,

I am not sure what you mean by Windows 2007. What does -di c(os)
c(osdtl)- or -di c(machine_type)- produce?

I think Nick is right that you may not be able to automate this in
Stata, but I think I can help you get close using a combination of
Stata and a VBA macro in Excel. Here's what I did a few months back.
In Stata, save a copy of your data and run this code:

/* This will pair each company with all the other companies (may not
work very well if you have lots of data) */
keep comp_name
clonevar comp_name2=comp_name
fillin comp_name comp_name2     // Nick's Stata Tip #17!
drop if comp_name==comp_name2 // You may not want to use this step

Then export this data to an Excel file with macros enabled (save it as .xlsm).
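
As a rough sketch of the pairing and export steps, here is a toy run on the three company names used in the example further down. The file name pairs.xlsx is only a placeholder, and -export excel- assumes Stata 12 or later; the workbook would still need to be re-saved as .xlsm from within Excel.

```stata
* Toy data: the three names from the example in this post
clear
input str20 comp_name
"Jayanthi chemicals"
"Jayanth chemicals"
"Jay chemicals"
end

* Build every ordered pair of names, then drop the self-pairs
clonevar comp_name2 = comp_name
fillin comp_name comp_name2
drop if comp_name == comp_name2
drop _fillin            // indicator created by -fillin-, not needed here

list, clean             // 3 names give 3*2 = 6 ordered pairs

* Hypothetical file name; -export excel- needs Stata 12 or later
export excel comp_name comp_name2 using "pairs.xlsx", replace
```

Note that the number of pairs grows with the square of the number of distinct names, which is why this may not work well with lots of data.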

You will need to copy the Levenshtein script from here:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Visual_Basic_for_Applications_.28no_Damerau_extension.29

On the Developer tab in Excel, click the Visual Basic button. Paste the
VBA code into a module in the editor window that opens up (use Insert >
Module if no module is open). The Developer tab may be hidden, so you
may have to google around about how to expose it. Click save. In the
Excel file, use the levenshtein formula for each pair of company names
to get a column of distances like this:

 =levenshtein(A1,B1)

For example, with your data this should look like this:

Jayanthi chemicals	Jay chemicals	5
Jayanthi chemicals	Jayanth chemicals	1
Jayanth chemicals	Jay chemicals	4
Jayanth chemicals	Jayanthi chemicals	1
Jay chemicals	Jayanth chemicals	4
Jay chemicals	Jayanthi chemicals	5
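
For anyone curious what the levenshtein formula is computing, here is a minimal Mata sketch of the plain Levenshtein distance (no Damerau extension). This is not the wikibooks VBA code, only an illustration of the same dynamic-programming idea. The function name lev() is made up for this example, and it compares strings byte by byte, so it assumes plain ASCII names.

```stata
mata:
// lev() is a hypothetical name: plain Levenshtein distance between s and t
real scalar lev(string scalar s, string scalar t)
{
    real scalar i, j, m, n, cost
    real matrix d

    m = strlen(s)
    n = strlen(t)

    // d[i+1, j+1] holds the distance between the first i characters
    // of s and the first j characters of t
    d = J(m + 1, n + 1, 0)
    for (i = 1; i <= m; i++) d[i + 1, 1] = i
    for (j = 1; j <= n; j++) d[1, j + 1] = j

    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            // substitution costs 0 if the characters match, 1 otherwise
            cost = (substr(s, i, 1) != substr(t, j, 1))
            d[i + 1, j + 1] = min((d[i, j + 1] + 1, d[i + 1, j] + 1, d[i, j] + cost))
        }
    }
    return(d[m + 1, n + 1])
}

// first pair from the table above
lev("Jayanthi chemicals", "Jay chemicals")
end
```

The last call should display 5, matching the first row of the table: turning "Jayanthi" into "Jay" takes five deletions.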

The lower the distance, the closer the company names. Get the data
back into Stata and sort by company name and distance. You will
probably have to do some more work by hand to determine where your
match threshold is, but having the pairs makes life a lot easier to
get a mapping. I must warn you that it will not work 100% of the
time, even when the distance is 1-2.
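
A sketch of the last step, getting the distances back into Stata and sorting them, might look like the following. The file name and the column header distance are placeholders, and -import excel- assumes Stata 12 or later plus a copy of the workbook saved as .xlsx with variable names in the first row. The cutoff of 2 is only a starting point for inspection, not a recommended threshold.

```stata
* Hypothetical file name; the first row is assumed to hold the headers
* comp_name, comp_name2 and distance
import excel using "pairs_with_distance.xlsx", firstrow clear

* Closest candidates first within each company name
sort comp_name distance

* Eyeball the near matches; 2 is just an example cutoff
list comp_name comp_name2 distance if distance <= 2, sepby(comp_name)
```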

DVM