Re: st: identifying strings that differ on one or two letters

From	"Dimitriy V. Masterov" <[email protected]>
To	[email protected]
Subject	Re: st: identifying strings that differ on one or two letters
Date	Sat, 20 Nov 2010 14:04:24 -0500

Dahlia,

I am not sure what you mean by Windows 2007. What does -di c(os)
c(osdtl)- or -di c(machine_type)- produce?

I think Nick is right that you may not be able to automate this in
Stata, but I may be able to get you close using a combination of
Stata and VBA in Excel. Here's what I did a few months back. In
Stata, save a copy of your data and run this code:

/* This will pair each company with all the other companies. N distinct
names give N^2 rows, so it may not work very well if you have lots of data */
keep comp_name
clonevar comp_name2=comp_name
fillin comp_name comp_name2 // Nick's Stata Tip #17!
drop if comp_name==comp_name2 // You may not want to use this step

Then export this data to an Excel file with macros enabled (save it as .xlsm).
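
One way to get the pairs out of Stata is a comma-separated file that Excel can open and then re-save as .xlsm. This is only a minimal sketch: the file name pairs.csv is an assumption, and in Stata 13 or later -export delimited- does the same job as -outsheet-.

```stata
* Sketch only: pairs.csv is a hypothetical file name
drop _fillin   // indicator variable that -fillin- adds to the data
outsheet comp_name comp_name2 using pairs.csv, comma replace
```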

You will need to copy the Levenshtein script from here:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Visual_Basic_for_Applications_.28no_Damerau_extension.29

On the Developer tab in Excel, click the Visual Basic button. In the
editor that opens, insert a module (Insert > Module) and paste the VBA
code into it. The Developer tab may be hidden, so you may have to
google around about how to expose it. Click save. In the Excel file,
use the levenshtein formula for each pair of company names to get a
column of distances like this:

=levenshtein(A1,B1).

For example, with your data this should look like this:

Jayanthi chemicals    Jay chemicals         5
Jayanthi chemicals    Jayanth chemicals     1
Jayanth chemicals     Jay chemicals         4
Jayanth chemicals     Jayanthi chemicals    1
Jay chemicals         Jayanth chemicals     4
Jay chemicals         Jayanthi chemicals    5
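
To see where the distances in the example pairs come from, here is a rough Mata sketch of the standard dynamic-programming Levenshtein calculation (insertions, deletions and substitutions each cost 1). It is not the VBA code from the wikibooks page, and the function name lev() is made up for illustration. It compares strings byte by byte, so it assumes plain ASCII names.

```stata
mata:
// Sketch only: lev() is a hypothetical name, not a built-in function
real scalar lev(string scalar s, string scalar t)
{
    real scalar m, n, i, j, cost
    real matrix d

    m = strlen(s)
    n = strlen(t)
    // d[i+1, j+1] holds the distance between the first i characters
    // of s and the first j characters of t
    d = J(m + 1, n + 1, 0)
    for (i = 1; i <= m; i++) d[i + 1, 1] = i
    for (j = 1; j <= n; j++) d[1, j + 1] = j
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            // a substitution costs 1 unless the characters already match
            cost = (substr(s, i, 1) != substr(t, j, 1))
            d[i + 1, j + 1] = min((d[i, j + 1] + 1, d[i + 1, j] + 1, d[i, j] + cost))
        }
    }
    return(d[m + 1, n + 1])
}
end
```

Typing -mata: lev("Jayanthi chemicals", "Jay chemicals")- should then display 5, since turning the first name into the second means deleting the five letters "anthi".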

The lower the distance, the closer the company names. Get the data
back into Stata and sort by company name and distance. You will
probably have to do some more work by hand to decide where your
match threshold is, but having the pairs makes it a lot easier to
build a mapping. I must warn you that it will not work 100% of the
time, even when the distance is 1-2.
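
Bringing the distances back might look something like the sketch below. It assumes you saved the sheet from Excel as a CSV file with a header row naming the three columns comp_name, comp_name2 and distance. The file name pairs_dist.csv and the cutoff of 2 are only illustrations. In Stata 13 or later -import delimited- replaces -insheet-.

```stata
* Sketch only: pairs_dist.csv and the column names are assumptions
insheet using pairs_dist.csv, comma names clear
sort comp_name distance
* List candidate matches; a cutoff of 2 is illustrative, not a recommendation
list comp_name comp_name2 distance if distance <= 2, sepby(comp_name)
```

The closest candidates for each company then appear together at the top of each group, which is the part you still have to check by hand.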

DVM