Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


st: RE: RE: RE: Re: comparing strings off by one character


From   Colin Hargreaves <[email protected]>
To   "[email protected]" <[email protected]>
Subject   st: RE: RE: RE: Re: comparing strings off by one character
Date   Thu, 13 Mar 2014 10:03:24 +1100

My two cents worth on this -

  1. You cannot assume there is only one observation per year, because companies sometimes change their reporting period and you end up with two observations in one calendar year, or none.

  2. There are many text comparison measures that treat strings as words and not just a random set of symbols and these would clearly identify your Abraxis example as being the same company.

  3. I have a similar problem with names, as people might enter a postnym one year but not bother the next, or they change an initial to the full forename, or they omit one initial, and so on. I have to admit that I have not finished work on this, but the tactic is to use the information available in other variables to create a measure of likely match. For instance, you may have data on number of employees, location, GICS code, etc. So my aim is to create a likelihood of match and then use this as a weight when running any estimations. I would welcome any comments on this approach.

Best wishes,

Colin Hargreaves
UNE/IRBFEM
Australia




________________________________________
From: [email protected] [[email protected]] On Behalf Of Brill, Robert [[email protected]]
Sent: 13 March 2014 08:36
To: '[email protected]'
Subject: st: RE: RE: Re: comparing strings off by one character

I've used -strgroup- for a similar project with very good results. The ability to select a Levenshtein threshold is very useful. However, this is a very large dataset, and -strgroup- may have difficulty with it (pairwise comparisons among roughly 100,000 names are a lot), but there are certainly options to subset the data.

Using -strgroup- and then -duplicates tag- with year and group should leave very few cases that need to be fixed by hand, especially because each company should (if I understand correctly) have only one observation for each time period.
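
For illustration, the -strgroup- then -duplicates tag- workflow might look like the sketch below. It assumes firm-year data with variables named name and year (hypothetical names), and that the user-written -strgroup- (ssc install strgroup) accepts the generate() and threshold() options as I recall them from its help file; please check -help strgroup- before relying on it. The threshold value is purely illustrative.

```stata
* Sketch only: assumes variables name and year exist (hypothetical names).
* strgroup is user-written; install once with: ssc install strgroup

* Assign the same group number to names within the chosen edit-distance threshold.
* The threshold 0.1 is an arbitrary illustrative value, not a recommendation.
strgroup name, generate(group) threshold(0.1)

* Flag groups that have more than one observation in the same year.
duplicates tag group year, generate(dup)

* Inspect the flagged cases by hand.
sort group year
list name year group if dup > 0, sepby(group)
```

Observations with dup > 0 are the ones that share both a group and a year, so they are either distinct companies that were grouped too aggressively or companies with more than one report in a year.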

Best,

Rob Brill
Child and Family Research Partnership
 The University of Texas at Austin

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
Sent: Wednesday, March 12, 2014 3:58 PM
To: [email protected]
Subject: st: RE: Re: comparing strings off by one character

Maria,

I don't know much about -strgroup-, but it looked interesting so I tried to learn more....

It looks like -strgroup- can group observations based on their Levenshtein distance (given a certain threshold) and assign each set of matches a unique number.  I wonder why you couldn't just use that number to identify companies from here on, instead of having to fix the names so that they match?

Regards,
Joe Canner
Johns Hopkins University School of Medicine





-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Maria Boutchkova
Sent: Wednesday, March 12, 2014 3:39 PM
To: statalist
Subject: st: Re: comparing strings off by one character

Dear Statalisters,

I am dealing with the following problem under the general rubric of string comparison.
I have a string variable name containing company names over 17 years.
Sometimes for some years the data entry person has made a typo and the company name is off from the way it is entered the rest of the years by one character. Initial collapsing by name results in close to 100K unique names, therefore automation is a must.
This post has been very helpful to me so far, but I am not quite there yet.
http://www.stata.com/statalist/archive/2012-03/msg01135.html

Here is what I have so far:

collapse (first) v1 v2 (max) v3, by(name)
sort name
gen name_prev = name[_n-1] if _n > 1
order name name_prev
levenshtein name name_prev, gen(levstein_prev)

After examining the results up to here, I see that whenever the different character is a number, the names are genuinely distinct and I should not correct them. Therefore I was thinking of using

gen char_off_place = indexnot(name,name_prev) if  levstein_prev == 1

and then conditioning my further commands on whether the character off is a number or a letter.

The problem is that indexnot(name,name_prev) doesn't do exactly what I want.
For example:
name is "ABRAXIS BIOSCIENCES INC"
name_prev is "ABRAXIS BIOSCIENCE INC"

in this case, indexnot(string1,string2) will return 0 because the off character (3rd "S" in string1) appears in string2.

It seems like there must be a way to get the position of the off character while observing the order of the characters in string2. Before I give up on Stata and do it in MATLAB, can anyone offer a suggestion?

(If there are cases where the first letter of the company name is off, I will deal with this easily later.)
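
One possible way to get an order-aware position is to compare the two strings character by character and keep the first position at which they differ. The following is a minimal sketch, not a tested solution. It assumes name, name_prev and levstein_prev exist as created above, and it overwrites the meaning of char_off_place from the indexnot() attempt. The variables char_off and off_is_digit are names introduced here for illustration.

```stata
* Sketch only: assumes name, name_prev and levstein_prev exist as created above.
tempvar len
gen int `len' = max(strlen(name), strlen(name_prev))
summarize `len', meanonly
local maxlen = r(max)

gen int char_off_place = .
* Loop from the last position down to the first, so that the smallest
* mismatching position is written last and is therefore the one kept.
forvalues i = `maxlen'(-1)1 {
    replace char_off_place = `i' if levstein_prev == 1 & substr(name, `i', 1) != substr(name_prev, `i', 1)
}

* Take the character at that position from the longer string
* (from name when the two strings have equal length).
gen str1 char_off = substr(cond(strlen(name) >= strlen(name_prev), name, name_prev), char_off_place, 1)

* 1 if the off character is a digit, 0 otherwise.
gen byte off_is_digit = inrange(char_off, "0", "9")
```

For the ABRAXIS example, the strings agree through position 18 and first differ at position 19 ("S" in name against a space in name_prev), so char_off_place would be 19 and char_off would be "S". When the difference is a substitution, only the character from name is examined, so a digit that appears only in name_prev would not be flagged without an extra check.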

Thank you!
Maria Boutchkova
Lecturer in Finance
University of Edinburgh
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

