
From: <Sim.Oertel@t-online.de>

To: <statalist@hsphsun2.harvard.edu>

Subject: AW: st: data problem - duplicates

Date: Fri, 13 Jun 2008 08:30:13 +0200

Thank you, Salah and Phil, for your answers and advice (sorry for my late reply).

(1) Soundex works well; however, I think the information it provides is not detailed enough for my purpose.

(2) The "edit distance" approach sounds promising. However, during my first attempts I had some problems fitting the methodology to my dataset. I will keep trying and would appreciate it if I could come back to you with further questions.

Simon

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Phil Schumm
Sent: Tuesday, 3 June 2008 16:02
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: data problem - duplicates

On Jun 3, 2008, at 7:51 AM, <Sim.Oertel@t-online.de> wrote:

> "Name" is the only variable which I can use to select duplicates. I
> know that there are ways and programs which are able to define a
> kind of "similarity-index" which holds information about how
> similar two or more variables are on the basis of counting the
> different characters between the variables.

A common way to approach this is with the concept of "edit distance," which is the minimum number of operations required to transform one string into another (also known as the Levenshtein distance). I've never implemented this in Stata myself, but a program was posted to Statalist several years ago:

http://www.stata.com/statalist/archive/2002-08/msg00436.html

-- Phil

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
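To make the "edit distance" idea from Phil's reply concrete, here is a minimal sketch in Python (not Stata, and not the program posted at the archive link above). It assumes the allowed operations are single-character insertions, deletions, and substitutions, which is the usual Levenshtein definition. The function names `levenshtein` and `similar_pairs`, the `max_distance` threshold, and the sample names are all hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similar_pairs(names, max_distance=2):
    """List pairs of names whose case-insensitive edit distance is at most
    max_distance. The threshold is an assumption to tune per dataset."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = levenshtein(names[i].lower(), names[j].lower())
            if d <= max_distance:
                pairs.append((names[i], names[j], d))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample data: candidate duplicates differ by one character.
    for left, right, dist in similar_pairs(["Meyer", "Meier", "Schmidt"], 1):
        print(left, right, dist)
```

Comparing every pair of names grows quadratically with the number of records, so for a large dataset it may be worth first grouping names by a coarse key (the Soundex code mentioned above is one option) and computing distances only within each group.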

**References**: **Re: st: data problem - duplicates** *From:* Phil Schumm <pschumm@uchicago.edu>

