
Re: st: data problem - duplicates

From	Phil Schumm <[email protected]>
To	[email protected]
Subject	Re: st: data problem - duplicates
Date	Tue, 3 Jun 2008 09:02:23 -0500

On Jun 3, 2008, at 7:51 AM, <[email protected]> wrote:

> "Name" is the only variable which I can use to select duplicates. I know that there are ways and programs which are able to define a kind of "similarity-index" which holds information about how similar two or more variables are on the basis of counting the different characters between the variables.

A common way to approach this is with the concept of "edit distance" (also known as the Levenshtein distance), which is the minimum number of single-character insertions, deletions, or substitutions required to transform one string into another. Two names with a small edit distance are likely candidates for being duplicates. I've never implemented this in Stata myself, but a program was posted to Statalist several years ago:

http://www.stata.com/statalist/archive/2002-08/msg00436.html
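To make the idea concrete, here is a minimal sketch of the edit distance calculation. It is written in Python rather than Stata and is not the program from the post linked above; the function names edit_distance and similar_pairs, the lower-casing of names, and the max_distance threshold are assumptions chosen for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, using two rows of the DP table."""
    # Before processing any character of a, turning "" into b[:j] takes j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def similar_pairs(names, max_distance=2):
    """Return (name1, name2, distance) for every pair within max_distance.

    Hypothetical helper: compares all pairs, ignoring case, so it is only
    practical for modest numbers of names.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = edit_distance(names[i].lower(), names[j].lower())
            if d <= max_distance:
                pairs.append((names[i], names[j], d))
    return pairs
```

Calling similar_pairs on the list of values in "Name" would return the candidate duplicates together with their distances. A suitable threshold depends on the data and would need to be checked by hand.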

-- Phil

