Statalist



st: data problem - duplicates


From   <[email protected]>
To   <[email protected]>
Subject   st: data problem - duplicates
Date   Tue, 3 Jun 2008 14:51:32 +0200

Dear all,

I would like to select (and later delete) duplicates from a dataset.
However, Stata cannot recognize some of the duplicates, because some
variables in my dataset are of poor data quality. The duplicates are
identified on the basis of a string variable "name".

Simplified, my dataset looks like this:

Name                 var1   var2

Peter Enterprises    1      2
PeterEnterprises     1      2
Peter!Enterprises    1      2
Geter Enterprises    1      2


"Name" is the only variable which I can use to select duplicates. I know
that there are methods and programs that can define a kind of
"similarity index", which measures how similar two or more strings are
by counting the characters that differ between them.

For my example this means that each of the four cases above has a
"similarity index" of 1, because only one letter or character has to be
changed to make them equal.
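
One common way to formalise such an index is the Levenshtein (edit)
distance: the minimum number of single-character insertions, deletions, or
substitutions needed to turn one string into another. The following is only
an illustrative sketch in Python, not Stata code; the function name
edit_distance is made up for this example, and the names are the four
spellings from the table above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] = distance between the first i-1 characters of a
    # and the first j characters of b (row i-1 of the DP table).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        prev = curr
    return prev[-1]


names = [
    "Peter Enterprises",
    "PeterEnterprises",
    "Peter!Enterprises",
    "Geter Enterprises",
]

# Distance of every spelling to the first one, used here as the reference.
reference = names[0]
distances = {name: edit_distance(reference, name) for name in names}
```

With "Peter Enterprises" as the reference, each of the other three
spellings is at distance 1. Distances between the variants themselves can
be larger: "PeterEnterprises" and "Geter Enterprises" are at distance 2.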

Does anyone have an idea how I could define such an index in Stata? My goal
is to use such an index as an additional variable, which would help me to
recheck cases that include potential duplicates.
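
As a cheaper first pass before any distance calculation, one could compare
a normalised version of each name. The sketch below is again illustrative
Python rather than Stata, and it is an assumption about what might help,
not an established recipe: match_key and the dup_flag field are
hypothetical names. It lowercases each name, drops every character that is
not a letter or digit, and flags rows whose normalised key occurs more
than once.

```python
from collections import Counter


def match_key(name: str) -> str:
    """Lowercase the name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# The four example rows from the post.
rows = [
    {"name": "Peter Enterprises", "var1": 1, "var2": 2},
    {"name": "PeterEnterprises", "var1": 1, "var2": 2},
    {"name": "Peter!Enterprises", "var1": 1, "var2": 2},
    {"name": "Geter Enterprises", "var1": 1, "var2": 2},
]

keys = [match_key(row["name"]) for row in rows]
key_counts = Counter(keys)

# dup_flag = 1 when another row shares the same normalised key.
for row, key in zip(rows, keys):
    row["dup_flag"] = int(key_counts[key] > 1)
```

On the example data this flags the first three spellings, which differ
only in spacing and punctuation, but not "Geter Enterprises", which
differs in a letter and would still need a distance-based check.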

Thanks for your suggestions and help.
Simon




