The egen function soundex() might be a good starting point. For an example, see
http://www.stata.com/statalist/archive/2002-11/msg00480.html

salah mahmud

On Tue, Jun 3, 2008 at 7:51 AM, <Sim.Oertel@t-online.de> wrote:
> Dear all,
>
> I would like to select (and later delete) duplicates from a dataset.
> However, some duplicates cannot be recognized by Stata, because some
> variables in my dataset have poor data quality. The analysis of the
> duplicates is based on a string variable "name".
>
> Simplified, my dataset looks like this:
>
> Name                 var1  var2
>
> Peter Enterprises    1     2
> PeterEnterprises     1     2
> Peter!Enterprises    1     2
> Geter Enterprises    1     2
>
>
> "Name" is the only variable which I can use to select duplicates. I know
> that there are ways and programs that can define a kind of
> "similarity index", which holds information about how similar two or more
> values are, based on counting the characters that differ between them.
>
> For my example, this means that each of the four cases above has a
> "similarity index" of 1, because only one letter or character has to be
> changed to make them equal.
>
> Does anyone have an idea how I could define such an index in Stata? My
> goal is to use such an index as an additional variable, which helps me to
> recheck cases in which potential duplicates are included.
>
> Thanks for your suggestions and help.
> Simon
>
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
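
The "similarity index" described in the question is essentially an edit
(Levenshtein) distance: the minimum number of single-character insertions,
deletions, or substitutions needed to turn one string into another. The
sketch below is only an illustration of that idea, written in Python rather
than Stata; the function name edit_distance and the choice of the first name
as the reference string are assumptions made for this example, not something
from the thread. Soundex is a different technique: a Soundex code keeps the
first letter of the name, so "Geter Enterprises" and "Peter Enterprises"
would be expected to receive different codes, whereas an edit distance
treats that case like any other one-character change.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (hypothetical helper)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


# The four names from the question; the first is used as the reference.
names = [
    "Peter Enterprises",
    "PeterEnterprises",
    "Peter!Enterprises",
    "Geter Enterprises",
]
reference = names[0]
similarity_index = {name: edit_distance(reference, name) for name in names}

for name, dist in similarity_index.items():
    print(f"{name:20s} {dist}")
```

Measured against the first name, each of the other three names has a
distance of 1, which matches the question. The distance depends on the pair
being compared, though: "PeterEnterprises" and "Geter Enterprises" are 2
edits apart, so a threshold on pairwise distances may be more useful than a
single index per observation.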

