
From: "Salah Mahmud" <salah.mahmud@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: data problem - duplicates
Date: Tue, 3 Jun 2008 08:29:17 -0500

The egen soundex function might be a good starting point. For an example, see
http://www.stata.com/statalist/archive/2002-11/msg00480.html

salah mahmud

On Tue, Jun 3, 2008 at 7:51 AM, <Sim.Oertel@t-online.de> wrote:
> Dear all,
>
> I would like to select (and later delete) duplicates from a dataset.
> However, some duplicates cannot be recognized by STATA, because some
> variables in my dataset have poor data quality. The analysis of the
> duplicates is based on a string variable "name".
>
> Simplified, my dataset looks like this:
>
> Name                 var1  var2
>
> Peter Enterprises    1     2
> PeterEnterprises     1     2
> Peter!Enterprises    1     2
> Geter Enterprises    1     2
>
>
> "Name" is the only variable which I can use to select duplicates. I know
> that there are ways and programs which are able to define a kind of
> "similarity index" which holds information about how similar two or more
> variables are on the basis of counting the different characters between the
> variables.
>
> Concerning my example, this means that each of the four cases above has a
> "similarity index" of 1, because only one letter or character has to be
> changed to make them equal.
>
> Does anyone have an idea how I could define such an index for STATA? My goal
> is to use such an index as an additional variable, which helps me to recheck
> cases in which potential duplicates are included.
>
> Thanks for your suggestions and help.
> Simon
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
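
For readers who want to see what the "similarity index" in the quoted question could look like, here is a minimal sketch that counts single-character edits (the Levenshtein distance) between each name and a reference name. It is written in Python only to illustrate the idea and is not Stata code or anything proposed in the thread. The function name `levenshtein`, the list `names`, and the choice of the first name as the reference are assumptions made for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in b so each row is as short as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


# The four names from the quoted example; the first is used as the
# reference spelling (an assumption for this illustration).
names = [
    "Peter Enterprises",
    "PeterEnterprises",
    "Peter!Enterprises",
    "Geter Enterprises",
]
reference = names[0]
distances = {name: levenshtein(reference, name) for name in names}

for name, dist in distances.items():
    print(f"{name!r}: {dist}")
```

Each of the three variants is one edit away from the reference, which matches the "similarity index of 1" described in the question. Two variants compared with each other can be further apart (for example, "PeterEnterprises" and "Geter Enterprises" differ by two edits), so the choice of reference matters when flagging potential duplicates for manual review.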


