Statalist



st: RE: Re: catching typos


From   "Martin Weiss" <martin.weiss1@gmx.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Re: catching typos
Date   Tue, 6 Oct 2009 20:57:46 +0200

<>
http://www.stata-journal.com/article.html?article=dm0039


HTH
Martin


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Matthias Wasser
Sent: Dienstag, 6. Oktober 2009 20:53
To: statalist@hsphsun2.harvard.edu
Subject: st: Re: catching typos

I'm working with a dataset of several million observations identified
by, among other things, string variables. I have a list against which
I check these to determine if they belong to a certain category. So
far, so good.

What I would like to do is catch typos, so that "Republic of Frrance"
gets caught by "Republic of France" or whatever. Simon Moore had a
similar request
(http://www.stata.com/statalist/archive/2008-08/msg00467.html); like
him, I occasionally have multiple words per string, but the kind
responses to his post assume (if I read them correctly) that there are
just a few likely substitutions, while I have a couple hundred "red
lion" equivalents and no idea of what the likely typos for them are.
The Giuliano code might work, though, even if I don't understand its
internals. Is Levenshtein distance generally considered the best way
to search for typos? What edit distance is generally considered
appropriate?
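
[For readers of the archive: Levenshtein distance is the minimum number of
single-character insertions, deletions, and substitutions needed to turn one
string into another. A minimal sketch of the standard dynamic-programming
computation, written in Python purely to illustrate the metric (not Stata
code, and not the code from the thread referenced above):]

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of a
    # and b[:j]; we fill the DP table one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

# The example from the post: one deleted "r" separates the two strings.
print(levenshtein("Republic of Frrance", "Republic of France"))  # 1
```

[A common rule of thumb is to flag pairs within distance 1-2 for short
strings and scale the threshold with string length, but the right cutoff
depends on how noisy the data are.]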

Thanks so much in advance.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


