Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Charles Vellutini <charles.vellutini@ecopa.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: Fuzzy collapse |
Date | Thu, 26 Jan 2012 14:08:00 -0800 |
Steve,

You are right: -soundex- seems to work reasonably well on our data (especially if we split the strings into words), even though the data are in French. In any case, a step forward.

Thanks,
Charles

Sent from my iPhone

On 26 Jan 2012, at 18:10, "Steve Nakoneshny" <scnakone@ucalgary.ca> wrote:

> Charles,
>
> I agree with you that -soundex- may not be appropriate given its assumptions about English background, but it may still be a reasonable option to try given the similarities of the strings you provided as examples (despite being French). It may or may not work though.
>
> Steve
>
> On 2012-01-26, at 9:58 AM, Charles Vellutini wrote:
>
>> Thanks Steve. I will try that -- although my impression is that the Stata implementation of -soundex- is based on English, right? Not sure about French keywords (my data in this particular case).
>>
>> Charles
>>
>> -----Original Message-----
>> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Steve Nakoneshny
>> Sent: Thursday, 26 January 2012 17:18
>> To: statalist@hsphsun2.harvard.edu
>> Subject: Re: st: Fuzzy collapse
>>
>> Although I don't have any experience in using it, I would think that this situation may call for -soundex-; check -help string functions- for details.
>>
>> Steve
>>
>> On 2012-01-26, at 9:03 AM, Charles Vellutini wrote:
>>
>>> Dear Statalisters,
>>>
>>> I have a string variable holding thousands of search keywords, many of them identical up to a few accents or characters. Here is a typical sample:
>>>
>>>         keyword
>>>
>>> Obs1    télécommandes
>>> Obs2    télecommandes
>>> Obs3    télécomandes
>>> Obs4    telecommandes
>>> Obs5    télécommande
>>> etc.
>>>
>>> I would like to do a "fuzzy" collapse, that is, to group observations with near-identical keywords. I could write multiple manual -replace- statements to harmonize the keywords, but given the size of the dataset and the variety of keywords, that is hardly feasible.
>>> I am aware of the -reclink- user-written command for fuzzy merging, but that is for merging two datasets, not for collapsing observations within a dataset. It is not immediately evident to me how I could use -reclink- to solve my problem, but maybe that is feasible?
>>>
>>> Any suggestion much appreciated.
>>>
>>> Thanks,
>>> Charles
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/
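
The approach discussed in this thread (split each keyword into words, compute a Soundex code per word, and group observations on the result) can be sketched outside Stata as follows. This is only a minimal illustration in Python, not the posters' actual code. It assumes the standard American Soundex rules, so the codes need not match Stata's own soundex() function in every case. Stripping accents before coding is an added assumption, and the function names strip_accents, soundex, fuzzy_key and fuzzy_groups are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Standard American Soundex digit classes (vowels, H, W and Y have no digit).
_CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def strip_accents(text):
    """Drop combining marks so that 'é' becomes 'e' (assumed preprocessing step)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def soundex(word):
    """Return the 4-character Soundex code of one word, or '' if it has no A-Z letters."""
    letters = [ch for ch in strip_accents(word).upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate repeated digits; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def fuzzy_key(keyword):
    """Code each whitespace-separated word and join the codes into one grouping key."""
    return " ".join(soundex(w) for w in keyword.split())


def fuzzy_groups(keywords):
    """Map each grouping key to the list of keywords that share it."""
    groups = defaultdict(list)
    for kw in keywords:
        groups[fuzzy_key(kw)].append(kw)
    return dict(groups)


sample = ["télécommandes", "télecommandes", "télécomandes",
          "telecommandes", "télécommande"]
groups = fuzzy_groups(sample)
for key, members in groups.items():
    print(key, members)
```

On the five sample keywords from the original question, all variants receive the same key and so fall into one group. Soundex is coarse, so unrelated words can also share a code; it is worth inspecting the groups before collapsing on the key.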