Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Patrick McNamara <patrick.mcnamara@efficiency20.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Extract a letter between numbers |
Date | Mon, 22 Nov 2010 16:21:31 -0500 |
Those both sound like good ideas. Any advice on how to execute them after install? :)

To give an idea of what I'm working with, I've listed a correct address and some examples of address problems below:

5654 N Oak St Chicago, Illinois
56e54 Oak st Chicago, Illinois
5654 North Oak Chicago Illinois
5654 No. Oak St
5654 Oak St

There may be more than one of these issues present in a single address entry.

What I'm trying to do right now is find the length of the first three words after the house number (5654), then use the longest and 2nd longest to see which has a better matching rate. But nearmrg or strgroup may work much better.

Patrick

On Mon, Nov 22, 2010 at 3:41 PM, Dimitriy V. Masterov <dvmaster@gmail.com> wrote:
> I think you may want to fuzzy merge your dirty address data and your
> clean data using nearmrg, which you can get from ssc.
>
> An alternative way would be to append your two data sets and then use
> strgroup on the variable that is the stacked version of your clean and
> dirty addresses. That will give you the closest match.
>
> Neither one will be perfect, and both may take a long time or fail if
> you have too much data. The latter approach has some operating system
> restrictions as well.
>
> DVM
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
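
The word-length heuristic described in the message (take the first three words after the house number, then try the longest and 2nd longest as match keys) can be sketched roughly as follows. This is an illustration in Python rather than Stata, not the poster's actual code; the function names `house_number` and `candidate_street_words` are hypothetical, and it assumes the first token of each address is the house number, possibly with a stray letter in it.

```python
import re

def _tokens(address):
    # Split on anything that is not a letter or digit, so "No." -> "No".
    return re.findall(r"[A-Za-z0-9]+", address)

def house_number(address):
    """Return the digits of the first token, dropping stray letters.

    Assumption: the first token is the house number (e.g. "56e54").
    """
    tokens = _tokens(address)
    return re.sub(r"\D", "", tokens[0]) if tokens else ""

def candidate_street_words(address, n_words=3):
    """Return the n_words tokens after the house number, longest first.

    sorted() is stable, so words of equal length keep their
    left-to-right order from the address.
    """
    following = _tokens(address)[1:1 + n_words]
    return sorted(following, key=len, reverse=True)

if __name__ == "__main__":
    for addr in ["5654 N Oak St Chicago, Illinois",
                 "56e54 Oak st Chicago, Illinois",
                 "5654 No. Oak St"]:
        print(house_number(addr), candidate_street_words(addr))
```

On the sample addresses, the longest word is not always the street name: for "56e54 Oak st Chicago, Illinois" the longest of the three is "Chicago" and "Oak" comes second, which may be why comparing both candidates is worthwhile.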