Re: st: Repeated names in a string variable, but some have typos. How to correct?
From: "Dimitriy V. Masterov" <[email protected]>
To: Statalist <[email protected]>
Date: Fri, 4 Apr 2014 11:26:10 -0700
Lucas,
If I understand your problem correctly, you can try something like
this using Julian Reif's strgroup:
ssc install strgroup
bys city: strgroup street_name, gen(group) threshold(0.25)
city  street_name               number~s  group
A     Rua Santos Dumont             1200      1
A     Rua Santos Dummont              30      1
A     Rua Satos Dumont                 3      1
A     Rua Bandim                      60      2
B     Rua Pedro Alvares Cabral      4000      3
B     Rua Pedro Alvaers Cabral         3      3
B     Rue Pedro Alvares Cabral         1      3
B     Av. Pedro Alvares Cabral        20      3
B     Rua other                       45      4
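Once the groups exist, one way to finish the job is to give each group
a single spelling and collapse again. This is only a sketch. It assumes
the most frequent spelling within a group is the correct one, which is
my assumption and not something strgroup tells you. canonical_name is a
made-up variable name, and the group values are typed in from the
output above so that the example runs without strgroup installed:

```stata
clear
input str1 city str24 street_name number_of_obs group
"A" "Rua Santos Dumont"        1200 1
"A" "Rua Santos Dummont"         30 1
"A" "Rua Satos Dumont"            3 1
"A" "Rua Bandim"                 60 2
"B" "Rua Pedro Alvares Cabral" 4000 3
"B" "Rua Pedro Alvaers Cabral"    3 3
"B" "Rue Pedro Alvares Cabral"    1 3
"B" "Av. Pedro Alvares Cabral"   20 3
"B" "Rua other"                  45 4
end

* Sort each city-group by frequency, so the most common spelling comes last,
* and copy that spelling to every row of the group (assumption: most common = correct)
bysort city group (number_of_obs): generate canonical_name = street_name[_N]

* Add up the counts over all spellings that share a group
collapse (sum) number_of_obs, by(city group canonical_name)
list, noobs
```

If two spellings tie on number_of_obs, the one chosen is arbitrary, so
it is worth eyeballing the small groups by hand.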
This relies on the city name having a single correct spelling. If
that's not the case, you can apply the same strategy to the city name
first. It won't work with nicknames (Frisco for San Francisco, to
give a US example).
You will want to play around with the threshold to match it to your
tolerance for different types of misclassification.
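As far as I understand the help file, strgroup matches two strings when
their Levenshtein edit distance, divided by the length of the shorter
string, is at or below the threshold, so a larger threshold merges more
spellings. To see how that plays out on your data, something along
these lines may help. It is only a sketch: the three threshold values
are arbitrary, grp* and tag* are made-up variable names, and it assumes
strgroup works under by: as in the command above.

```stata
* Install strgroup only if it is not already available
capture which strgroup
if _rc ssc install strgroup

clear
input str1 city str24 street_name number_of_obs
"A" "Rua Santos Dumont"        1200
"A" "Rua Santos Dummont"         30
"A" "Rua Satos Dumont"            3
"A" "Rua Bandim"                 60
"B" "Rua Pedro Alvares Cabral" 4000
"B" "Rua Pedro Alvaers Cabral"    3
"B" "Rue Pedro Alvares Cabral"    1
"B" "Av. Pedro Alvares Cabral"   20
"B" "Rua other"                  45
end

* Try a few thresholds (arbitrary values) and count the resulting groups
foreach t in 0.10 0.25 0.40 {
    * Turn 0.25 into 0_25 so it can be part of a variable name
    local suffix = subinstr("`t'", ".", "_", .)
    bysort city: strgroup street_name, generate(grp`suffix') threshold(`t')
    * Flag one row per city-group pair, then count the flags
    egen byte tag`suffix' = tag(city grp`suffix')
    quietly count if tag`suffix'
    display "threshold `t': " r(N) " groups"
}
```

Listing the rows whose group changes between two thresholds is a quick
way to see which kind of mistake each setting makes.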
DVM
On Fri, Apr 4, 2014 at 8:29 AM, Lucas Ferreira Mation
<[email protected]> wrote:
> statalisters,
>
> I have a large addresses database, identifying street_names, street_number
> and city, which I need to collapse by street_name and city. Because the
> street_names can have some typos for some street_numbers, when I collapse
> some streets appear duplicated within cities (see example below)
> Duplicated street_names between cities would be OK.
>
> Is there a command to do some sort of probabilistic/fuzzy string comparison
> among the rows of a string variable (similar to what reclink does but
> within the variable)?
>
> The dataset is quite large, after collapsing I get 2.3 million
> city-street_name pairs. So I need a smart way to go about it.
>
>
> *Example of the data after collapsing:
> clear
> input str1 city str24 street_name number_of_obs
> "A" "Rua Santos Dumont" 1200
> "A" "Rua Santos Dummont" 30
> "A" "Rua Satos Dumont" 3
> "A" "Rua Bandim" 60
> "B" "Rua Pedro Alvares Cabral" 4000
> "B" "Rua Pedro Alvaers Cabral" 3
> "B" "Rue Pedro Alvares Cabral" 1
> "B" "Av. Pedro Alvares Cabral" 20
> "B" "Rua other" 45
> end
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/