
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Repeated names in a string variable, but some have typos. How to correct?


From   Lucas Ferreira Mation <[email protected]>
To   statalist <[email protected]>
Subject   Re: st: Repeated names in a string variable, but some have typos. How to correct?
Date   Mon, 14 Apr 2014 10:16:01 -0300

Thank you Dimitriy,
it works. Municipal codes are fine.
The actual dataset is quite large (all street names in Brazil), so it
took 2.5 days to run on the server.
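
A possible next step, sketched under assumptions: once strgroup has created group, each cluster still needs one spelling before the collapse. The snippet below keeps the most frequent spelling within each city and group, which assumes that the most common variant is the correct one. The variable canonical_name is a hypothetical name introduced here; city, street_name, number_of_obs and group come from the example quoted below.

```stata
* Assumes group already exists, created by strgroup as in the quoted example.
* Sort each city-group cluster by frequency so the most common spelling
* ends up last, then copy that spelling to every row of the cluster.
* Ties are broken arbitrarily by the sort.
bysort city group (number_of_obs): generate canonical_name = street_name[_N]

* One row per city and corrected street name, with counts summed.
collapse (sum) number_of_obs, by(city canonical_name)
```

On the example data this should leave 1233 observations for Rua Santos Dumont in city A and 4024 for Rua Pedro Alvares Cabral in city B. The quoted grouping also puts "Av. Pedro Alvares Cabral" into the "Rua" group, so it may be worth inspecting the groups by hand before collapsing if avenues and streets are distinct.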




On Fri, Apr 4, 2014 at 3:26 PM, Dimitriy V. Masterov <[email protected]> wrote:
> Lucas,
>
> If I understood your problem, you can try something like this using
> Julian Reif's strgroup:
>
> ssc install strgroup
> bys city: strgroup street_name, gen(group) threshold(0.25)
>
>   city                street_name   number~s   group
>        A          Rua Santos Dumont       1200       1
>        A         Rua Santos Dummont         30       1
>        A           Rua Satos Dumont          3       1
>        A                 Rua Bandim         60       2
>        B   Rua Pedro Alvares Cabral       4000       3
>        B   Rua Pedro Alvaers Cabral          3       3
>        B   Rue Pedro Alvares Cabral          1       3
>        B   Av. Pedro Alvares Cabral         20       3
>        B                  Rua other         45       4
>
> This relies on the city name having a single correct spelling. If
> that's not the case, you can apply this strategy to the city name
> first. It won't work with nicknames (Frisco for San Francisco, to
> give a US example).
>
> You will want to play around with the threshold to match it to your
> tolerance for different types of misclassification.
>
> DVM
>
> On Fri, Apr 4, 2014 at 8:29 AM, Lucas Ferreira Mation
> <[email protected]> wrote:
>> statalisters,
>>
>> I have a large address database, identifying street_name, street_number
>> and city, which I need to collapse by street_name and city. Because the
>> street_names can have typos for some street_numbers, when I collapse,
>> some streets appear duplicated within cities (see example below).
>> Duplicated street_names between cities would be OK.
>>
>> Is there a command to do some sort of probabilistic/fuzzy string comparison
>> among the rows of a string variable (similar to what reclink does, but
>> within the variable)?
>>
>> The dataset is quite large: after collapsing I get 2.3 million
>> city-street_name pairs. So I need a smart way to go about it.
>>
>>
>> *Example of the data after collapsing:
>> clear
>> input str1 city str24 street_name number_of_obs
>> "A" "Rua Santos Dumont" 1200
>> "A" "Rua Santos Dummont" 30
>> "A" "Rua Satos Dumont" 3
>> "A" "Rua Bandim" 60
>> "B" "Rua Pedro Alvares Cabral" 4000
>> "B" "Rua Pedro Alvaers Cabral" 3
>> "B" "Rue Pedro Alvares Cabral" 1
>> "B" "Av. Pedro Alvares Cabral" 20
>> "B" "Rua other"  45
>> end
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

