Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at
From: "Dimitriy V. Masterov" <[email protected]>
To: Statalist <[email protected]>
Subject: Re: st: Repeated names in a string variable, but some have typos. How to correct?
Date: Fri, 4 Apr 2014 11:26:10 -0700
If I understand your problem correctly, you can try something like this
using Julian Reif's strgroup:

    ssc install strgroup
    bys city: strgroup street_name, gen(group) threshold(0.25)

    city  street_name               number~s  group
    A     Rua Santos Dumont             1200      1
    A     Rua Santos Dummont              30      1
    A     Rua Satos Dumont                 3      1
    A     Rua Bandim                      60      2
    B     Rua Pedro Alvares Cabral      4000      3
    B     Rua Pedro Alvaers Cabral         3      3
    B     Rue Pedro Alvares Cabral         1      3
    B     Av. Pedro Alvares Cabral        20      3
    B     Rua other                       45      4

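Once each row has a group id, one way to finish the job is to treat the most frequent spelling in each group as the correct one and collapse on it. This is only a sketch, not part of strgroup itself: it assumes the group variable from the call above, the number_of_obs variable from the example data, and a made-up variable name, canonical_name.

```stata
* Sketch: assumes "group" was created by the strgroup call above and that
* the count variable is number_of_obs, as in the example data.
* canonical_name is a hypothetical name chosen for illustration.

* Sort each city/group by frequency; the last row holds the most common spelling.
bysort city group (number_of_obs): gen canonical_name = street_name[_N]

* Optional check before collapsing: show the spellings that would be replaced.
list city street_name canonical_name number_of_obs if street_name != canonical_name

* Collapse to one row per city and canonical street name.
collapse (sum) number_of_obs, by(city canonical_name)
```

If two spellings in a group have the same count, which one ends up last is arbitrary, so ties deserve a manual look.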
This relies on the city name having a single correct spelling. If
that's not the case, you can apply the same strategy to the city name
first. It won't work with nicknames (Frisco for San Francisco, to
give a US example).
You will want to play around with the threshold to match it to your
tolerance for different types of misclassification.
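
One way to explore the threshold is to run strgroup at a few values and compare how many city/group combinations each produces. The sketch below assumes the same variables as above and that strgroup can be called under by, as in the command earlier in this message. The three threshold values and the group_NN and first_NN variable names are arbitrary choices for illustration.

```stata
* Sketch: compare the number of groups produced by a few thresholds.
* The threshold values and the group_NN / first_NN names are illustrative.
foreach t of numlist 0.10 0.25 0.40 {
    local suffix = round(`t' * 100)
    bysort city: strgroup street_name, gen(group_`suffix') threshold(`t')

    * Tag one row per city/group combination and count the tags.
    egen byte first_`suffix' = tag(city group_`suffix')
    quietly count if first_`suffix'
    display "threshold `t': " r(N) " city/group combinations"
}
```

A higher threshold merges more spellings (fewer groups, more false matches), and a lower one does the opposite. On a dataset this large it may be worth testing on a sample of cities first.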
On Fri, Apr 4, 2014 at 8:29 AM, Lucas Ferreira Mation
<[email protected]> wrote:
> statalisters,
> I have a large addresses database, identifying street_names, street_number
> and city, which I need to collapse by street_name and city. Because the
> street_names can have some typos for some street_numbers, when I collapse
> some streets appear duplicated within cities (see example below).
> Duplicated street_names between cities would be OK.
> Is there a command to do some sort of probabilistic/fuzzy string comparison
> among the rows of a string variable (similar to what reclink does but
> within the variable)?
> The dataset is quite large, after collapsing I get 2.3 million
> city-street_name pairs. So I need a smart way to go about it.
> *Example of the data after collapsing:
> clear
> input str1 city str24 street_name number_of_obs
> "A" "Rua Santos Dumont" 1200
> "A" "Rua Santos Dummont" 30
> "A" "Rua Satos Dumont" 3
> "A" "Rua Bandim" 60
> "B" "Rua Pedro Alvares Cabral" 4000
> "B" "Rua Pedro Alvaers Cabral" 3
> "B" "Rue Pedro Alvares Cabral" 1
> "B" "Av. Pedro Alvares Cabral" 20
> "B" "Rua other" 45
> end