Re: st: Dropping observations with similar names (same prefix)


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Dropping observations with similar names (same prefix)
Date   Mon, 7 Mar 2011 16:38:02 +0000

You can truncate by

gen city2 = substr(city, 1, length(city)-1)

and then -drop- duplicates in terms of -city2-.
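A minimal sketch of that approach, assuming the string variable is called -city- and that the current order of the data decides which observation counts as "first". The name -obsno- is just an illustrative helper, not anything from your dataset.

```stata
* Record the current order so that "first" is well defined
gen long obsno = _n

* Strip the final character, as suggested above
gen city2 = substr(city, 1, length(city) - 1)

* Within each truncated name, keep only the earliest observation
bysort city2 (obsno): keep if _n == 1

* Restore the original order and tidy up
sort obsno
drop obsno
```

Sorting on -obsno- within -city2- is what makes the kept observation the first one in the original order rather than an arbitrary one.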

However, in your example the observations that are duplicates on the
first so many characters of -city- contain differing values on other
variables so that it is difficult to see why you want to do this.

Otherwise your failure to get what you expected arises from confusion
over what -substr()- does.

substr(name,-2,-10)

just evaluates to empty: it asks for a string of length -10 characters
starting at position -2. Stata allows negative positions (counted
backwards from the end), but a negative length yields an empty string.
Thus the check

if name==substr(name,-2,-10) & name[_n-1]==substr(name,-2,-10)

is equivalent to

if name== "" & name[_n-1]== ""

which is evidently never satisfied in your data.
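You can check this interactively with -display-. The following is only a sketch, using "London A" from your example as the test string; the brackets are there just to make an empty result visible.

```stata
* Negative position, positive length: two characters counted from the end
display "[" substr("London A", -2, 2) "]"

* Negative position, negative length, as in the original attempt;
* per the explanation above this should show nothing between the brackets
display "[" substr("London A", -2, -10) "]"

* How many observations have a name equal to that result
* (assumes a string variable called name is in memory)
count if name == substr(name, -2, -10)
```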

You were probably reaching towards something like

if substr(name, 1, length(name)-2) == substr(name[_n-1], 1, length(name[_n-1])-2)

but what I suggested at the start seems simpler in spirit and in
practice. Alternatively, consider something like

reverse(substr(reverse(name), strpos(reverse(name), " ") + 1, .))

which strips off the last "word" and the preceding space.

In steps

reverse(name) reverses a name
strpos(reverse(name), " ") + 1 is the position after the last (now first) space.
substr() extracts everything after that space
reverse() reverses the reverse.
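Put together, the steps above can be sketched on a few rows of your example. The variable names -prefix-, -households- and -obsno- are my own choices for illustration.

```stata
clear
input str12 name population households
"London A"  400 34
"London B"  300 12
"London F"  600 66
"Hamburg B" 200 54
"Hamburg G" 400 59
end

* Remember the original order
gen long obsno = _n

* Strip the last "word" and the space before it
gen prefix = reverse(substr(reverse(name), strpos(reverse(name), " ") + 1, .))

* Keep the first observation for each prefix, then restore the order
bysort prefix (obsno): keep if _n == 1
sort obsno
list name prefix population households
```

On these rows that should leave London A and Hamburg B. If a name contains no space, -strpos()- returns 0 and the whole name is kept as its own prefix.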

Your type mismatch arises because the _result_ of the first equality
to be evaluated (Stata works from left to right) is numeric, which
then cannot be compared with a string.
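A small sketch of the difference, assuming a string variable -name- is in memory and using "London A" purely as an example value. -count- is used rather than -drop- so that nothing is deleted while experimenting.

```stata
* a == b == c is evaluated as (a == b) == c.
* (name == "London A") is numeric, 0 or 1, so comparing it with a string fails:
* count if name == "London A" == name[_n-1]        // type mismatch

* Legal: both sides are parenthesised, so two numeric results are compared
count if (name == "London A") == (name[_n-1] == "London A")
```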

Nick

On Mon, Mar 7, 2011 at 3:58 PM, Ben Ammar <[email protected]> wrote:

> I was wondering how to drop observations (string) that merely differ
> in the last letter?
> For example:
>
> City           Population     #household
> London A          400             34
> London B          300             12
> London F          600             66
> Hamburg B         200             54
> Hamburg G         400             59
>  ...             ...
>  ...             ...
> How can I drop those rows in which the prefix (London, Hamburg) is the same,
> so that I only keep the first mentioned one (London A, Hamburg B)?
> Currently I do have 30,000 obs making a hand collection pretty difficult.
> First I tried
> .drop if name==substr(name,-2,-10) & name[_n-1]==substr(name,-2,-10)
>
> However, 0 observations are deleted, so I think the "&" sign is the problem
> (and in addition the length of the string differs from obs to obs, which is
> probably causing some problem, too). Therefore, I tried:
> .drop if name==substr(name,-2,-10) == name[_n-1]==substr(name,-2,-10)
>
> but that resulted in a 'type mismatch'.
> Also I tried an approach like in the FAQs by creating an index for
> each suffix (" A"=1, " B"=2, " C" etc.). However, I'm not sure if this does
> necessarily exclude all possibilities of how those observations could occur.
