Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: Dropping Duplicates that Aren't Exactly Duplicates


From   "Dimitriy V. Masterov" <[email protected]>
To   [email protected]
Subject   Re: st: RE: Dropping Duplicates that Aren't Exactly Duplicates
Date   Wed, 2 Nov 2011 15:44:56 -0400

Lisa,

I think the inelegant code below will accomplish what you want. It is
untested and hinges on the violation variable being very clean. If the
latter is not the case, you may want to take a look at Google Refine.

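/* the commands below end in semicolons, so run them from a do-file
with the delimiter set accordingly */
#delimit ;

/* remove leading, trailing and multiple whitespaces & convert to
uppercase (may not be necessary, but a good habit with string
variables) */
replace violation=upper(trim(itrim(violation)));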

/* sencode is from SSC (ssc install sencode). This is not necessary,
but may speed up sorting if you have lots of data */
sencode violation, replace;

/* reshape to make finding duplicates easier */
bys id arrdate (violation): gen cause=_n;
reshape wide violation, i(id arrdate) j(cause);
egen all_violations=group(violation*), missing;
sort id arrdate all_violations;
/* duplicates drop keeps only the first occurrence in each group, which
will be the earliest arrest because of the sort */
duplicates drop id all_violations, force;

/* reshape back to your original format & drop extraneous variables */
reshape long violation, i(id arrdate) j(cause);
drop if missing(violation);
drop all_violations cause;
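
Since the code above is untested, one way to check the logic is to run the same steps on a few made-up rows first. The sketch below is only an illustration: the data are invented, arrdate is assumed to be a Stata daily date, and sencode is skipped because it is optional. It uses the default line delimiter (no semicolons), so run it on its own in a fresh do-file.

```stata
clear
* invented toy data: person 1 has two arrests with the same set of
* violations (entered with inconsistent case and spacing), plus one
* different arrest; person 2 has a single arrest
input id str9 arrdate_s str20 violation
1 "01jan2010" "dui"
1 "01jan2010" "Speeding "
1 "05jan2010" "SPEEDING"
1 "05jan2010" "DUI"
1 "09feb2010" "THEFT"
2 "03mar2010" "DUI"
end

* assumption: arrdate is stored as a daily date
gen arrdate = daily(arrdate_s, "DMY")
format arrdate %td
drop arrdate_s

* same steps as above, minus sencode
replace violation = upper(trim(itrim(violation)))
bysort id arrdate (violation): gen cause = _n
reshape wide violation, i(id arrdate) j(cause)
egen all_violations = group(violation*), missing
sort id arrdate
duplicates drop id all_violations, force
reshape long violation, i(id arrdate) j(cause)
drop if missing(violation)
drop all_violations cause

list, sepby(id arrdate)
```

If the approach does what is intended, the 05jan2010 arrest for person 1 should disappear, because it has the same set of violations as the earlier 01jan2010 arrest, while the other arrests are left alone.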