Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Lisa Chavez <lchavez@law.berkeley.edu> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: RE: Dropping Duplicates that Aren't Exactly Duplicates |
Date | Wed, 02 Nov 2011 13:17:10 -0700 |
On 11/2/2011 12:44 PM, Dimitriy V. Masterov wrote:
Lisa, I think the inelegant code below will accomplish what you want. It is untested and hinges on the violation variable being very clean. If the latter is not the case, you may want to take a look at Google Refine.

/* remove leading, trailing and multiple whitespaces & convert to uppercase (may not be necessary, but a good habit) */
replace violation=upper(trim(itrim(violation)));

/* sencode is from ssc. This is not necessary, but may speed sorting if you have lots of data */
sencode violation, replace;

/* reshape to make finding duplicates easier */
bys id arrdate (violation): gen cause=_n;
reshape wide violation, i(id arrdate) j(cause);
egen all_violations=group(violation*), missing;
sort id arrdate all_violations;

/* duplicates drop will drop all but the first occurrence, which will be the earliest arrest because of the sort */
duplicates drop id all_violations, force;

/* reshape back to your original format & drop extraneous variables */
reshape long violation, i(id arrdate) j(cause);
drop if missing(violation);
drop all_violations cause;

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
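To make the approach in the quoted code concrete, here is a minimal sketch on made-up data. The toy observations, the datestr helper variable, and the final assert are assumptions added for illustration and are not from the original thread. The sketch skips the optional sencode step so that it needs nothing from SSC, and it uses the default line delimiter rather than the semicolons of the quoted code (which presumably ran under #delimit ;).

```stata
* Hypothetical toy data: one row per violation within an arrest.
clear
input id str10 datestr str20 violation
1 "2011-01-05" "Theft"
1 "2011-01-05" " assault"
1 "2011-03-09" "ASSAULT"
1 "2011-03-09" "theft "
1 "2011-06-01" "Theft"
2 "2011-02-02" "DUI"
end

* Convert the string date to a Stata daily date so it sorts chronologically.
gen arrdate = date(datestr, "YMD")
format arrdate %td
drop datestr

* Normalize spacing and case so equal violations compare equal.
replace violation = upper(trim(itrim(violation)))

* One row per arrest, with violations in alphabetical order across columns.
bysort id arrdate (violation): gen cause = _n
reshape wide violation, i(id arrdate) j(cause)

* One group code per distinct set of violations; the missing option keeps
* arrests that have fewer violations than the widest arrest.
egen all_violations = group(violation*), missing

* Keep the earliest arrest for each person and set of violations.
sort id arrdate all_violations
duplicates drop id all_violations, force

* Return to one row per violation.
reshape long violation, i(id arrdate) j(cause)
drop if missing(violation)
drop all_violations cause

list, sepby(id arrdate)

* Person 1's 09mar2011 arrest repeats the 05jan2011 set, so it is dropped.
assert _N == 4
```

In this toy data, person 1's second arrest has the same two violations as the first once case and spacing are normalized, so only the earlier arrest survives. The later arrest with a single violation is kept because its set of violations differs.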