Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: Dropping Duplicates that Aren't Exactly Duplicates

From	"Dimitriy V. Masterov" <[email protected]>
To	[email protected]
Subject	Re: st: RE: Dropping Duplicates that Aren't Exactly Duplicates
Date	Wed, 2 Nov 2011 15:44:56 -0400

Lisa,

I think the inelegant code below will accomplish what you want. It is
untested and hinges on the violation variable being very clean. If the
latter is not the case, you may want to take a look at Google Refine.

/* the commands below end in semicolons, so switch the delimiter first */
#delimit ;

/* remove leading, trailing and multiple whitespaces & convert to
uppercase (may not be necessary, but a good habit) */
replace violation=upper(trim(itrim(violation)));

/* sencode is from SSC. This is not necessary, but may speed sorting if
you have lots of data */
sencode violation, replace;

/* reshape to make finding duplicates easier */
bys id arrdate (violation): gen cause=_n;
reshape wide violation, i(id arrdate) j(cause);
egen all_violations=group(violation*), missing;
sort id arrdate all_violations;
/* duplicates drop keeps only the first occurrence, which will be the
earliest arrest because of the sort */
duplicates drop id all_violations, force;

/* reshape back to your original format & drop extraneous variables */
reshape long violation, i(id arrdate) j(cause);
drop if missing(violation);
drop all_violations cause;
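
To see the idea on a small case, here is a minimal sketch on made-up data. The toy observations and the helper variable arrdate_s are assumptions for illustration only. The sketch skips the optional sencode step and works on the string variable directly, and it swaps duplicates drop for an explicit bysort ... keep if _n == 1 so that the "keep the earliest arrest" rule is spelled out. It uses the default line delimiter rather than semicolons.

```stata
* Hypothetical toy data: id 1 has two arrests with the same set of
* violations (entered in a different order and with messy spacing/case)
* and a third arrest with a different violation.
clear
input id str10 arrdate_s str20 violation
1 "2011-01-05" "DUI"
1 "2011-01-05" "Speeding"
1 "2011-03-10" "speeding "
1 "2011-03-10" " DUI"
1 "2011-06-01" "Theft"
2 "2011-02-02" "DUI"
end
gen arrdate = date(arrdate_s, "YMD")
format arrdate %td
drop arrdate_s

* clean the strings so that identical violations compare equal
replace violation = upper(trim(itrim(violation)))

* one row per arrest, with violations in sorted order across columns
bysort id arrdate (violation): gen cause = _n
reshape wide violation, i(id arrdate) j(cause)

* arrests with the same set of violations get the same group number;
* the missing option keeps arrests that have fewer violations than others
egen all_violations = group(violation*), missing

* within each person and violation set, keep only the earliest arrest
bysort id all_violations (arrdate): keep if _n == 1

* back to one row per violation
reshape long violation, i(id arrdate) j(cause)
drop if missing(violation)
drop all_violations cause
list, sepby(id arrdate)
```

In this toy case the 10 March arrest for id 1 is dropped, because it has the same set of violations as the 5 January arrest; the other three arrests are kept.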
