Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: Re: Identify and delete duplicate obs
From
"Joseph Coveney" <[email protected]>
To
<[email protected]>
Subject
st: Re: Identify and delete duplicate obs
Date
Fri, 27 Dec 2013 09:29:58 +0900
Rongrong Zhang wrote:
my dataset has the following structure
industrynumber naics
1000 .
1001 114
100101 114
100102 114
........
both variables are string .
the first observation has a missing value for naics. observations
sharing the same four digit (e.g. 1001) will have the same naics, I
would like to keep one observation only for the same naics
i used
bysort naics: gen dup=cond(_N==1, 0, _n)
this command will count missing naics as well, I have thousands of
missing naics records, in which case, dup is a large number.
how should I replace dup value when the observation is missing naics??
this did not work"replace dup=-1 if naics==" "
--------------------------------------------------------------------------------
Would something like that below do what you want?
I assume that you've got other variables that you want to retain, and that you're not interested in keeping one observation for each industrynumber-naics combination, but rather just as you said: one observation for each combination of four-character prefix of industry number and naics (including missing naics if need be, but avoiding missing naics when possible). I also assume that you want to carry along all other variables (not shown) in the same observation. That's what the code below does. If that's wrong, then perhaps you can clarify a bit.
By the way, your "this did not work" comment implies that naics is a string variable, but your -list- output indicates that it's numeric.
Joseph Coveney
.
. input str6 industrynumber long naics
industr~r naics
1. 1000 .
2. 1001 114
3. 100101 114
4. 100102 114
5. end
.
. generate str4 industry_nr = substr(industrynumber, 1, 4)
. bysort industry_nr (naics): keep if _n == 1
(2 observations deleted)
. drop industry_nr
. list, noobs separator(0) abbreviate(20)
+------------------------+
| industrynumber naics |
|------------------------|
| 1000 . |
| 1001 114 |
+------------------------+
.
. exit
end of do-file
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/