Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Re: Identify and delete duplicate obs

From   "Joseph Coveney" <>
To   <>
Subject   st: Re: Identify and delete duplicate obs
Date   Fri, 27 Dec 2013 09:29:58 +0900

Rongrong Zhang wrote:

my dataset has the following structure

industrynumber   naics
1000                     .
1001                      114
100101                  114
100102                   114

both variables are string .
the first observation has a missing value for naics. observations
sharing the same four digit (e.g. 1001) will have the same naics, I
would like to keep one observation only for the same naics

i used
bysort naics: gen dup=cond(_N==1, 0, _n)

this command will count missing naics as well, I have thousands of
missing naics records, in which case, dup is a large number.

how should I replace dup value when the observation is missing naics??

this did not work"replace  dup=-1 if naics==" "


Would something like that below do what you want?

I assume that you've got other variables that you want to retain, and that you're not interested in keeping one observation for each industrynumber-naics combination, but rather just as you said:  one observation for each combination of four-character prefix of industry number and naics (including missing naics if need be, but avoiding missing naics when possible).  I also assume that you want to carry along all other variables (not shown) in the same observation.  That's what the code below does.  If that's wrong, then perhaps you can clarify a bit.  

By the way, your "this did not work" comment implies that naics is a string variable, but your -list- output indicates that it's numeric.

Joseph Coveney

. input str6 industrynumber long naics

     industr~r         naics
  1. 1000                     .
  2. 1001                      114
  3. 100101                  114
  4. 100102                   114
  5. end

. generate str4 industry_nr = substr(industrynumber, 1, 4)

. bysort industry_nr (naics): keep if _n == 1
(2 observations deleted)

. drop industry_nr

. list, noobs separator(0) abbreviate(20)

  | industrynumber   naics |
  |           1000       . |
  |           1001     114 |

. exit

end of do-file

*   For searches and help try:

© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index