Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Re: Identify and delete duplicate obs
From 
 
Rongrong Zhang <[email protected]> 
To 
 
[email protected] 
Subject 
 
Re: st: Re: Identify and delete duplicate obs 
Date 
 
Fri, 27 Dec 2013 13:02:18 -0500 
thank you, Joseph!!!
in my initial post, naics is a string , it only appears numeric in the
text. second, I see how your code works.
But I also tried
duplicates tag io_nr naics, gen(dup)
drop if dup>0
On Thu, Dec 26, 2013 at 7:29 PM, Joseph Coveney <[email protected]> wrote:
> Rongrong Zhang wrote:
>
> my dataset has the following structure
>
> industrynumber   naics
> 1000                     .
> 1001                      114
> 100101                  114
> 100102                   114
> ........
>
> both variables are string .
> the first observation has a missing value for naics. observations
> sharing the same four digit (e.g. 1001) will have the same naics, I
> would like to keep one observation only for the same naics
>
> i used
> bysort naics: gen dup=cond(_N==1, 0, _n)
>
> this command will count missing naics as well, I have thousands of
> missing naics records, in which case, dup is a large number.
>
> how should I replace dup value when the observation is missing naics??
>
> this did not work"replace  dup=-1 if naics==" "
>
> --------------------------------------------------------------------------------
>
> Would something like that below do what you want?
>
> I assume that you've got other variables that you want to retain, and that you're not interested in keeping one observation for each industrynumber-naics combination, but rather just as you said:  one observation for each combination of four-character prefix of industry number and naics (including missing naics if need be, but avoiding missing naics when possible).  I also assume that you want to carry along all other variables (not shown) in the same observation.  That's what the code below does.  If that's wrong, then perhaps you can clarify a bit.
>
> By the way, your "this did not work" comment implies that naics is a string variable, but your -list- output indicates that it's numeric.
>
> Joseph Coveney
>
> .
> . input str6 industrynumber long naics
>
>      industr~r         naics
>   1. 1000                     .
>   2. 1001                      114
>   3. 100101                  114
>   4. 100102                   114
>   5. end
>
> .
> . generate str4 industry_nr = substr(industrynumber, 1, 4)
>
> . bysort industry_nr (naics): keep if _n == 1
> (2 observations deleted)
>
> . drop industry_nr
>
> . list, noobs separator(0) abbreviate(20)
>
>   +------------------------+
>   | industrynumber   naics |
>   |------------------------|
>   |           1000       . |
>   |           1001     114 |
>   +------------------------+
>
> .
> . exit
>
> end of do-file
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/