Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Nick Cox <n.j.cox@durham.ac.uk>
To: "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject: RE: st: How to drop low frequency patterns from panel data
Date: Fri, 3 Feb 2012 16:43:13 +0000

That's a good question. It's the kind of tricky detail that is fascinating to some and exasperating to others.

1. The -by()- option for -egen-'s -total()- function is undocumented, but it still works. What -total()- does is (temporarily) -sort- the data as needed, including what is needed for the -by()- option to work. -egen- as mother command is what ensures that any changes in -sort- order made by a daughter function are reversed. So, it does what you want.

2. In contrast, with -by:- it is more nearly the other way round. Calling anything under -by:- requires that the data are sorted as needed, and nothing will happen with -egen- if that isn't true. You can arrange to do the -sort-ing on the fly with e.g.

    bysort patternvar: egen IDcount = total(tag)

Also, the -sort-ing precedes the -egen- call, so the data will stay -sort-ed. That isn't a different rule, as it is still true that -egen- doesn't change the -sort- order that it receives.

So, although the values of the resulting variable should be identical, the sequence is quite different:

1. -egen- calls -total()-, which does any sorting needed (including that implied by the -by()- option); that sorting is then undone by -egen-, which quits.

2. -by:- calls -egen- if and only if the data are sorted properly, and -egen- then calls -total()-. In fact -total()- may still temporarily change the -sort- order, but as before that is reversed by -egen- if needed.

Restoring the -sort- order is something that happens off-stage with code that is part of the executable, but the heart of the answer can be seen by looking closely at the code for -total()-, which you can do by

    . viewsource _gtotal.ado

Nick
n.j.cox@durham.ac.uk

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Kim Peeters
Sent: 03 February 2012 15:51
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: How to drop low frequency patterns from panel data

Thank you for the solution. As usual, it is straightforward and logical. :-)

Another related question. My data set is sorted ID Year. When I run the following code:

    xtset ID Year
    xtpatternvar, gen(patternvar)
    egen tag = tag(ID patternvar)
    egen IDcount = total(tag), by(patternvar)
    drop if IDcount < 20

everything works fine. However, if I replace the second last line by:

    by patternvar: egen IDcount = total(tag)

I get error code five: not sorted. Why does a by-prefix result in an understandable sort-error, whereas a by-suffix works fine?

Best regards,
Kim

----- Original Message -----
From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Cc:
Sent: Friday, February 3, 2012 2:56 PM
Subject: Re: st: How to drop low frequency patterns from panel data

Your first posting in fact showed a good sense that -egen-'s -tag()- could be part of an answer.

The logic is that each distinct pattern must be tagged just once for each distinct panel; otherwise we count more occurrences than we want. So the -tag()- argument has to be

    ID patternvar

Once precisely what we want to count has been tagged with 1s, adding them up gives the frequency. It doesn't matter that we add up the 0s too, as manifestly they don't count (all puns should be considered deliberate).

Nick

On Fri, Feb 3, 2012 at 1:46 PM, Kim Peeters <kimpeeters84@yahoo.com> wrote:
> Thank you Nick!
>
> ----- Original Message -----
> From: Nick Cox <njcoxstata@gmail.com>
> To: statalist@hsphsun2.harvard.edu
> Cc:
> Sent: Friday, February 3, 2012 12:29 PM
> Subject: Re: st: How to drop low frequency patterns from panel data
>
> Sounds more like
>
> egen tag = tag(ID patternvar)
> egen IDcount = total(tag), by(patternvar)
> drop if IDcount < 20
>
> For the kind of logic here, see if desired
>
> SJ-8-4  dm0042 . . . . . . . . . . . .  Speaking Stata: Distinct observations
>         (help distinct if installed) . . . . . .  N. J. Cox and G. M. Longton
>         Q4/08   SJ 8(4):557--568
>         shows how to answer questions about distinct observations
>         from first principles; provides a convenience command
>
> On Fri, Feb 3, 2012 at 10:37 AM, Kim Peeters <kimpeeters84@yahoo.com> wrote:
>> Dear Nick,
>>
>> Thank you for your fast reply and my apologies for not mentioning that -xtpatternvar- is a user-written command. Unfortunately, the solution that you suggest does not solve my question. I admit that my question was not clear. :-)
>>
>> Observations (i.e. persons) have multiple rows (one row for every year) of data. The code that you suggest loops through the entire data set and drops the patterns that occur less than twenty times in the entire data set, regardless of the number of rows within observations. However, the solution I'm looking for should drop all persons that share the same pattern if that pattern occurs less than twenty times (i.e. if less than twenty persons have the same pattern).
>>
>> Thank you for your advice.
>>
>> Best regards,
>> Kim
>>
>> ________________________________
>> From: Nick Cox <njcoxstata@gmail.com>
>> To: statalist@hsphsun2.harvard.edu
>> Sent: Friday, February 3, 2012 10:35 AM
>> Subject: Re: st: How to drop low frequency patterns from panel data
>>
>> -xtpatternvar- is a user-written command from SSC. Please remember to
>> explain where user-written programs you refer to come from.
>>
>> bysort pattern : drop if _N < 20
>>
>> is I think what you seek.
>>
>> Nick
>>
>> On Fri, Feb 3, 2012 at 9:22 AM, Kim Peeters <kimpeeters84@yahoo.com> wrote:
>>
>>> I have an unbalanced panel data set. The yearly data spans a period of almost twenty years. However, most subjects only participated in the last years of the study, which is confirmed by the analysis of the different panel patterns using -xtdescribe-. While some patterns' frequency is >1000, other patterns only occur once. To improve the data quality, I would like to drop all patterns that occur less than twenty times.
>>>
>>> I have not been able to accomplish this. Thus far, I can only re-generate the -xtdescribe- output again.
>>> xtpatternvar, gen(pattern)
>>> egen tag = tag(ID)
>>> tabulate pattern if tag, sort
>>>
>>> Any advice on how to drop low frequency patterns from panel data?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
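For anyone who wants to watch the two sequences described in this thread, here is a minimal sketch on toy data. The data are made up for illustration and are not from the thread: -patternvar- is typed in by hand rather than produced by -xtpatternvar-, -obsno- is a helper variable added only to check the order of observations, and a threshold of 2 stands in for the 20 used in the thread.

```stata
* Toy data (hypothetical): 3 persons, 2 rows each; patternvar entered by hand
clear
input ID Year patternvar
1 2001 1
1 2002 1
2 2001 2
2 2002 2
3 2001 1
3 2003 1
end
sort ID Year
generate long obsno = _n          // remember the current order

egen tag = tag(ID patternvar)     // 1 once per person-pattern pair, else 0

* (1) by() option: any sorting happens inside -egen- and is undone
egen IDcount = total(tag), by(patternvar)
assert obsno == _n                // observations are still in ID Year order

* (2) by: prefix: refuses to run, as the data are not sorted by patternvar
capture noisily by patternvar: egen IDcount2 = total(tag)
assert _rc == 5                   // "not sorted"

* (3) bysort: sorts first, then calls -egen-; data stay sorted by patternvar
bysort patternvar: egen IDcount2 = total(tag)
assert IDcount2 == IDcount        // same values as in (1)

* Threshold 2 stands in for the 20 used in the thread
drop if IDcount < 2
assert _N == 4                    // person 2, the only one with pattern 2, is gone
```

The -assert- statements are there so that the do-file stops if any of the claims fails on your installation. If you want the original order back after step (3), -sort ID Year- restores it.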

**References**:

- **st: How to drop low frequency patterns from panel data** *From:* Kim Peeters <kimpeeters84@yahoo.com>
- **Re: st: How to drop low frequency patterns from panel data** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: How to drop low frequency patterns from panel data** *From:* Kim Peeters <kimpeeters84@yahoo.com>
- **Re: st: How to drop low frequency patterns from panel data** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: How to drop low frequency patterns from panel data** *From:* Kim Peeters <kimpeeters84@yahoo.com>
- **Re: st: How to drop low frequency patterns from panel data** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: How to drop low frequency patterns from panel data** *From:* Kim Peeters <kimpeeters84@yahoo.com>
