
st: By-groups and expressions containing _n and _N

From	Roger Newson <[email protected]>
To	[email protected]
Subject	st: By-groups and expressions containing _n and _N
Date	Tue, 05 Aug 2003 20:57:15 +0100

Fellow Statalisters (especially StataCorp):

I have a query regarding the interaction between by-groups and expressions containing _n or _N. If a Stata command has by-groups (either using -by ... :- or using a -by()- option), then it can interpret the reserved names _n and _N in expressions in if-qualifiers and weights in one of two ways. It can interpret them as the current observation number and the total number of observations, respectively, either within the current by-group or within the whole data set.

I note that some official Stata commands interpret them in the first way and that other official Stata commands interpret them in the second way. In particular, estimation commands implemented in the executable (such as -logit-, -mlogit-, -probit- and -regress-) seem to interpret them as referring to the current by-group, whereas estimation commands implemented as ado-files (eg -glm- and -logistic-) seem to interpret them as referring to the whole data set. For instance, if, in the -auto- data, you type

sysuse auto, clear
generate byte odd = mod(_n, 2)
by foreign: logit odd mpg if _n <= 10, robust
by foreign: logistic odd mpg if _n <= 10, robust

then -logit- does 2 analyses (1 for the first 10 US cars and 1 for the first 10 non-US cars), whereas -logistic- does 1 analysis for the first 10 US cars and fails with "no observations" for the non-US cars, because none of the first 10 observations in the whole data set is a non-US car.
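To make the two interpretations concrete, here is a minimal sketch (not taken from any official documentation) that marks which observations the qualifier "if _n<=10" would select under each reading. The variable names n_all, n_grp, N_grp, keep_all and keep_grp are made up for illustration, and the sketch relies on the -auto- data being sorted by foreign as shipped.

```stata
sysuse auto, clear

* Data-set-wise reading: _n counts observations over the whole data set.
generate long n_all = _n

* By-group-wise reading: under -by-, -generate- restarts _n at 1 in each
* by-group, and _N is the number of observations in the current by-group.
by foreign: generate long n_grp = _n
by foreign: generate long N_grp = _N

* Observations that "if _n<=10" would keep under each reading.
generate byte keep_all = n_all <= 10
generate byte keep_grp = n_grp <= 10

tabulate foreign keep_all
tabulate foreign keep_grp
```

If the behaviour is as described above, the first table should show observations selected only among the US cars, and the second should show 10 selected in each by-group, which matches what -logistic- and -logit- respectively appear to do.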

This, of course, is a minor feature of Stata, which does not affect most people, most of the time, although it might ideally be worth a Technical Note in -[R] by- and/or -[P] byable- and/or the corresponding on-line help. (It is not mentioned in -[U] 14.5 by varlist: construct-.) And, obviously, users can write their own programs either way. (I have done some of each in my programs so far.)

However, I couldn't help noticing that, in the transition from Stata 7 to Stata 8, the official Stata -statsby- command changed from the by-group-wise interpretation to the data-set-wise interpretation, at the price of having to dismantle the command to be run into sub-clauses before running it. Is this just a local change specific to -statsby- to reduce file-processing and increase speed? Or is it a sign of an unpublicised long-term policy change at StataCorp implying that, in future, data-set-wise interpretation will be increasingly encouraged and by-group-wise interpretation will be officially deprecated as being out-of-date? And does StataCorp have guidelines of good practice as to which mode should be used when?

Best wishes (and thanks in advance)

Roger

--
Roger Newson
Lecturer in Medical Statistics
Department of Public Health Sciences
King's College London
5th Floor, Capital House
42 Weston Street
London SE1 3QD
United Kingdom

Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
or 020 7848 6605 International +44 20 7848 6605
Email: [email protected]
Website: http://www.kcl-phs.org.uk/rogernewson

Opinions expressed are those of the author, not the institution.

