Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: Which observations ... ?


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   st: Which observations ... ?
Date   Mon, 31 Jul 2006 13:42:10 +0100

(Previous messages edited slightly.)

1. Alan H. Feiveson asked on 24 July 2006:

Does anyone know an efficient way to identify the observation at
which a particular variable is minimum or maximum (subject to `if'
and/or `in') ?

Apparently -summarize- does not return this value. I see nothing in
-egen- nor does -findit argmax- produce anything. I can program this
myself by looping through the observations, but that is not efficient. In particular one cannot guarantee that anything like

summ x
local xmax = r(max)
if x = `xmax' {
...

will work because of rounding. I also wish to avoid -preserve-,
-collapse-, etc.

[NJC note: -if x = `xmax'- is not legal code.]

2. Maarten Buis suggested:

something like:

sysuse auto, clear
tab rep78
gen cond = rep78==3 & price < .
bys cond (price): gen max = _n==_N & cond==1
sum price if cond
sum price if max==1

also see:
viewsource _gmax.ado
for similar code that you should be able to adjust
for your purpose.

3. Jeph Herrin suggested:

To avoid the rounding problem you can try

gen cond= [if and/or in statments]
bysort cond: egen xmax=max(x)

where "mycond" should probably include x>=.

4. Joseph Coveney suggested:

Couldn't you just -generate- a 0/1 indicator variable? Then just use the indicator in your Boolean expression: -if indicator_variable . . .-

Generating such a variable (1) allows for both -if- and -in-, (2) won't be affected by missing values in the target variable, and (3) doesn't appear to be liable to rounding errors regardless of whether the target variable is
single- or double-precision: one (and only one) maximum observation is
identified in each of 1000 200-observation datasets.

clear
set more off
set seed `=date("2006-07-25", "ymd")'
set matsize 10000
tempname A
tempvar a max
set obs 200
generate double `a' = . // double-precision
generate byte `max' = 0
forvalues i = 1/1000 {
quietly replace `a' = uniform()
summarize `a', meanonly
quietly replace `max' = (`a' == r(max)) if (1==1) in 1/200
summarize `max', meanonly
matrix define `A' = (nullmat(`A') \ r(sum))
}
drop _all
svmat byte `A', names(col)
assert c1 == 1
*
clear
set obs 200
generate float `a' = . // single-precision
generate byte `max' = 0
forvalues i = 1/1000 {
quietly replace `a' = uniform()
summarize `a', meanonly
quietly replace `max' = (`a' == r(max)) if (1==1) in 1/200
summarize `max', meanonly
matrix define `A' = (nullmat(`A') \ r(sum))
}
drop _all
svmat byte `A', names(col)
assert c1 == 1
exit

5. Alan then replied:

Thanks, Joseph, Jeph and Maarten for your helpful suggestions. I
incorporated these ideas into the following program which returns
scalars holding the observations with the min and max. I found the
easiest way to get those was to sort the original observation number
("ord" in the program below) along with the variable of interest. Then
after the bysorts, I resorted with respect to "ord" to get back the
original order.

program define argmax1
// finds observations with `x' = max and `x' = min
version 9.2
syntax anything [if] [in]
tokenize `anything'
args x kmin kmax
tempvar ind ord rep N
qui gen byte `ind'=0
qui replace `ind'=1 `if' `in'
qui gen int `ord'=_n
bysort `ind' (`x'): gen `rep'=_n
bysort `ind' (`x'): gen `N'=_N
summ `ord' if `ind'==1 & `rep'==1,meanonly
scalar `kmin'=r(mean)
summ `ord' if `ind'==1 & `rep'==N,meanonly
scalar `kmax'=r(mean)
sort ord
end

---------------------------------- end of thread

Alan's question can clearly be generalised in various ways. Let's be
ambitious and go for a very general question: Which observations...?
I don't guarantee to remember every detail and complication that might
arise, but it turns out that you need less code than may appear
to answer more general questions.

The basic ingredients to an answer are going to be:

1. At least an -if- condition and quite possibly an -in- condition too.
Even if we start out interested in all observations, the question
will turn into something like

show observation numbers if value == maximum

2. The observation numbers themselves. Evidently some commands will
show them (-list- being one), but otherwise we will need to
work a little harder and do something like

. gen obsno = _n

and work with that new variable. Whether the "something like" is
generating a temporary variable within a program, as Alan's program
exemplifies, or is generating a permanent variable -- at least one
that remains in memory until it is -drop-ped -- is an important detail,
but still a detail.

What complications will we need to worry about?

a. Precision problems with non-integer values. Alan flagged this in
his initial posting.

b. By default the variable generated above will be a -float- (unless
you messed with -set type-). This will not be precise enough for
very large datasets, for which a -long- will be safer.

c. Ties, i.e. more than one observation may satisfy a specified
condition.

d. String comparisons as well as numeric.

e. No doubt others that do not spring to mind, but that's enough for
one posting.

Consider first a very easy example that we know how to solve:
which observations in the auto data have repair record 3?

. sysuse auto, clear
(1978 Automobile Data)

. list rep78 if rep78 == 3

+-------+
| rep78 |
|-------|
1. | 3 |
2. | 3 |
4. | 3 |
6. | 3 |
8. | 3 |
|-------|
9. | 3 |
10. | 3 |
11. | 3 |
13. | 3 |
14. | 3 |
|-------|
16. | 3 |
19. | 3 |
25. | 3 |
26. | 3 |
27. | 3 |
|-------|
28. | 3 |
31. | 3 |
32. | 3 |
34. | 3 |
36. | 3 |
|-------|
37. | 3 |
39. | 3 |
41. | 3 |
42. | 3 |
44. | 3 |
|-------|
49. | 3 |
50. | 3 |
54. | 3 |
60. | 3 |
65. | 3 |
+-------+

shows us the observation numbers for a particular condition, but
neither compactly nor accessibly. To get a more compact display,
one approach is

. gen long obsno = _n

. levelsof obsno if rep78 == 3
1 2 4 6 8 9 10 11 13 14 16 19 25 26 27 28 31 32 34 36 37 39 41 42 44 49 50 54 60 65

In an (updated) Stata 8, use -levels- not -levelsof-. The help for -levelsof- shows that you can put the observation numbers into a local macro for further manipulation, as I did in a posting only yesterday.

If you want the -obsno- variable for this purpose, you might want it shortly for something similar, so it might as well be left in memory. But bear in mind that -obsno- will remain identical in contents to -_n- only so long as the sort order is not changed.

. assert obsno == _n

is a good way to check whether that remains true. -assert- emits no output if the assertion made is true for every observation, no news thus
being good news in this example.

This approach takes clear of complications (b) large datasets and (c) ties and can be applied to complication (d), string comparisons. You can also add whatever other -if- or -in- conditions apply.

What about complication (a), the precision problem? Consider

. su gear

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
gear_ratio | 74 3.014865 .4562871 2.19 3.89

. levelsof obsno if gear == 3.89


shows nothing (N.B.), and so fails to find the observation(s), whereas

. levelsof obsno if gear == float(3.89)
56

happens to give the right answer, but you will not always be so lucky in
other circumstances, in that what you see (3.89) might be more rounded than it should be. Coming now directly to examples more like Alan's

. su gear, meanonly

. levelsof obsno if gear == r(max)
56

gives the right answer, as does in this example

. su gear, meanonly

. di `r(max)'
3.8900001

. levelsof obsno if gear == `r(max)'
56

Note incidentally that since -levelsof- is r-class it will
overwrite the r-class results left behind by -summarize-,
so you will need to issue the commands in the right order.

I would nevertheless place more emphasis on using r(max) rather
than `r(max)' as the former gives you access to the maximum precision possible. (A similar comment applies to e-class results.)

Alan showed code similar in spirit to

summ x, meanonly
local xmax = r(max)
... if x == `xmax'

but this is like driving from Houston to Dallas when you
could have stayed in Houston. Do this instead:

summ x, meanonly
... if x == r(max)

eliminating the middle macro.

You can see the advantage of r() by harder tests, such as

. gen loggear = log(gear)

. su loggear, meanonly

. levelsof obsno if loggear == r(max)
56

Turning now to Alan's program, repeated here, various bugs
and limitations should be noted.

program define argmax1
// finds observations with `x' = max and `x' = min
version 9.2
syntax anything [if] [in]
tokenize `anything'
args x kmin kmax
tempvar ind ord rep N
qui gen byte `ind'=0
qui replace `ind'=1 `if' `in'
qui gen int `ord'=_n
bysort `ind' (`x'): gen `rep'=_n
bysort `ind' (`x'): gen `N'=_N
summ `ord' if `ind'==1 & `rep'==1,meanonly
scalar `kmin'=r(mean)
summ `ord' if `ind'==1 & `rep'==N,meanonly
scalar `kmax'=r(mean)
sort ord
end

The last line of the program proper should be

sort `ord'

and similarly the reference to

`rep'==N

should be

`rep'==`N'

If we make the counting variables -long- not -float-,
fixing another bug, and make a few other simplifications, we get

program argmax2, sort
// finds observations with `x' = max and `x' = min
version 9.2
syntax anything [if] [in]
tokenize `anything'
args x kmin kmax
tempvar ind ord rep
qui {
gen byte `ind' = 0
replace `ind' = 1 `if' `in'
gen long `ord' = _n
bysort `ind' (`x'): gen long `rep' = _n
count if `ind'
local N = r(N)
}

summ `ord' if `ind' == 1 & `rep' == 1, meanonly
scalar `kmin' = r(mean)
summ `ord' if `ind' == 1 & `rep' == `N', meanonly
scalar `kmax' = r(mean)
end

But some problems remain. The documentation here is the
comment line, but it is misleading. If there are any
missing values, they will be picked out as being "maximum".
(They could be excluded explicitly in the -if- fed to
the program.) If there are any ties for minimum or maximum, only one
of the observations containing either will be reported,
and the observation number reported may not even be
reproducible. Further, there are no checks that the
first argument fed to the program is the name of a
numeric variable, which it must be for the program to
work.

These are all soluble problems, but I won't spell out
solutions here.

The problem tackled by the program -- indeed a more
general one as ties will be documented properly -- is
thus solved by fewer lines of interactive code

gen long obsno = _n
su x, meanonly
levelsof obsno if x == r(max)
su x, meanonly
levelsof obsno if x == r(min)

and that approach is easily modified for other problems.

Nick
n.j.cox@durham.ac.uk
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index