Statalist



st: RE: AW: Data management: looking up content in observations


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: AW: Data management: looking up content in observations
Date   Mon, 2 Mar 2009 10:53:53 -0000

You could write an -egen- function for this. I am not aware of one. But
that is not the only way to attack the problem, nor the most natural
way. 

I don't think -by:- is natural here. There's more than instinct behind
that statement, as it follows from the logic of the problem. The problem
entails comparing observations in different blocks of observations,
however "blocks" are defined. That is, the day will be different, the
home team and guest team may both be different, etc. -by:-, conversely,
is for problems in which you need only work _within_ blocks. 

As Martin pointed out, it helps to be thoroughly familiar with
subscripting for this kind of problem. He didn't spell out the mundane
details of any solution, so here is one way to do it.

I fall back on the often-deprecated "loop over observations". It is not
especially elegant or fast, but it is a direct attack on the problem and
does work. There are probably more cunning solutions entailing -merge-s
of the data with itself and so forth, but I'll still do it this way. 

gen winlast = .
gen obsno = _n

qui forval i = 1/`=_N' {
	su obsno if day == day[`i'] - 1 & ///
	(hometeam == hometeam[`i'] | guestteam == hometeam[`i']), meanonly
	if r(min) < . {
		replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
	}
}
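
Before running the whole loop, it can help to carry out the lookup by
hand for a single game. The lines below are only a sketch: they assume
that -obsno- has been created as above and that the data contain at
least 6 observations; the choice of observation 6 is arbitrary.

```stata
* Which observation holds the previous day's game of the home team in obs 6?
summarize obsno if day == day[6] - 1 & ///
	(hometeam == hometeam[6] | guestteam == hometeam[6]), meanonly

* Number of matching games: 0 or 1 if each team plays at most once per day
display r(N)

* Observation number of the matching game, or . if there is none
display r(min)
```

If r(N) is ever greater than 1, some team played more than once on a
single day and the loop needs rethinking.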

Notes:

1. I am assuming here that each team plays at most once per day. That is
not explicit, but is suggested by Florian's data segment. 

2. I am assuming that the total number of games in the dataset is modest
enough to use a -float- for -obsno-. In a bigger dataset than that,
specify that -obsno- is to be a -long-. 

3. No games precede those of the first day, so the loop need not start
at 1: in Florian's data segment the first four observations are the
first day's games, and the loop could start at 5. But I'd rather leave
it at 1 and let Stata do a little unnecessary work than wire in 5 and
create a source of bugs if the data get out of -sort- order, or the code
is ported to a different dataset for which 5 is no longer the correct
number.

4. Florian hit the nail on the head in labelling this a "look up"
problem. So, we can think of it in two stages: 

* Which observation contains the details for the previous game with this
home team? 

* Did this home team win in that game? 

For observation `i', the answer to the first question is the game that
was played on the previous day and involved the present home team,
either at home or away. It is the observation satisfying this condition:

day == day[`i'] - 1 & ///
	(hometeam == hometeam[`i'] | guestteam == hometeam[`i'])

What we do is exploit what -summarize- leaves in memory. At most one
game should satisfy that condition, so that observation number will be
recorded in multiple places, as r(min), r(max), r(mean) and r(sum). It
is arbitrary which we use. 

(winner[r(min)] == hometeam[`i'])

will be 1 if the home team for this game was the winner in that game,
and 0 otherwise. 

5. However, suppose that a team didn't play on the previous day. Then
the -summarize- will return missing in r(min) and the comparison will be

(winner[.] == hometeam[`i'])

which will return 0, as -winner[.]- is evaluated as an empty string,
which will not equal any team name. That's wrong, as the answer should
be ., not 0. A similar issue arises with the first day's games. 

Thus, 

	if r(min) < . {
		replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
	}

is the more careful code needed to trap such difficulties. 

6. I assume that Florian meant

count if day == 1 & winner == "F"

but my solution does not depend on -winner- being string or numeric,
just that -winner-, -hometeam-, -guestteam- are either all numeric or
all string.
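
Putting the pieces together, here is a sketch that can be run as a
do-file on the data segment from Florian's question. Some details are
my assumptions, not anything stated in the thread: the team variables
are entered as -str1-, drawn games are entered as empty strings (the
post shows them as .), and -obsno- is created as a -long-, which is
harmless in a dataset this small.

```stata
* Rebuild the posted data segment (assumption: draws are empty strings)
clear
input day str1 hometeam str1 guestteam str1 winner
1 "A" "H" "A"
1 "C" "F" ""
1 "E" "B" "B"
1 "G" "D" "D"
2 "F" "E" ""
2 "B" "G" "G"
2 "H" "C" "C"
2 "D" "A" "D"
3 "G" "E" "E"
end

gen winlast = .
gen long obsno = _n

* For each game, look up the home team's game on the previous day
quietly forvalues i = 1/`=_N' {
	summarize obsno if day == day[`i'] - 1 & ///
		(hometeam == hometeam[`i'] | guestteam == hometeam[`i']), meanonly
	* r(min) is missing when the home team did not play the day before
	if r(min) < . {
		replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
	}
}

list day hometeam guestteam winner winlast, sepby(day) noobs
```

The listed -winlast- should agree with the bracketed column in
Florian's data segment: missing for the first day's games, and 0 or 1
thereafter.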

There is a discussion of a related technique in 

SJ-6-4  dm0025  . . . . . .  Stata tip 36: Which observations? Erratum
        . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q4/06   SJ 6(4):596                              (no commands)
        correction of example code for Stata tip 36

SJ-6-3  dm0025  . . . . . . . . . .  Stata tip 36: Which observations?
        . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q3/06   SJ 6(3):430--432                         (no commands)
        tip for identifying which observations satisfy some
        specified condition

Nick 
[email protected] 

Martin Weiss

Well, why do you want an -egen- function? Note that -egen- already has a
-count()- function, which, in combination with -by-, might just do what
you want. -help subscripting- may also be useful.

Florian Kuhn

I am trying to find out if in a league winning the previous game has an
effect on the current game. Specifically, I have 8 teams, named A to H.
I would like to construct the variable "winlast", being 1 if the current
home team won the last game and 0 otherwise.
The data is organized as follows:

Day  hometeam  guestteam  winner  (winlast)
1    A         H          A       (.)
1    C         F          .       (.)
1    E         B          B       (.)
1    G         D          D       (.)
2    F         E          .       (0)
2    B         G          G       (1)
2    H         C          C       (0)
2    D         A          D       (1)
3    G         E          E       (1)
...

That is, for each observation I would like to check whether the home
team is listed as "winner" for the previous day. I get the right digit
by (for example)

count if day == 1 & winner == F

but I have no idea of how to incorporate this into an egen command (that
is, I had a lot of ideas none of which worked).
Does someone know how to get this right?



