
From: Daniel Wilde <Daniel.Wilde@asi.org.af>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: st: Any other lists?
Date: Mon, 2 Mar 2009 12:08:01 +0000

Dear All,

I hope this email finds you well. About a week ago I asked a question on the Gini coefficient which didn't receive any replies - probably because it was about macroeconomic statistics and not Stata as such. Does anybody know of a list which is specifically aimed at people who want to ask questions about data sources and the computation of statistics?

To my mind many of the international datasets such as World Development Indicators are poorly referenced, and it is unclear what the actual original source of the data is and whether all countries used consistent definitions. For example, how did the World Bank calculate per capita income for Chad? Did they undertake a household survey, did they just get the figure from another database, or did Chad's government statistics office calculate it? I am always reluctant to just plug the data in, and it would be useful if there were some list where I could speak to people who have looked into all this in detail.

Any reply would be greatly appreciated.

Kind Regards,
Daniel Wilde

________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox [n.j.cox@durham.ac.uk]
Sent: Monday, March 02, 2009 10:53 AM
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: AW: Data management: looking up content in observations

You could write an -egen- function for this. I am not aware of one. But that is not the only way to attack the problem, nor the most natural way.

I don't think -by:- is natural here. There's more than instinct behind that statement, as it follows from the logic of the problem. The problem entails comparing observations in different blocks of observations, however "blocks" are defined. That is, the day will be different, the home team and guest team may both be different, etc. -by:-, conversely, is for problems in which you need only work _within_ blocks.
As Martin pointed out, it helps to be thoroughly familiar with subscripting for this kind of problem. He didn't spell out the mundane details of any solution, so here is one way to do it. I fall back on the often-deprecated "loop over observations". It is not especially elegant or fast, but it is a direct attack on the problem and does work. There are probably more cunning solutions entailing -merge-s of the data with itself and so forth, but I'll still do it this way.

    gen winlast = .
    gen obsno = _n
    qui forval i = 1/`=_N' {
        su obsno if day == day[`i'] - 1 & ///
            (hometeam == hometeam[`i'] | guestteam == hometeam[`i']), meanonly
        if r(min) < . {
            replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
        }
    }

Notes:

1. I am assuming here that each team plays at most once per day. That is not explicit, but is suggested by Florian's data segment.

2. I am assuming that the total number of games in the dataset is modest enough to use a -float- for -obsno-. In a bigger dataset than that, specify that -obsno- is to be a -long-.

3. There are no games before the first day, so the loop need not start at 1, but I'd rather leave it at 1 and let Stata do a little unnecessary work, rather than wire in 5 and then create a source of bugs if the data get out of -sort- order, or the code is ported to a different dataset for which 5 is no longer the correct number.

4. Florian hit the nail on the head in labelling this a "look up" problem. So, we can think of it in two stages:

    * Which observation contains the details for the previous game with this home team?
    * Did this home team win in that game?

The first is, for observation `i', the game that was played on the previous day and involves the same team as the present home team, either at home or away. It is the observation for which this condition is satisfied:

    day == day[`i'] - 1 & ///
    (hometeam == hometeam[`i'] | guestteam == hometeam[`i'])

What we do is exploit what -summarize- leaves in memory.
At most one game should satisfy that condition, so that observation number will be recorded in multiple places, as r(min), r(max), r(mean) and r(sum). It is arbitrary which we use.

    (winner[r(min)] == hometeam[`i'])

will be 1 if the home team for this game was the winner in that game, and 0 otherwise.

5. However, suppose that a team didn't play on the previous day. Then the -summarize- will return missing in r(min) and the comparison will be

    (winner[.] == hometeam[`i'])

which will return 0, as -winner[.]- is evaluated as an empty string, which will not equal any team name. That's wrong, as the answer should be ., not 0. A similar issue arises with the first day's games. Thus,

    if r(min) < . {
        replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
    }

is the more careful code needed to trap such difficulties.

6. I assume that Florian meant

    count if day == 1 & winner == "F"

but my solution does not depend on -winner- being string or numeric, just that -winner-, -hometeam-, -guestteam- are either all numeric or all string.

There is a discussion of related technique in

SJ-6-4  dm0025  . . . . . . . . . .  Stata tip 36: Which observations? Erratum
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q4/06   SJ 6(4):596                                 (no commands)
        correction of example code for Stata tip 36

SJ-6-3  dm0025  . . . . . . . . . . . . . .  Stata tip 36: Which observations?
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q3/06   SJ 6(3):430--432                            (no commands)
        tip for identifying which observations satisfy some specified condition

Nick
n.j.cox@durham.ac.uk

Martin Weiss

well, why do you want an -egen- function? Note there is an -egen, count- command already, which, in combination with -by-, might just do what you want. -help subscripting- may also be useful.

Florian Kuhn

I am trying to find out if in a league winning the previous game has an effect on the current game. Specifically, I have 8 teams, named A to H.
I would like to construct the variable "winlast", being 1 if the current home team won the last game and 0 otherwise. The data is organized as follows:

    Day  hometeam  guestteam  winner  (winlast)
    1    A         H          A       (.)
    1    C         F          .       (.)
    1    E         B          B       (.)
    1    G         D          D       (.)
    2    F         E          .       (0)
    2    B         G          G       (1)
    2    H         C          C       (0)
    2    D         A          D       (1)
    3    G         E          E       (1)
    ...

That is, for each observation I would like to check whether the home team is listed as "winner" for the previous day. I get the right digit by (for example)

    count if day == 1 & winner == F

but I have no idea of how to incorporate this into an egen command (that is, I had a lot of ideas none of which worked). Does someone know how to get this right?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
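To try the quoted loop on the nine games in Florian's listing, something like the following do-file sketch could be used. It rests on two assumptions that are not stated in the thread: the team variables are strings, and a game with no winner (shown as . in the listing) is stored as an empty string. The only other change from the quoted code is that -obsno- is created as a -long-, as suggested in the notes.

```stata
* Florian's nine games, typed in by hand.
* ASSUMPTION: "" in winner stands for a game with no recorded winner.
clear
input byte day str1 hometeam str1 guestteam str1 winner
1 A H A
1 C F ""
1 E B B
1 G D D
2 F E ""
2 B G G
2 H C C
2 D A D
3 G E E
end

gen winlast = .
gen long obsno = _n

quietly forvalues i = 1/`=_N' {
    * look up the previous day's game involving this home team
    summarize obsno if day == day[`i'] - 1 & ///
        (hometeam == hometeam[`i'] | guestteam == hometeam[`i']), meanonly
    * r(min) is missing when there is no such game; leave winlast missing then
    if r(min) < . {
        replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
    }
}

list day hometeam guestteam winner winlast, sepby(day)
```

If the sketch behaves as intended, -winlast- should be missing for the four day 1 games and otherwise match the bracketed column in Florian's listing.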

**Follow-Ups**: **st: RE: Any other lists?** *From:* "Nick Cox" <n.j.cox@durham.ac.uk>


© Copyright 1996–2015 StataCorp LP