From | Markus Eberhardt <markus.eberhardt@economics.ox.ac.uk> |
To | statalist@hsphsun2.harvard.edu |
Subject | st: Modelling extremely rare events (binary) |
Date | Tue, 14 Jun 2011 09:05:37 +0100 |
Hello everybody,

I have an empirical problem where, for a very large dataset (a panel of around 20,000 panel members with over 60,000 observations), I have two binary outcome variables, A and B. The occurrence of either is extremely rare: only about 1.5% and 0.1% of observations for A and B respectively. For the time being I am treating this as a pooled panel, so I am not accounting for any fixed effects at the panel-member level. My empirical model is made up of continuous and binary variables. In the logit and probit I am estimating A and B separately, in the biprobit jointly, and for the mlogit I have four categories (0, A occurs, B occurs, both occur). Ideally the analysis would account for the jointness of the decision, as in the biprobit and mlogit approaches.

Here are my questions:

(1) DOES THIS AT ALL MAKE SENSE?
Having estimated logit, probit, bivariate probit and multinomial logit models, I am concerned about the viability of what I am doing to this data: given the minute share of actual events occurring (1s rather than 0s), is it at all possible that a logit-type model could tell me anything meaningful? So far I am getting interpretable empirical results, but it was put to me that these were entirely unreliable (or even spurious) given the extreme rarity of the event. Note that there are strong priors (from the descriptive analysis) that a certain (binary) characteristic drives the outcomes, so I imagine that a fixed effect and/or an interaction of this binary characteristic with other (continuous) RHS variables may provide an intuitive 'fit', but I am unsure whether this is empirically justified.

(2) USEFUL DIAGNOSTICS?
My diagnostics for the model(s) are hampered by the fact that it is difficult to get a handle on what constitutes a substantial deviation of the predicted from the observed outcomes.
Apart from -fitstat- type diagnostics, are there any other things I could do to choose between rival models and/or to convince myself that what I'm doing is at all meaningful in this challenging empirical case?

(3) ALTERNATIVE EMPIRICAL MODELS?
Are there any other empirical specifications that are better suited to fit this data? I tried to search for models of extremely rare events such as earthquakes, but couldn't get much out of it.

(4) PANEL ELEMENT?
Possibly a bridge too far, but would there be any option to get the panel element of the data to have a bearing on the empirics?

Thanks a lot in advance.

markus

Markus Eberhardt
ESRC Post-doctoral Research Fellow, Centre for the Study of African Economies, Department of Economics, University of Oxford
Stipendiary Lecturer, St Catherine's College, Oxford
web: http://sites.google.com/site/medevecon/home
email: markus.eberhardt@economics.ox.ac.uk
twitter: http://twitter.com/sjoh2052
mail: Centre for the Study of African Economies, Department of Economics, Manor Rd, Oxford OX1 3UQ, England
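
For a sense of scale, the shares quoted above can be turned into rough event counts. This is only a back-of-the-envelope sketch: it assumes exactly 60,000 observations (the post says "over 60,000") and treats the 1.5% and 0.1% figures as exact. The numeric coding of the four mlogit categories (0 to 3) is likewise an assumption made for illustration, not necessarily the coding used in the actual estimation.

```python
# Back-of-the-envelope event counts implied by the shares quoted in the post.
# All three constants are rounded figures taken from the text (assumptions).
N_OBS = 60_000    # "over 60,000 observations", used here as a lower bound
SHARE_A = 0.015   # about 1.5% of observations have A = 1
SHARE_B = 0.001   # about 0.1% of observations have B = 1


def expected_events(n_obs: int, share: float) -> int:
    """Approximate number of 1s for a binary outcome with the given share."""
    return round(n_obs * share)


def mlogit_category(a: int, b: int) -> int:
    """Hypothetical four-category coding: 0 = neither, 1 = A only,
    2 = B only, 3 = both A and B."""
    return a + 2 * b


events_a = expected_events(N_OBS, SHARE_A)
events_b = expected_events(N_OBS, SHARE_B)
# The "both occur" category cannot hold more observations than the rarer event.
max_both = min(events_a, events_b)

print(f"A events: {events_a}, B events: {events_b}, "
      f"upper bound on 'both': {max_both}")
```

Under these assumptions the model for A rests on roughly 900 events and the model for B on roughly 60, and the "both occur" category of the mlogit can contain at most about 60 observations.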