
st: Modelling extremely rare events (binary)

From: Markus Eberhardt
To: statalist@hsphsun2.harvard.edu
Subject: st: Modelling extremely rare events (binary)
Date: Tue, 14 Jun 2011 09:05:37 +0100

```
Hello everybody

I have an empirical problem where for a very large dataset (panel,
around 20,000 panel members with over 60,000 observations) I have two
binary outcome variables A and B. The occurrence of either is
extremely rare: only about 1.5% and 0.1% of observations for A and B
respectively. I am for the time being treating this as a pooled panel,
so not accounting for any fixed effects at the panel member level. My
empirical model is made up of continuous and binary variables. In the
logit and probit I am estimating A and B separately, for biprobit
jointly, and for mlogit I have four categories (0, A occurs, B occurs,
both occur). Ideally the analysis would account for the jointness of
the decision, as the biprobit and mlogit approaches do.

Here are my questions:
(1) DOES THIS AT ALL MAKE SENSE? Having estimated logit, probit,
bivariate probit and multinomial logit I am concerned about the
viability of what I am doing to this data: given the minute share of
actual events occurring (1s, rather than 0s) is it at all possible
that a logit-type model could tell me anything meaningful? So far I am
getting interpretable empirical results, but it was put to me that
these were entirely unreliable (or even spurious) given the extreme
rarity of the event. Note that there are strong priors (from the
descriptive analysis) that a certain characteristic (binary) drives
the outcomes, so I imagine that a fixed effect and/or an interaction
of this binary characteristic with other (continuous) RHS variables
may provide an intuitive 'fit', but I am unsure whether this is
empirically satisfied.
(2) USEFUL DIAGNOSTICS? My diagnostics for the model(s) are hampered
by the fact that it's difficult to get a handle on what constitutes a
substantial deviation of the predicted from the observed outcomes.
Apart from -fitstat- type diagnostics, are there any other things I
could do to choose between rival models and/or to convince myself that
what I'm doing is at all meaningful in this challenging empirical
case?
(3) ALTERNATIVE EMPIRICAL MODELS? Are there any other empirical
specifications that are better suited to fit this data? I tried to
search for extremely rare events such as earthquakes, but couldn't get
much out of it.
(4) PANEL ELEMENT? Possibly a bridge too far, but would there be any
option to get the panel element of the data to have a bearing on the
empirics?

markus

Markus Eberhardt
ESRC Post-doctoral Research Fellow, Centre for the Study of African
Economies, Department of Economics, University of Oxford
Stipendiary Lecturer, St Catherine's College, Oxford