Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

RE: st: Binary model with many zeros and few ones

From   Cameron McIntosh
Subject   RE: st: Binary model with many zeros and few ones
Date   Fri, 6 Jan 2012 08:49:17 -0500

Absolutely right... I would recommend:

Maalouf, M., & Trafalis, T.B. (2011). Robust weighted kernel logistic regression in imbalanced and rare events data. Computational Statistics & Data Analysis, 55(1), 168-183.

Newman, T.B. (1995). If Almost Nothing Goes Wrong, Is Almost Everything All Right? Interpreting Small Numerators. JAMA, 274(13), 1013. 

King, G., & Zeng, L. (2001a). Explaining Rare Events in International Relations. International Organization, 55(3), 693-715.

King, G., & Zeng, L. (2001b). Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137-163.

Tomz, M., King, G., & Zeng, L. (2003). ReLogit: Rare Events Logistic Regression. Journal of Statistical Software, 8(2).

Quigley, J., & Revie, M. (2011). Estimating the Probability of Rare Events: Addressing Zero Failure Data. Risk Analysis, 31(7), 1120-1132.

Quigley, J., Hardman, G., Bedford, T., & Walls, L. (2011). Merging expert and empirical data for rare event frequency estimation: Pool homogenisation for empirical Bayes models. Reliability Engineering & System Safety, 96(6), 687-695.

The Zelig package for R also implements rare events logistic regression, for those interested:

Imai, K., King, G., & Lau, O. (January 2, 2012). Everyone’s Statistical Software, Package ‘Zelig’, Version 3.5-1.

Imai, K., King, G., & Lau, O. (2007). Zelig: Everyone’s Statistical Software.

Imai, K., King, G., & Lau, O. (2008). Toward A Common Framework for Statistical Analysis and Development. Journal of Computational and Graphical Statistics, 17(4), 892-913.
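
As a rough illustration of one idea from King & Zeng's "Logistic Regression in Rare Events Data": with a dataset this large, one option is to keep all the ones plus a random subset of the zeros, fit an ordinary logit, and then adjust the intercept for the sampling ("prior correction"). The slopes need no adjustment under this kind of sampling. The sketch below is only that adjustment, in Python, and assumes the population share of ones (tau) is known. The function names are made up for this example and are not part of ReLogit or Zelig.

```python
import math


def prior_corrected_intercept(b0_sample, tau, ybar):
    """Adjust a logit intercept estimated on an outcome-selected sample.

    b0_sample : intercept from a logit fit on the subsample
    tau       : share of ones in the population (assumed known)
    ybar      : share of ones in the subsample actually used
    """
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    # Prior correction: subtract ln[((1 - tau) / tau) * (ybar / (1 - ybar))]
    return b0_sample - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


def logistic(x):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical numbers: 1.5% ones in the population (as in the question),
# and a subsample balanced to 50% ones. An intercept-only logit on the
# balanced subsample would estimate an intercept of 0.
b0 = prior_corrected_intercept(0.0, tau=0.015, ybar=0.5)
print(round(logistic(b0), 4))
```

In the intercept-only case the corrected intercept recovers the population rate, which is a quick sanity check on the formula. If the subsample has the same share of ones as the population (ybar equal to tau), the correction is zero and the ordinary logit intercept is left unchanged.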


> Date: Fri, 6 Jan 2012 11:33:36 +0000
> Subject: Re: st: Binary model with many zeros and few ones
> Zero inflation as I understand it applies to situations in which there
> is some kind of mixture of individuals who are zero for one reason and
> individuals who are zero or one for another reason. For example, many
> people never visit football matches and some may visit football
> matches but just didn't do so during some survey period. I don't
> think your description here justifies that term. Some people might
> want to describe your situation as one of rare events and you might
> want to Google "Gary King rare events logit". But that said, I would
> certainly try -logit- or -probit- first.
> Nick
> On Fri, Jan 6, 2012 at 11:15 AM, Nikolaos Kanellopoulos wrote:
> > I have a dataset of around 880 thousand observations and I want to measure as accurately as possible the relationship between certain variables and an event described by a binary variable. My dependent variable has very few ones (around 1.5% of the observations).
> >
> > My question (and I apologize in advance if this has been asked on Statalist before) is: what is the best way to analyse this “zero inflated” binary variable? Is it OK to use a simple probit or logit model? Any suggestions/references are more than welcome.