Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: Binomial regression


From   Constantine Daskalakis <C_Daskalakis@mail.jci.tju.edu>
To   statalist@hsphsun2.harvard.edu
Subject   st: Binomial regression
Date   Thu, 02 Aug 2007 14:18:30 -0400

Good day to all.

I would like to share some observations on binomial regression using
Stata (and SAS).

Essentially, the problem is that we have a binary outcome Y (0/1) and want to model it as a function of covariates (X1, X2, etc) via the generalized linear model

p = b0 + b1*X1 + b2*X2 + ...

So far, I've done such projects in Stata and I've noted that binomial regression sometimes fails to converge and/or gives ML estimates of the coefficients that yield estimated probabilities that are outside the [0,1] range (typically, for some modest number of observations).

A colleague suggested that SAS might give better results and I was skeptical at first. But I had some time in my hands (it's summer after all!), so I took a few datasets and run them in Stata (-glm-) and SAS (Proc Genmod).

The Stata (8.2) implementation I used is the default ML (Newton-Raphson) algorithm

- glm y x1 x2 ..., fam(bin) link(i) search

or sometimes

- glm y x1 x2 ..., fam(bin) link(i) search fisher(#)

where # is some number of iterations with the Fisher scoring algorithm.

The SAS (9.1) implementation I used is

proc genmod;
model y = x1 x2 ... / dist=bin link=id type3 wald itprint;
run;

[I also played with specification of starting values, but I will leave this piece out, and confine my posting to results obtained using the default starting values in both packages.]


Here's what I've found:

(1) Convergence

Stata often gets bogged down ("backed up") after a few iterations and
does not converge.

Specifying Fisher scoring for some iterations in the beginning helps.
After Newton-Raphson takes over from Fisher scoring, it occasionally does converge. Most often, I have to use Fisher scoring throughout to get convergence. But see point #3 below.

SAS does seem to often converge (on the basis of parameter vector convergence), but also warns that the "relative Hessian convergence criterion" has not been achieved and that "convergence is questionable" (indicating that the likelihood has not really converged sufficiently).

(2) Likelihood of final model

The log-likelihood of the final Stata model is often somewhat better than that of the final SAS model. This might suggest that the Stata results are "better". However, see the drawback in point #4 below.

(3) Estimated coefficients and standard errors

Naturally, when Stata and SAS give different final models, their
estimated coefficients are different.

But beware using Fisher's scoring throughout to get convergence and a final model. Sometimes, this final model will have absurdly small standard errors (with p < 0.001 for all variables). If something like this happens, it might be useful to compute standard errors using the option "OPG":

- glm y x1 x2 ..., fam(bin) link(i) search fisher(#) opg

[There are special complications when there are covariate levels that have observed probability of 0 or 1 (ie, all observations are "0s" or "1s"), but I'll leave this issue aside.]

(4) Estimated probabilities

When Stata has convergence trouble (and sometimes when it does not), it
warns that some "parameter estimates produce inadmissible mean estimates
in one or more observations."

SAS gives no such warnings.

When I compute the predicted probabilities from both the SAS and the
Stata final models (which often are not the same, as I explained above),
SAS almost always has them in the [0,1] interval but Stata has quite a
few of them outside the interval.

I most recently ran 7 different models in a dataset (9 different outcomes, each with the same 4 predictor variables). My total N was 278.

For 2 outcomes, SAS and Stata gave identical results (with all predicted probabilities in the [0,1] range).

For the remaining 5 outcomes, SAS always gave a final model (although with the warning regarding the Hessian convergence).

Stata w/ NR did not converge after 50 iterations.

Stata w/ FS did converge in 4 of the 5 cases (after 6-20 iterations).

In 4 of the 5 cases, the final model of Stata had a somewhat better log-likelihood than the final model of SAS.

But:

(i) Although SAS's estimated standard errors appeared sensible, Stata's (w/ FS) were clearly too small in 4 of the 5 cases.

(ii) The final SAS models never yielded a predicted probability less than 0 or greater than 1. In contrast, Stata's models yielded probabilities outside the [0,1] interval for 5-15% of the observations.


I've done this sort of comparison with a couple of other datasets with similar results. I know it is not proof positive (as opposed to a carefully constructed simulation), but I conclude that Stata's algorithm is not as robust as SAS's and would advise the use of SAS for binomial regression problems.

Perhaps I have missed something or perhaps other people have different experiences. So, please take this with the usual caution.


Best,
Constantine



--


The documents accompanying this transmission may contain confidential
health or business information. This information is intended for the
use of the individual or entity named above. If you have received
this information in error, please notify the sender immediately and
arrange for the return or destruction of these documents.

Constantine Daskalakis, ScD
Assistant Professor,
Thomas Jefferson University, Division of Biostatistics
1015 Chestnut St., Suite M100, Philadelphia, PA 19107
Tel: 215-955-5695
Fax: 215-503-3804
Email: c_daskalakis@mail.jci.tju.edu
Webpage: http://www.jefferson.edu/clinpharm/biostatistics/



*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index