|
Note: This FAQ is for users of Stata 11
It is not relevant for more recent versions.
What are the divisors used in xtgee?
|
Title
|
|
The divisor choice in xtgee
|
|
Author
|
David M. Drukker, StataCorp
|
|
Date
|
March 2000
|
Technical Note: This is a technical FAQ. Technical FAQs
address specific issues or computational aspects of estimators or other
commands. They are typically written for someone who already knows and
fully understands the statistics of the command—an expert in the area.
All of the materials in the Stata manual, including the Methods and Formulas
and sometimes the references, are assumed to be understood.
These FAQs often deal with issues that are not considered or adequately
addressed in the literature, and as such we welcome insights from readers or
related citations that we may have missed.
Divisors in xtgee
The default divisor for computing correlations and standard errors in
xtgee is N, the
number of observations in the dataset. With this divisor, the estimates are
invariant to the scale of the dataset. Scale invariance is an important
property. No weights are equivalent to frequency weights of one. If you
multiply these weights by a scale factor, you would not want your estimates
to change.
In some other packages and in some previous versions of xtgee, the
divisor is N − p, where p is the number of covariates in the model.
This divisor is used to obtain unbiased estimates. This divisor, however,
causes the estimates not to be scale invariant. As N goes to infinity, the
difference between the two divisors goes to zero.
In xtgee, if you specify the option nmp, the divisor N −
p will be used instead of the default divisor N.
The scaling issue also affects the normalization factor for the robust VCE
when family(gaussian) is specified. Historically, xtgee used
{(npanels)/(npanels − 1)}{(N − 1)/(N − p)}, where npanels
is the number of panels in the dataset and N and p are defined as above.
This normalization factor would prevent the VCE from being invariant to the
scale of the weights. For this reason, the default normalization factor is
now (npanels)/(npanels − 1) instead of {(npanels)/(npanels − 1)}
{(N − 1)/(N − p)}. One can use the previous normalization
factor by specifying the option rgf.
The divisor used in computing the unstructured and nonstationary correlation
matrices for each element in the correlation matrix has changed to the
number of panels that have valid observations for the ti and
tj defined by that element.
Why do my xtgee results differ from the results produced
by SAS Genmod?
With balanced data, xtgee and SAS version 6.12 Genmod will produce
the same results if the correct options are specified. For models with
family(Gaussian) and link(identity), the default options for
each correlation structure produce matching results. For non-Gaussian
models, Stata sets the scale coefficient to 1 by default. In contrast, SAS
Genmod defaults to setting the scale parameter to the Pearson chi-squared.
One can match the SAS Genmod results with Stata’s xtgee by
specifying the scale(x2) option.
For unbalanced data, Stata’s xtgee and SAS version 6.12 Genmod
do not produce exactly the same results. Theoretically, GEE estimation with
family(gaussian), link(identity), and
correlation(independent) should produce the same results as simple
linear regression. Stata’s xtgee does reduce to simple OLS
under these conditions; however, SAS Genmod does not. Instead of
SAS Genmod is using
This difference indicates that SAS Genmod forces the panels to have the same
weight instead of an implicit weighting determined by length as is done in
the first formula.
For more complex correlation structures, it is not possible to back out the
specific differences between the packages. Analogous differences in
computing the scale parameter and the parameters of the correlation matrix
are probably responsible for the differences in the results obtained by the
two packages.
To investigate the behavior of Stata’s xtgee, I conducted a
Monte Carlo study with 10,000 replications. The data were generated from a
process
yit=-1+x1it+2*x2it+3*x3it+4*x4it+eit
where eit is gaussian and autoregressive with a correlation
coefficient of .3.
On each of these datasets, I ran
. xtgee y x1 x2 x3 x4, family(gaussian) link(identity) i(id) corr(ar 1)
and saved the relevant information. (The code running these simulations can
be obtained by
clicking here.)
The following table gives the mean of the estimates that I obtained for the
coefficients and the autoregression parameter over the 10,000 runs. The
means of the estimates are all within .012 of the true parameter.
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
x1 | 10000 1.011026 .9790695 -2.778255 4.613505
x2 | 10000 1.993917 .4733509 .3084104 3.762877
x3 | 10000 2.999791 .2484509 2.03008 3.949788
x4 | 10000 3.999901 .2136448 3.205784 4.835913
cons | 10000 -1.005913 .5871637 -3.055032 1.362445
rho | 10000 .2996235 .0167369 .2390387 .3540029
I then checked the p-values that resulted from a hypothesis test
against the null that coefficient was its true value. Since there were
10,000 runs, one expects there to be about 500 p-values less than or
equal to .05. The following table indicates that this is what I found.
Variable | # of P-Values <=.05
----------|--------------------
x1 | 492
x2 | 489
x3 | 519
x4 | 515
_cons | 475
The results indicate that Stata’s xtgee is performing well on
unbalanced data.
Code and instructions
To run the simulations first, obtain ar_sim.ado by
clicking here.
Make sure to save the program as ar_sim.ado. Then within Stata, use
the simulate
command as follows
. set mem 10m
. simulate ar_sim, reps(10000)
For more information, see [R] simulate. Although exact execution
times will differ according to processor and disk drive speeds, running all
10,000 replications takes a long time.
|