Title | Weighted estimation and xtgee

Author | James Hardin, StataCorp

Date | February 1997

The answer to this question is not obvious. Here’s one:

We do not allow the weights to vary within a panel because allowing it is too difficult. Moreover, in the interesting cases we do not know what it means for the weights to vary, and how one would implement varying weights depends on that meaning.

The term “weighted estimation” is too vague. Why are you weighting? Below we present some cases.

Frequency weights are the easiest to discuss because their definition is unambiguous. Frequency weights are nothing more than shorthand for saying an observation is duplicated. However, even this case is difficult to generalize to panel data.

Consider a panel with frequency weight 4. What does that mean? Does it mean that there are four independent panels, each alike? Or does it mean there is one panel and that each observation is observed four times?

If there are 2 observations in a panel, each with a different frequency weight (one weighted 2 and the other weighted 4), in what order do the 6 expanded observations occur when we fit a time-dependent correlation structure?
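To make the ordering problem concrete, here is a minimal sketch in Python. The labels `t1` and `t2` and the variable names are hypothetical; the code expands the two observations by their frequency weights and counts how many distinct time orderings the expanded panel could have.

```python
from itertools import permutations

# Hypothetical panel: two observations with frequency weights 2 and 4.
observations = [("t1", 2), ("t2", 4)]

# Expand each observation into as many copies as its frequency weight.
expanded = [label for label, fw in observations for _ in range(fw)]

# Every distinct arrangement of the copies is a candidate time order.
orderings = sorted(set(permutations(expanded)))

print(len(expanded))   # 6 expanded observations
print(len(orderings))  # 15 distinct orderings
```

Nothing in the frequency weights themselves says which of the 15 orderings a time-dependent correlation structure should use.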

As there are no easy answers to these questions and we have never seen a panel dataset reported as frequency weighted, we do not allow them.

Researchers weight data to make the variance homogeneous. This use of weighting is an alternative to transformation. That is, consider a model

y_{it}= X_{it}b + u_{it}

where

Var(u_{it}) = c/W_{it}

This model can be rewritten as

sqrt(W_{it}) y_{it}= sqrt(W_{it}) X_{it}b + sqrt(W_{it}) u_{it}

or

y*_{it}= X*_{it}b + u*_{it}

and now

Var(u*_{it}) = c
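The step behind this result can be written out as follows, assuming the weights W_{it} are known constants rather than random quantities:

```latex
\operatorname{Var}(u^{*}_{it})
  = \operatorname{Var}\bigl(\sqrt{W_{it}}\,u_{it}\bigr)
  = W_{it}\,\operatorname{Var}(u_{it})
  = W_{it}\cdot\frac{c}{W_{it}}
  = c
```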

We provided analytic weights that can handle the special case where

Var(u_{it}) = c/W_{i}

but you will have to handle other cases by variable transformation.
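As a rough illustration of what variable transformation means here, the following Python sketch uses simulated data (the sample size, coefficients, and weight range are all made up). It multiplies y and every column of X, including the constant, by sqrt(W_{it}) and checks that ordinary least squares on the transformed variables reproduces weighted least squares. It addresses only the heteroskedasticity and ignores any within-panel correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 observations with weights that differ by observation.
n_obs = 20
X = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
W = rng.uniform(0.5, 2.0, size=n_obs)      # W_it, assumed known
u = rng.normal(size=n_obs) / np.sqrt(W)    # Var(u_it) = 1 / W_it
y = X @ np.array([1.0, 2.0]) + u

# Transform y and every column of X (including the constant) by sqrt(W_it).
root_w = np.sqrt(W)
y_star = root_w * y
X_star = root_w[:, None] * X

# Ordinary least squares on the transformed variables.
b_transformed, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Weighted least squares on the original variables: solve (X'WX) b = X'Wy.
b_wls = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))

print(np.allclose(b_transformed, b_wls))
```

Because the constant is transformed along with the other regressors, the transformed model is fit without adding a new intercept.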

There are lots of ways variances could be heterogeneous in a panel, so no matter what we did, variable transformation would probably have been required.

Probability weighting, we think, is the common case. You have data on individuals, and the chance that each individual appears in your sample varies. For this case, we discuss standard errors in the robust, replication sense (see [U] 20.15 Obtaining robust variance estimates).

Consider a probability-weighted sample. On day 1, the sample is drawn and then subsequently followed. In the simple case, a weight is assigned to each individual and that weight stays constant over time. This is not too difficult to model, and xtgee allows pweights.

Now consider what happens when the weights vary over time. We must ask why they vary. There are two possible answers: (1) the underlying population remains invariant but attrition affects our sample, and (2) our sample remains whole but the underlying population changes. Both are complicated issues. We could also combine (1) and (2) into a third case, in which new members are added to our sample at a later date, generally to offset the attrition in (1).

These are hard questions, so let us just take case (2) and illustrate:

Pretend that we draw a sample of banks that we will follow over the next 6 years. Pretend that at some point the underlying distribution of banks changes; let’s say the distribution of bank size. Suppose there are just two types of banks, small ones and large ones, and that at some point 80% of the small banks disappear (merge with large ones).

We will pretend there are lots of banks and that our sample is so small relative to the population that none of the banks in our sample are affected by this.

Consider the following possibilities:

- Scenario A: We select our sample on Monday, mail our first surveys on Tuesday, and while the surveys are in the mail, all the mergers happen.
- Scenario B: The mergers occur soon after the surveys are mailed back to us.
- Scenario C: The mergers occur 5 years after the conclusion of our study.
- Scenario D: The mergers occur the day after the conclusion of our study.
- Scenario E: The mergers occur the day before the conclusion of our study.
- Scenario F: The mergers occur 3 years into our study.

The point is that the solution to each of these cases is unlikely to be plugging some number, w, into the same formula.

Adding weights to the GEE calculation of the panel-data GLM is not easy
because of the form of the equation. The update calculation for beta
in Methods and Formulas of [XT] xtgee (*Stata
Longitudinal/Panel Data Reference Manual*, p. 131) is written as

b_{j+1}= b_{j}− (Σ_{i=1}^{m}D' V^{-1}D)^{-1}(Σ_{i=1}^{m}D' V^{-1}S)

This equation is analogous to the

(X'X)^{-1}(X'Y)

calculation for linear regression. Here is the formula for the V term (also on page 131):

V = A^{1/2}R A^{1/2}

Each of the terms in V is for a single panel and is of size n_{i} x
n_{i} (and so really should be subscripted by i).
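For concreteness, here is a small Python sketch of V for one panel. It assumes an exchangeable working correlation and takes the diagonal of A as given; the function name `panel_V` and the numbers are hypothetical and are not taken from the manual.

```python
import numpy as np

def panel_V(variances, rho):
    """Return V = A^{1/2} R A^{1/2} for one panel, with exchangeable R."""
    n = len(variances)
    A_half = np.diag(np.sqrt(variances))   # A^{1/2}: diagonal, n x n
    R = np.full((n, n), rho)               # working correlation, n x n
    np.fill_diagonal(R, 1.0)
    return A_half @ R @ A_half

# Hypothetical panel with n_i = 3 observations.
V = panel_V(np.array([1.0, 4.0, 9.0]), rho=0.5)
print(V.shape)
print(V)
```

The diagonal of V reproduces the variances in A, and each off-diagonal element is rho times the product of the two standard deviations involved.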

So, the question becomes, “Where do the weights fit in the calculation of V?”

If the panels are weighted (weights are constant within panels), then the
addition of weights is clear, as we can multiply this panel calculation by a
constant, but if the weights are allowed to be subject specific, it is not
clear how they affect the calculation of **V**. Adding subject-specific
weights is a difficult problem and is unsolved as far as we know.
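
The panel-constant case can be sketched in Python. This is only an illustration under strong assumptions: an identity link with unit variance function (so D_i = X_i and V_i is the correlation matrix itself), an exchangeable working correlation held fixed at a made-up value of rho, and simulated data. The function `weighted_sums` is hypothetical. The sketch multiplies each panel's contribution to the two sums by that panel's weight and checks that weighting a panel by 2 gives the same step as including the panel twice.

```python
import numpy as np

def weighted_sums(panels, weights, b, rho):
    """Accumulate sum_i w_i D'V^{-1}D and sum_i w_i D'V^{-1}S over panels.

    Assumes an identity link and unit variance function, so D_i = X_i and
    V_i is the exchangeable correlation matrix with parameter rho.
    """
    k = len(b)
    H = np.zeros((k, k))
    g = np.zeros(k)
    for (X, y), w in zip(panels, weights):
        n = len(y)
        V = np.full((n, n), rho)
        np.fill_diagonal(V, 1.0)
        V_inv = np.linalg.inv(V)
        S = y - X @ b                      # residual vector for this panel
        H += w * (X.T @ V_inv @ X)         # weight scales the whole panel term
        g += w * (X.T @ V_inv @ S)
    return H, g

rng = np.random.default_rng(1)

# Hypothetical data: 4 panels, 3 observations each, constant plus one covariate.
panels = [
    (np.column_stack([np.ones(3), rng.normal(size=3)]), rng.normal(size=3))
    for _ in range(4)
]
b0 = np.zeros(2)

# Panel 0 carries weight 2; the other panels carry weight 1.
H_w, g_w = weighted_sums(panels, [2, 1, 1, 1], b0, rho=0.3)

# The same data with panel 0 included twice and unit weights throughout.
H_d, g_d = weighted_sums([panels[0]] + panels, [1] * 5, b0, rho=0.3)

step_w = np.linalg.solve(H_w, g_w)
step_d = np.linalg.solve(H_d, g_d)
print(np.allclose(step_w, step_d))
```

The sketch holds rho fixed, whereas an actual GEE fit re-estimates the working correlation between updates, and it says nothing about weights that vary within a panel, which is the case that remains unclear.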