Re: Re: st: importance weights

From: "Anders Alexandersson"
Subject: Re: Re: st: importance weights
Date: Wed, 19 Feb 2003 11:10:32 -0600

```
"B. Burcin Yurtoglu" <burcin.yurtoglu@u...> wrote:
> Many thanks to Anders Alexandersson for the response on iweight.
>
> I was just curious about how iweight functions.  I do not have
> to use it. But I still find it difficult to understand why and
> how the number of observations changes.

Here is Bill Gould's answer from over 5 years ago to my very similar
question at the time. Use the source, ...

Anders

---------------------------------------- BEGIN ---------------------
From: "William Gould" <wgould@stata.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: statalist: iweights and regress
Date: Fri, 30 Jan 1998 10:11:57 -0600

on iweights. Stay away from them, I say, because they will invariably surprise you.
Let me explain:

Stata provides four kinds of weights which are best described in terms of
their intended use:

    fweights, or frequency weights, or duplication weights.
        Specify these and Stata is supposed to produce the same answers
        as if you replaced each observation j with w_j copies of itself.
        These are useful when the data are stored in a compressed way.

    pweights, or sampling weights, or population weights.
        Specify these and Stata is supposed to produce the right answers
        for survey-sampled data.  w_j means that this observation is a
        random draw from a population of w_j similar observations.

    aweights, or analytic weights.
        The term "analytic" is made up by us.  There is no commonly used term
        for what these weights indicate even though the problem they handle
        arises, and is discussed, commonly.  The data are means.  You do not
        observe individual y's and X's; you observe average values of y's and
        X's with the averages being calculated over w_j observations.  Specify
        these kinds of weights and Stata is supposed to produce the correct
        answer for these kinds of data.

    iweights, or "importance" weights.
        The term "importance" is also made up by us and we intended it to be
        vague.  In retrospect, it was a poor choice because of connotation.
        Anyway, specify these kinds of weights and Stata will apply a formula
        blindly -- which formula is supposed to be documented in the Methods
        and Formulas section for each command -- and which formula may have no
        statistical validity whatsoever no matter how the data were gathered.
        Why would anyone want this?  Because we have chosen the formula for
        blind evaluation so that the result is a useful ingredient for
        subsequent calculations that produce meaningful results for meaningful
        kinds of weights.  Iweights are for programmers, not data analysts.

As an example, Bill Sribney <wsribney@stata.com>, who has programmed most of
the survey commands in Stata, uses iweights in his programs.  It is not that
iweights produce a correct sample-weighted result, it is merely that the
iweight formula produces a result that, when Bill codes further
transformations, yields the results he seeks.

Since iweights are for programmers, when we put them into Stata, we included
no extra code to "protect" the programmer from mistakes.  For instance,
iweights can be negative.  You just take the formula, plug in the negative
values, and proceed to get what you get.  We did this for two reasons:  (1)
speed and (2) flexibility.  Since we at the StataCorp offices provide no
interpretation as to the meaning of iweights, it might turn out that a clever
programmer somewhere wants to use them for an application where negative
numbers plugged into the formula do indeed yield an intermediate, useful
result.  And if not, the programmer can check the weights ahead of time to
verify that there are no negative numbers.

So what is the effect of specifying iweights with -regress-?  Think of the
regression estimator as (X'X)^(-1)X'y.  Stata actually uses a variation on
that formula -- pulling out means and then adjusting the result to be as if
they were never pulled out -- but that is merely for numerical stability.  The
intended result is to approximate as closely as possible the formula
(X'X)^(-1)X'y calculated on an infinite-precision computer.
On this infinite-precision computer, an element of (X'X) (or X'y), let's call
it X_{ik}, is

                 N
    X_{ik} =    Sum  x_ij x_kj
                j=1

For iweighted calculation, the element is calculated as

                 N
    X_{ik} =    Sum  w_j x_ij x_kj
                j=1

It does not matter whether w_j is an integer, or even whether w_j is positive.
Stata makes that calculation for each matrix element and then proceeds to
calculate the overall result as (X'X)^(-1)X'y.  (More correctly, the result
produced is as if we did that; the actual formulas are more complicated but
that complication is just for numerical stability.)  Let us also remember, no
claims are made about any statistical meaning of this calculation.  It is
merely a formula and we are describing it.

Now Anders observes that when he multiplies all his variables with his iweight
variable and then runs a regression on these iweighted variables, the result
is not the same as an iweighted regression.  That is, Anders compares

    . reg y x1 x2 [iw=myweight]   /* iweighted observations */

and

    . gen iw_y = myweight*y
    . gen iw_x1 = myweight*x1
    . gen iw_x2 = myweight*x2
    . reg iw_y iw_x1 iw_x2        /* iweighted variables */

True.  If we wanted to reproduce what iweight does in this case -- and we
might want to do that just to be sure we understand the above formulas -- here
is what we would need to type:

    . gen iw_y = sqrt(myweight)*y
    . gen iw_x1 = sqrt(myweight)*x1
    . gen iw_x2 = sqrt(myweight)*x2
    . gen cons = sqrt(myweight)
    . reg iw_y iw_x1 iw_x2 cons, nocons

There are two issues Anders forgot:

    1.  You multiply variables by the square root of the weight.
        Thus, the terms we are summing to produce (X'X) are
            (sqrt(w_j)*x_ij)*(sqrt(w_j)*x_kj) = w_j*x_ij*x_kj.

    2.  You must also multiply the intercept by the (square root of the)
        weight.  The model is really
            y_j = x_1j*b1 + x_2j*b2 + b3
        and multiplying through by v_j = sqrt(w_j) yields
            v_j*y_j = v_j*x_1j*b1 + v_j*x_2j*b2 + v_j*b3

I am unsure why Anders wants to use iweights but I assume it is for some
good purpose.  Be careful.  The above exercise is useful for understanding,
but there are two more issues:

    1.  The multiplied-through-and-regress model is not as accurate as
        the [iweight] model.  There are numerical issues because, when
        Stata estimates the [iweight] model, it does it all in deviation
        form.  The iweighted estimates are more accurate.

    2.  The multiplied-through-and-regress model's correspondence to
        iweights breaks down if any of the weights are negative.

-- Bill
wgould@stata.com
---------------------------------------- END ---------------------
```
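
The square-root recipe in the quoted message can be checked on any dataset. The do-file below is a minimal sketch, not part of the original thread: the `auto` example dataset, the arbitrary positive weight variable `w`, and the names `sw`, `s_price`, `s_mpg`, `s_weight`, `b_iw`, and `b_sq` are all assumptions made for illustration. It compares only the coefficient vectors of the two regressions; it makes no claim about standard errors or the reported number of observations.

```stata
* Illustrative sketch: dataset, weights, and variable names are assumptions.
sysuse auto, clear

* Arbitrary positive weights taking the values 1, 2, 3
generate double w = 1 + mod(_n, 3)

* (a) iweighted regression
regress price mpg weight [iweight = w]
matrix b_iw = e(b)

* (b) multiply every variable, including the constant, by sqrt(w)
generate double sw       = sqrt(w)
generate double s_price  = sw * price
generate double s_mpg    = sw * mpg
generate double s_weight = sw * weight
regress s_price s_mpg s_weight sw, noconstant
matrix b_sq = e(b)

* Largest relative difference between the two coefficient vectors;
* expected to be close to zero, up to numerical precision
display mreldif(b_iw, b_sq)
```

Positive weights are used here on purpose: as the message notes, the correspondence breaks down when any weight is negative, since the square root is then undefined.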