
From: "Anders Alexandersson" <aalex@its.msstate.edu>
To: <statalist@hsphsun2.harvard.edu>
Subject: Re: Re: st: importance weights
Date: Wed, 19 Feb 2003 11:10:32 -0600

"B. Burcin Yurtoglu" <burcin.yurtoglu@u...> wrote:

> Many thanks to Anders Alexandersson for the response on iweight.
>
> I was just curious about how the iweight functions. I do not have
> to use it. But still, I find it difficult to understand why and
> how the number of observations changes.

Here is Bill Gould's answer from over 5 years ago to my very similar
question at the time. Use the source, ...

Anders

---------------------------------------- BEGIN ---------------------
From: "William Gould" <wgould@stata.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: statalist: iweights and regress
Date: Fri, 30 Jan 1998 10:11:57 -0600

Anders Alexandersson <andersa@rocketmail.com> asked for a clarification
on iweights. Stay away from them, I say, because they will invariably
surprise you. Let me explain:

Stata provides four kinds of weights, which are best described in terms
of their intended use:

fweights, or frequency weights, or duplication weights. Specify these
and Stata is supposed to produce the same answers as if you replaced
each observation j with w_j copies of itself. These are useful when the
data are stored in a compressed way.

pweights, or sampling weights, or population weights. Specify these and
Stata is supposed to produce the right answers for survey-sampled data.
w_j means that this observation is a random draw from a population of
w_j similar observations.

aweights, or analytic weights. The term "analytic" is made up by us.
There is no commonly used term for what these weights indicate, even
though the problem they handle arises and is commonly discussed. The
data are means. You do not observe individual y's and X's; you observe
average values of y's and X's, with the averages being calculated over
w_j observations. Specify these kinds of weights and Stata is supposed
to produce the correct answer for these kinds of data.

iweights, or "importance" weights. The term "importance" is also made up
by us and we intended it to be vague.
In retrospect, it was a poor choice because of its connotation. Anyway,
specify these kinds of weights and Stata will apply a formula blindly --
which formula is supposed to be documented in the Methods and Formulas
section for each command -- and which formula may have no statistical
validity whatsoever no matter how the data were gathered.

Why would anyone want this? Because we have chosen the formula for blind
evaluation so that the result is a useful ingredient for subsequent
calculations that produce meaningful results for meaningful kinds of
weights. Iweights are for programmers, not data analysts.

As an example, Bill Sribney <wsribney@stata.com>, who has programmed
most of the survey commands in Stata, uses iweights in his programs. It
is not that iweights produce a correct sample-weighted result; it is
merely that the iweight formula produces a result that, when Bill codes
further transformations, yields the results he seeks.

Since iweights are for programmers, when we put them into Stata, we
included no extra code to "protect" the programmer from mistakes. For
instance, iweights can be negative. You just take the formula, plug in
the negative values, and proceed to get what you get. We did this for
two reasons: (1) speed and (2) flexibility. Since we at the StataCorp
offices provide no interpretation as to the meaning of iweights, it
might turn out that a clever programmer somewhere wants to use them for
an application where negative numbers plugged into the formula do indeed
yield an intermediate, useful result. And if not, the programmer can
check the weights ahead of time to verify that there are no negative
numbers.

So what is the effect of specifying iweights with -regress-?

Think of the regression estimator as (X'X)^(-1)X'y. Stata actually uses
a variation on that formula -- pulling out means and then adjusting the
result to be as if they were never pulled out -- but that is merely for
numerical stability.
The intended result is to approximate as closely as possible the formula
(X'X)^(-1)X'y calculated on an infinite-precision computer.

On this infinite-precision computer, an element of (X'X) (or X'y), let's
call it X_{ik}, is

                 N
    X_{ik}  =   Sum  x_ij x_kj
                j=1

For iweighted calculation, the element is calculated as

                 N
    X_{ik}  =   Sum  w_j x_ij x_kj
                j=1

It does not matter whether w_j is an integer, or even whether w_j is
positive. Stata makes that calculation for each matrix element and then
proceeds to calculate the overall result as (X'X)^(-1)X'y. (More
correctly, the result produced is as if we did that; the actual formulas
are more complicated, but that complication is just for numerical
stability.)

Let us also remember, no claims are made about any statistical meaning
of this calculation. It is merely a formula and we are describing it.

Now Anders observes that when he multiplies all his variables by his
iweight variable and then runs a regression on these iweighted
variables, the result is not the same as an iweighted regression. That
is, Anders compares

> . reg y x1 x2 [iw=myweight]    /* iweighted observations */

and

  . gen iw_y = myweight*y
  . gen iw_x1 = myweight*x1
  . gen iw_x2 = myweight*x2
> . reg iw_y iw_x1 iw_x2         /* iweighted variables */

True. If we wanted to reproduce what iweight does in this case -- and we
might want to do that just to be sure we understand the above formulas --
here is what we would need to type:

  . gen iw_y = sqrt(myweight)*y
  . gen iw_x1 = sqrt(myweight)*x1
  . gen iw_x2 = sqrt(myweight)*x2
  . gen cons = sqrt(myweight)
  . reg iw_y iw_x1 iw_x2 cons, nocons

There are two issues Anders forgot:

1. You multiply variables by the square root of the weight. Thus, the
terms we are summing to produce (X'X) are
(sqrt(w_j)*x_ij)*(sqrt(w_j)*x_kj) = w_j*x_ij*x_kj.

2. You must also multiply the intercept by the (square root of the)
weight.
The model is really

    y_j = x_1j*b1 + x_2j*b2 + b3

and multiplying through by v_j = sqrt(w_j) yields

    v_j*y_j = v_j*x_1j*b1 + v_j*x_2j*b2 + v_j*b3

I am unsure why Anders wants to use iweights, but I assume it is for some
good purpose. Be careful. The above exercise is useful for
understanding, but there are two more issues:

1. The multiplied-through-and-regress model is not as accurate as the
[iweight] model. There are numerical issues because, when Stata
estimates the [iweight] model, it does it all in deviation form (that
is, with the means pulled out). The iweighted estimates are more
accurate.

2. The multiplied-through-and-regress model's correspondence to iweights
breaks down if any of the weights are negative.

-- Bill
wgould@stata.com
---------------------------------------- END ---------------------

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
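
The square-root recipe in Bill's message can be checked with a short do-file. This is only a sketch under stated assumptions: the simulated data, the seed, and every variable name other than myweight are made up for illustration, and rnormal(), runiform(), and mreldif() assume a reasonably recent Stata rather than the 1998 release. It compares coefficient vectors only; standard errors and the reported number of observations are not compared.

```stata
* Sketch only: simulated data and variable names are illustrative assumptions.
clear
set seed 12345
set obs 50
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + 2*x1 - x2 + rnormal()
* Strictly positive, non-integer weights
generate myweight = 0.5 + runiform()

* (a) iweighted regression
regress y x1 x2 [iw=myweight]
matrix b_iw = e(b)

* (b) multiply every variable, including the constant, by sqrt(weight)
generate v    = sqrt(myweight)
generate v_y  = v*y
generate v_x1 = v*x1
generate v_x2 = v*x2
regress v_y v_x1 v_x2 v, noconstant
matrix b_sqrt = e(b)

* The two coefficient vectors should agree up to rounding error,
* so the relative difference should be close to zero.
matrix list b_iw
matrix list b_sqrt
display mreldif(b_iw, b_sqrt)
```

The weights are kept strictly positive because, as noted above, the correspondence breaks down for negative weights: sqrt() of a negative number is missing in Stata, so those observations would drop out of regression (b).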
