

From: Austin Nichols <austinnichols@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Weighting on Sub-samples of Complex Survey Data and Specifying Correlation for PA Models
Date: Thu, 6 May 2010 12:56:02 -0400

Ryan McCann <rmccann@keybridgeresearch.com> :

With some strong assumptions, you can "adjust" your weights to account for nonresponse (in this case, not being included in your estimation sample). One easy way is to create a dummy (-g insample=e(sample)-) and then either compute the mean of that dummy within groups of covariates (a nonparametric estimate of response probability) or run a logit of insample on covariates (a parametric estimate of response probability), then multiply the reciprocal of that estimated probability by your original sample weight. I.e. if 100 obs in a group with AssetTurnover==1 each have pweight=10 and 4 of them are in your estimation sample, then their new weight is 10*(100/4)=250 (this way, the 4 obs add up to a total weight of 100*10, just as the original 100 obs do). But it is probably not defensible to do this over all obs; rather, construct these weights for each panel (or perhaps each survey cluster) at one point in time--is there a time period at which all panels are observed and included in the estimation sample?

You should be using a cluster-robust VCE, which is robust to arbitrary autocorrelation. How many firms are in your estimation sample? 250? Plenty of clusters, enough for the cluster-robust VCE to dominate the alternatives. Probably you want to cluster by an even larger identifier, with fewer distinct levels, such as state, since firms within each state may have correlated errors.

In any case, you have far more serious problems. The main variable of interest (lnCreditCard) is certainly endogenous and you cannot interpret its coefficient in any regression as a true causal impact; certainly I would not trust an estimate that uses the between variation, and I would have qualms about one exploiting only the within variation (without using a plausible exclusion restriction and an Instrumental Variables estimation strategy).
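The nonresponse weight adjustment described above can be sketched in Stata using the numbers from the example. This is a minimal illustration on made-up data, not code from the original post: the variable names group, pw, p_resp, and pw_adj are hypothetical, and in real use insample would be created from e(sample) after fitting the model rather than assigned by hand.

```stata
* Made-up data: one covariate cell of 100 obs, each with pweight 10
clear
set obs 100
generate byte group = 1
generate double pw = 10

* Pretend only 4 of the 100 obs made it into the estimation sample
* (with real data: generate byte insample = e(sample) after the regression)
generate byte insample = _n <= 4

* Nonparametric response probability: share of each cell that is in the sample
bysort group: egen double p_resp = mean(insample)

* Adjusted weight = original weight times the reciprocal of that probability
generate double pw_adj = pw / p_resp if insample

* The 4 retained obs should carry the total weight of the original 100 obs
summarize pw_adj
assert abs(r(sum) - 100*10) < 1e-6
```

For the parametric version, one would presumably replace the -egen- step with -logit insample- on the chosen covariates followed by -predict double p_resp, pr-; which covariates to use is a modeling decision the post leaves open.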
Even if revenue has no impact on credit card transactions (reverse causality is almost certain, however), what would the interpretation of the coefficient be? I am sure you can write down a model where otherwise identical profit-maximizing firms are indifferent among several business strategies (with identical profits) with various levels of total revenue and credit card fraction--as the credit card transactions produce lower profit per transaction, a positive association between total revenue and credit card fraction tells you nothing about the level of profits for any of those data pairs.

Beyond that, you are estimating a regression of ln(Revenue) on X, and if you want to know the (marginal) effect of X on Revenue, or really, expected Revenue given some level of X, then you need a -poisson- regression (or -glm-, or -ivpois- on SSC for Stata 10, or -gmm- in Stata 11). Note that the derivative of the log of expected Revenue given some level of X is not the same as the derivative of the expected log Revenue given some level of X, and the first is what you want and what -poisson- et al. give you. If you're willing to assume homoskedastic iid errors, OLS of lnY on X is reasonable, but otherwise you should look for a GLM or GMM solution. This is a lesser concern than the endogeneity concern, however.

On Thu, May 6, 2010 at 11:37 AM, Ryan McCann <rmccann@keybridgeresearch.com> wrote:
> Dear Statalist Community Members,
>
> I’m working with a small business firm survey (a 4 year panel of about 5000
> small business start-ups which includes financial, geographic, and owner
> data). I am trying to assess the impact of credit card use on revenues.
> The regression at present looks like this:
>
> xtreg lnRevenue lnCreditCard lnAssests AssetTurnover NetMargin
> HumanCapitalVars [pweight=final longitudinal weight], pa corr(exchangeable)
>
> I am running into two significant problems:
>
> Firstly, there are a large number of missing values, so that when the
> regression is fully specified, I am left with about 1200 observations out of
> a total of 24,000 when the data is in long form. Since the data comes from
> a complex survey we need to use weights. Given the fact that the regression
> is only being run on a small subset of the full sample (and a t-test of means
> shows there is most likely some selection bias) it seems intuitive that the
> weights will not provide an accurate metric for arriving at unbiased
> estimates. Is there any consensus on how to handle this type of situation?
> (Imputations of missing data have already been done to the extent I am
> comfortable, and the resulting subsample is still very small compared to the
> original).
>
> Secondly, the random effects model would seem more appropriate than fixed
> effects because most of the variation in the sample is between as opposed to
> within (the panel is not that wide to begin with: the average time series for
> an individual is only 2.5 periods). Stata does not allow for the use of
> weights with RE so we are using a population-averaged regression. At this point
> I’m trying to determine the type of autocorrelation that is present. The
> “pa” regression in Stata allows for Independent, Exchangeable, Unstructured,
> and AR error correlations over time. I’ve run regressions by year and
> predicted the error terms for each time period. I then regressed these
> errors on their lags and (t-2) lags and have come out with fairly consistent
> coefficients on the lag term (around .57, a couple of the coefficients came
> out to be around .3) (I used this method in the absence of knowing any
> Durbin-Watson type test that allows for weights). The error correlations I
> get by using the exchangeable option come out around .59. It appears that
> the independent option (i.e. no autocorrelation) is not appropriate, but I’m
> wondering how I choose between Exchangeable and Unstructured (not sure if AR
> process is present).
>
> Any suggestions are greatly appreciated.
>
> Best Regards,
> Ryan
>
>
> Ryan McCann
> Senior Analyst
> Keybridge Research LLC
> Office: 202.965.9487 | Mobile: 774.521.8874

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
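The distinction drawn in the reply between the log of expected Revenue and the expected log of Revenue can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, the names x, s, u, y, and lny are hypothetical, and the coefficients 1 and 0.5 are arbitrary choices, none of them taken from the thread. The errors are built to be heteroskedastic while keeping E[y | x] = exp(1 + 0.5*x).

```stata
* Simulated illustration only: names and numbers are not from the post
clear
set seed 12345
set obs 5000
generate double x = rnormal()

* Error spread depends on x, so the errors are heteroskedastic
generate double s = 0.5*exp(0.3*x)

* u ~ N(-s^2/2, s^2), so E[exp(u) | x] = 1 and E[y | x] = exp(1 + 0.5*x)
generate double u = rnormal(-s^2/2, s)
generate double y = exp(1 + 0.5*x + u)
generate double lny = ln(y)

* OLS on the log targets E[ln y | x] = 1 + 0.5*x - s^2/2,
* whose slope in x differs from 0.5 because s depends on x
regress lny x, vce(robust)

* Poisson on the level targets ln E[y | x] = 1 + 0.5*x,
* so its slope estimates the 0.5 used to build the data
poisson y x, vce(robust)
```

Stata prints a note about the noncount dependent variable when -poisson- is run on continuous y; with robust standard errors the estimator is being used as a quasi-likelihood method, which is the use the reply has in mind. With real panel data one would likely swap vce(robust) for vce(cluster id), where id stands for whatever cluster identifier is chosen.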

**References**: **st: Weighting on Sub-samples of Complex Survey Data and Specifying Correlation for PA Models** (*From:* Ryan McCann <rmccann@keybridgeresearch.com>)
