Re: st: Weighting on Sub-samples of Complex Survey Data and Specifying Correlation for PA Models


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Weighting on Sub-samples of Complex Survey Data and Specifying Correlation for PA Models
Date   Thu, 6 May 2010 12:56:02 -0400

Ryan McCann <rmccann@keybridgeresearch.com> :
With some strong assumptions, you can "adjust" your weights to account
for nonresponse (in this case, not being included in your estimation
sample).  One easy way is to create a dummy
g insample=e(sample)
and then compute the mean of that dummy within groups of covariates (a
nonparametric estimate of response probability) or run a logit of
insample on covariates (a parametric estimate of response
probability), then multiply the reciprocal of that estimated
probability by your original sample weight.  I.e. if 100 obs in a
group with AssetTurnover==1 each have pweight=10 and 4 of them are in
your estimation sample, then their new weight is 10*(100/4)=250 (this
way, the 4 obs add up to a total weight of 100*10 just as the original
100 obs do).  It is probably not defensible to do this over all obs,
though; better to construct these weights for each panel (or perhaps
each survey cluster) at one point in time--is there a time period at
which all panels are observed and included in the estimation sample?
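
A rough sketch of the adjustment described above, not a definitive recipe.  The names origwt, cell, x1 and x2 are hypothetical placeholders (the original survey weight, a grouping of covariates, and predictors of inclusion); none of them come from the original post.  It assumes the estimation command has just been run, so that e(sample) is set.

```stata
* Sketch only: origwt, cell, x1, x2 are hypothetical variable names
generate byte insample = e(sample)

* Nonparametric: share of obs in the estimation sample within each cell
egen double p_cell = mean(insample), by(cell)

* Parametric alternative: logit of inclusion on covariates
logit insample x1 x2
predict double p_logit, pr

* Adjusted weight = original weight times reciprocal of estimated probability
generate double adjwt = origwt / p_cell if insample
```

With the numbers in the example, p_cell would be 4/100 = .04 and adjwt would be 10/.04 = 250.  Substituting p_logit for p_cell gives the parametric version.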

You should be using a cluster-robust VCE, which is robust to arbitrary
autocorrelation.  How many firms are in your estimation sample? 250?
Plenty of clusters, enough for the cluster-robust VCE to dominate the
alternatives. Probably you want to cluster by an even larger
identifier, with fewer distinct levels, such as state, since firms
within each state may have correlated errors.
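
A minimal sketch of both options, under assumptions: firmid, year, state and wt are hypothetical names for the panel identifier, the time variable, a state identifier and the (possibly adjusted) weight, and the regressor list is copied from the original post without the human capital variables.  As far as I know, -xtreg, pa- needs weights that are constant within panel, and its vce(robust) clusters on the panel variable, so clustering by state here falls back on a pooled regression.

```stata
* Sketch only: firmid, year, state, wt are hypothetical variable names
xtset firmid year

* PA model with a VCE that is robust to within-firm correlation
xtreg lnRevenue lnCreditCard lnAssests AssetTurnover NetMargin ///
    [pweight=wt], pa corr(exchangeable) vce(robust)

* Pooled regression clustered at a coarser level (firms nested in states)
regress lnRevenue lnCreditCard lnAssests AssetTurnover NetMargin ///
    [pweight=wt], vce(cluster state)
```

With a cluster-robust VCE the choice of working correlation matters for efficiency rather than for the validity of the standard errors.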

In any case, you have far more serious problems.  The main variable of
interest (lnCreditCard) is certainly endogenous and you cannot
interpret its coefficient in any regression as a true causal impact;
certainly I would not trust an estimate that uses the between
variation, and I would have qualms about one exploiting only the
within variation (without using a plausible exclusion restriction and
an Instrumental Variables estimation strategy).  Even if revenue has
no impact on credit card transactions (reverse causality is almost
certain however), what would the interpretation of the coefficient be?
 I am sure you can write down a model where otherwise identical
profit-maximizing firms are indifferent among several business
strategies (with identical profits) with various levels of total
revenue and credit card fraction--as the credit card transactions
produce lower profit per transaction, a positive association between
total revenue and credit card fraction tells you nothing about the
level of profits for any of those data pairs.

Beyond that, you are estimating a regression of ln(Revenue) on X, and
if you want to know the (marginal) effect of X on Revenue, or really,
expected Revenue given some level of X, then you need a -poisson-
regression (or -glm-, or -ivpois- on SSC for Stata 10, or -gmm- in
Stata 11).  Note that the derivative of the log of expected Revenue
given some level of X is not the same as the derivative of the
expected log Revenue given some level of X, and the first is what you
want and what -poisson- et al. give you.  If you're willing to assume
homoskedastic iid errors, OLS of lnY on X is reasonable, but
otherwise you should look for a GLM or GMM solution.  This is a lesser
concern than the endogeneity concern, however.
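
A minimal sketch of the exponential-mean alternatives, assuming a variable Revenue in levels exists (the original post only shows lnRevenue) and using the hypothetical names wt and firmid for the weight and the panel identifier.  It does nothing about the endogeneity of lnCreditCard.

```stata
* Sketch only: Revenue (in levels), wt, firmid are assumed variable names
* Models E[Revenue | X] = exp(Xb); robust VCE since Revenue is not a count
poisson Revenue lnCreditCard lnAssests AssetTurnover NetMargin ///
    [pweight=wt], vce(cluster firmid)

* The same conditional mean written as a GLM
glm Revenue lnCreditCard lnAssests AssetTurnover NetMargin ///
    [pweight=wt], family(poisson) link(log) vce(cluster firmid)
```

The coefficients in either command describe changes in the log of expected Revenue, which is the first of the two derivatives above.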

On Thu, May 6, 2010 at 11:37 AM, Ryan McCann
<rmccann@keybridgeresearch.com> wrote:
> Dear Statalist Community Members,
>
> I’m working with a small business firm survey (a 4 year panel of about 5000
> small business start-ups which includes financial, geographic, and owner
> data).  I am trying to assess the impact of credit card use on revenues.
> The regression at present looks like this:
>
> xtreg lnRevenue  lnCreditCard lnAssests AssetTurnover NetMargin
> HumanCapitalVars [pweight=final longitudinal weight], pa corr(exchangeable)
>
>  I am running into two significant problems:
>
> Firstly, there are a large number of missing values, so that when the
> regression is fully specified, I am left with about 1200 observations out of
> a total of 24,000 when the data is in long form.  Since the data comes from
> a complex survey we need to use weights.  Given the fact that the regression
> is only being run on small subset of the full sample (and a t-test of means
> shows there is most likely some selection bias) it seems intuitive that the
> weights will not provide an accurate metric for arriving at unbiased
> estimates. Is there any consensus on how to handle this type of situation?
> (Imputations of missing data have already been done to the extent I am
> comfortable, and the resulting subsample is still very small compared to the
> original).
>
> Secondly, the random effects model would seem more appropriate than fixed
> effects because most of the variation in the sample is between as opposed to
> within (the panel is not that wide to begin with; the average time series
> for an individual is only 2.5 periods).  STATA does not allow for the use of
> weights with RE so we are using a Pooled Average regression.  At this point
> I’m trying to determine the type of autocorrelation that is present.  The
> “pa” regression in STATA allows for Independent, Exchangeable, Unstructured,
> and AR error correlations over time.  I’ve run regressions by year and
> predicted the error terms for each time period.  I then regressed these
> errors on their lags and (t-2) lags and have come out with fairly consistent
> coefficients on the lag term (around .57, a couple of the coefficients came
> out to be around .3) (I used this method in the absence of knowing any
> Durbin-Watson type test that allows for weights).  The error correlations I
> get by using the exchangeable option come out around .59.  It appears that
> the independent option (i.e. no autocorrelation) is not appropriate, but I’m
> wondering how I choose between Exchangeable and Unstructured (not sure if AR
> process is present).
>
> Any suggestions are greatly appreciated.
>
> Best Regards,
> Ryan
>
>
> Ryan McCann
> Senior Analyst
> Keybridge Research LLC
> Office: 202.965.9487 | Mobile: 774.521.8874


