
Re: st: Chi square test unavailable when subpop is used in svy analyisis

From   [email protected] (Jeff Pitblado, StataCorp LP)
To   [email protected]
Subject   Re: st: Chi square test unavailable when subpop is used in svy analyisis
Date   Tue, 19 Aug 2008 17:57:03 -0500

Ángel Rodríguez Laso <[email protected]> has a follow-up question regarding
-svy: tabulate- with the -subpop()- option:

> Following with this, I have a query: If there are missing values in a
> variable and SEs and CIs for the valid values are wanted, how should
> one proceed? Are individuals with missing values dropped from the
> calculations of SEs if subpop is not used? I see four possibilities:
> 1) svy:tab variable */intuitive option
> 2) svy, subpop (valid values): tab variable */probably most accurate
> 3) svy if variable==valid values: tab variable */not recommended for svy
> 4) svy: tab variable, missing */ but then you don't get proportions of
> valid values after excluding missing values
> In an example with a dichotomous variable with 5.7% missing values, I
> get exactly (up to three decimal figures) the same SEs, CIs and number
> of observations (n=11500, degrees of freedom=1255) with options 1, 2
> and 3, and slightly smaller SEs with option 4 (n=12190, df=1255).

In reviewing Ángel's results, we noticed that -svy: tabulate- is incorrectly
dropping out-of-subpop observations that contain missing values in the
variables of the varlist (Option 2 should be different from options 1 and 3).
This affects the variance values when primary sampling units (PSUs) are dropped
because of missing values and could decrease the design degrees of freedom.
Both of these effects are very slight and inversely related to the number of
PSUs.  We will correct this in the next Stata ado-file update.

In light of this, we'll address Ángel's observations using -svy: proportion-,
which is very similar to -svy: tabulate- and correctly deals with missing
values in out-of-subpop observations.
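
One way to see whether missing values cost any PSUs or design degrees of
freedom is to compare the stored results e(N_psu) and e(df_r) with and without
-subpop()-.  The following is a minimal sketch using the auto data shipped
with Stata, where rep78 is missing in 5 of the 74 observations.  The indicator
name hasrep is ours, and declaring each observation its own PSU is only a
convenient assumption for illustration.

```stata
sysuse auto, clear
* Assumption for illustration: every observation is its own PSU
svyset _n
* hasrep (hypothetical name) marks observations with a nonmissing rep78
generate byte hasrep = !missing(rep78)

* Without subpop(): observations with missing rep78 are dropped
svy: proportion rep78
display "PSUs = " e(N_psu) ", design df = " e(df_r)

* With subpop(): those observations stay in the design as out-of-subpop
svy, subpop(hasrep): proportion rep78
display "PSUs = " e(N_psu) ", design df = " e(df_r)
```

If the second run reports more PSUs than the first, the first run lost whole
PSUs to missing values.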

In the following we assume that the only variable with missing values is the
one we are tabulating.  Here is a simple example that illustrates the
differences among the 4 options delineated by Ángel.

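	. sysuse auto, clear
	. * treat each observation as its own PSU
	. svyset _n
	. * 1: no options (rep abbreviates rep78)
	. svy: prop rep
	. est store noopts
	. * 2: observations with nonmissing rep78 form the subpop
	. gen valid = !missing(rep)
	. svy, subpop(valid): prop rep
	. est store subpop
	. * 3: restrict the sample with -if-
	. svy: prop rep if valid
	. est store withif
	. * 4: treat missing as its own category
	. svy: prop rep, missing
	. est store missing
	. est table _all, b se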

***** BEGIN: final output from above illustrative example
. est table _all, b se

    Variable |   noopts       subpop       withif      missing    
           1 |  .02898551    .02898551    .02898551    .02702703  
             |  .02034459    .02033449    .02034459    .01897965  
           2 |  .11594203    .11594203    .11594203    .10810811  
             |  .03882454    .03880527    .03882454    .03634325  
           3 |  .43478261    .43478261    .43478261    .40540541  
             |   .0601159    .06008606     .0601159    .05746373  
           4 |  .26086957    .26086957    .26086957    .24324324  
             |  .05324978    .05322334    .05324978    .05021542  
           5 |  .15942029    .15942029    .15942029    .14864865  
             |  .04439221    .04437017    .04439221    .04163643  
     _prop_6 |                                         .06756757  
             |                                         .02937761  
                                                      legend: b/se
***** END:
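
Because -svyset _n- makes each observation its own equally weighted PSU with
no strata or finite population correction, the linearized standard errors
above can be checked by hand.  The closed forms below are our own
simplification under that design, not something -svy- reports.  They use the
facts that rep78 equals 1 in 2 of the 69 nonmissing observations and that the
data have 74 observations in all.

```stata
* Options 1 and 3: only the 69 complete observations count as PSUs
display %10.8f sqrt((2/69)*(1 - 2/69)/(69 - 1))

* Option 2: all 74 PSUs stay in the design, 69 of them in the subpop
display %10.8f sqrt((74/(74 - 1))*(2/69)*(1 - 2/69)/69)

* Option 4: missing is a sixth category, so the denominator is 74
display %10.8f sqrt((2/74)*(1 - 2/74)/(74 - 1))
```

The three results should agree with the standard errors in the first row of
the table (noopts/withif, subpop, and missing, respectively).  The subpop
value is slightly smaller here because the factor 74/(73*69) is less than
1/68.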

Summary of options (illustrated by above example using the auto data):

- Options 1 (noopts) and 3 (withif)  are equivalent.  Stata's -svy- commands
  drop within-subpop observations containing missing values.  In this case,
  the "subpop" is the entire population, and option 3 merely explicitly
  excludes the observations that option 1 dropped because of missing values.

- Option 2 (subpop) differs by treating the observations where the tabulated
  variables contain missing values as out-of-subpop.  Thus we are defining the
  subpop as the collection of individuals in the population for which we are
  able to collect information on the tabulated variable.  While this results
  in the same point estimates for any survey design, the variance estimates
  can vary depending upon the number of PSUs that are dropped by options 1 and
  3.

- Option 4 (missing) merely treats the missing values as a separate category,
  potentially biasing the point estimates and standard errors downward (toward
  zero).  The -missing- option should only be used in cases where the missing
  values mean something like "not applicable" rather than "we couldn't get a
  value from the survey participant".
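
To gauge how much options 1 and 3 can differ from option 2 in a given dataset,
it helps to count the PSUs in which every observation is missing on the
tabulated variable, since those are the PSUs that options 1 and 3 drop
entirely.  A minimal sketch on made-up data follows; the variable names psu,
y, valid, anyvalid, and onepsu are all hypothetical.

```stata
clear
* Toy data: three PSUs; every observation in PSU 2 is missing on y
input psu y
1 1
1 0
2 .
2 .
3 1
3 .
end

* Mark observations with a nonmissing value of the tabulated variable
generate byte valid = !missing(y)
* anyvalid is 1 for every observation in a PSU that has at least one valid value
egen byte anyvalid = max(valid), by(psu)
* Tag one observation per PSU so that each PSU is counted once
egen byte onepsu = tag(psu)
* PSU 2 is the only PSU with no valid observation, so the count is 1
count if onepsu & !anyvalid
```

If the count is zero, no PSU is lost, and we would expect options 1, 2, and 3
to give essentially the same variance estimates.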

The option to choose is largely dependent on the reason for missing values in
the data.

[email protected]