Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Comparing multiple means with survey data--revisited

From: Rieza Soelaeman
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Comparing multiple means with survey data--revisited
Date: Wed, 30 May 2012 21:02:09 -0500

```
Dear Steve,
Thank you for your reply.  The values are negative, because they are
standardized scores, so individuals that are below the mean have
negative values, whereas individuals above the mean have positive
values.  Our intention was to see whether the means in each row were
"equal," hence testing the equality of three means, rather than
pairwise comparisons (i.e., for individuals where Var_C = 0-9 months,
are -1.28, -0.57, and -0.36 statistically equal?).  Given this
additional information, is my original -test- statement the correct
way to assess this?

As far as using -svyset-, I wanted to analyze the data taking into
consideration the sampling design (2-stage cluster with urban/rural
stratification), so that the software "knows" to adjust for
homogeneity within the clusters.  I don't think we necessarily can
make inferences at the population level, given the limited data we
have.

Thank you also for the tip about using -subpop- rather than -if- for
subsetting the subgroup.  I am re-running the means with your code.

Bests,
Rieza

On Wed, May 30, 2012 at 11:25 AM, Steve Samuels <sjsamuels@gmail.com> wrote:
>
>
> Correction:
>
> "foreach x of local levels {"
>
> not
>
> "foreach x of local(levels) {"
>
>
>
> On May 30, 2012, at 12:05 PM, Steve Samuels wrote:
>
>
>
> Rieza:
>
> The means are negative and so don't appear to be "ordinary" descriptive
> statistics. Only you can say whether the purpose of the table is to
> describe a population (so that tests are not appropriate) or whether
> some causal hypothesis is in play (e.g., "that such-and-such an
> intervention will show stronger effects for higher levels of variable B
> and for variable A").
>
> The patterns are very clear:
> 1. Means grow in magnitude (become more negative) with row number.
> 2. In each row, first column means are larger in magnitude (more negative)
>    than third column means.
>
> Confidence intervals for differences are okay for descriptive tables, but even if
> there is a hypothesis floating around, such intervals would just confuse things here.
> There would be a minimum of 15 (3 pairs in each of 5 rows) if you did separate
> tests in each row, and 105 (all pairs among the 15 cells) if all pairwise
> comparisons in the table are considered. Note that your -test-
> statement tested equality of three means, not of two.
>
> I do suggest that you add standard errors to the table.
>
> Some alternative code:
>
> *********************
> // convert variable names to lower case for easier typing
> rename VARIABLE_*, lower
>
> svy: mean variable_a, over(variable_b variable_c)
> *********************
>
> For easier copying, you can get the columns of the table with the
> following code.
>
> *************************************************************
> levelsof variable_b, local(levels)
> foreach x of local levels {
>     di "variable_b = `x'"
>     svy, subpop(if variable_b==`x'): mean variable_a, over(variable_c)
> }
> *************************************************************
>
> For correct standard errors, use the -subpop- option to subset data, not the -if- qualifier.
>
> Steve
> sjsamuels@gmail.com
>
> On May 29, 2012, at 11:37 PM, Rieza Soelaeman wrote:
>
> Dear Stata-Lers,
> I need your help in clarifying an earlier point made about testing the
> difference between means in survey data (that is, you can't/shouldn't do
> this; I have copied the thread at the end of this e-mail).  I am trying to
> replicate the work of a colleague who left recently.  She created a table
> where the rows represent levels of one variable, columns represent the
> levels of another variable, and the cells contain the mean value of a third
> variable for that row/column combination and the number of people in that
> group.
>
> Example:
>
> In cells: Mean of Variable A (n)
>
> -----------------------------------------------------------------------------
>                         Variable B (years)
> -----------------------------------------------------------------------------
> Variable C
> (months)    5-10          11-15         16-20          Total          p-value
> -----------------------------------------------------------------------------
> 0-9         -1.28 (21)    -0.57 (60)    -0.36 (75)     -0.57 (156)    0.032
> 10-18       -1.44 (30)    -0.92 (47)    -1.00 (54)     -1.07 (132)    0.15
> 19-27       -1.95 (64)    -1.68 (77)    -1.63 (126)    -1.72 (268)    0.314
> 28-36       -1.92 (51)    -1.83 (52)    -1.72 (104)    -1.80 (206)    0.652
> 37-45       -1.96 (36)    -2.01 (61)    -1.65 (54)     -1.87 (151)    0.107
> -----------------------------------------------------------------------------
>
> Using -svyset-, I was able to get the same means and ns in each cell, but was
> not able to get the same significance level for the difference between the
> means--she used SPSS to get the p-values.  I suspect this is because I
> specified the cluster, stratum, and pweights in my -svyset- command, whereas
> the software she used allowed only for the specification of weights (to
> specify a complex sampling design in SPSS requires an extension that costs
>
> For those who are familiar with SPSS, she used the following syntax after
> applying weights, and subsetting for a specific level of VARIABLE_C:
>
> MEANS TABLES= VARIABLE_A BY VARIABLE_B
> /CELLS MEAN COUNT STDDEV
> /STATISTICS ANOVA.
>
> I believe the equivalent in Stata to get the means and p-values is to use
> the following code, but as Steve pointed out in the conversation copied
> below from 2009, this is not theoretically correct:
>
> . svy: mean VARIABLE_A if (VARIABLE_C==4), over(VARIABLE_B)
>
> . test [VARIABLE_A]_subpop_1 = [VARIABLE_A]_subpop_2 = [VARIABLE_A]_subpop_3
>
> My question is whether I should be attempting to compare the means using the
> -svyset-/-test- commands at all (is what I am trying to do
> legitimate), or if I should omit this comparison from my tables?
>
> Thanks,
> Rieza
>
> -----------------------------------------------------------------------------------------------------
>
> Re: st: comparing multiple means with survey data
>
> ________________________________
> From   sjsamuels@gmail.com
> To   statalist@hsphsun2.harvard.edu
> Subject   Re: st: comparing multiple means with survey data
> Date   Tue, 23 Jun 2009 12:52:47 -0400
> ________________________________
>
> Your syntax is correct.  You don't need the "linearized" option, as it
> is the default for -svy: mean-.
>
> However, hypothesis testing is usually not appropriate for finite
> population studies.  See:
> http://www.stata.com/statalist/archive/2009-02/msg00806.html   If
> hypothesis testing is appropriate for your situation, then you should
> exclude the finite population correction (fpc) option from your
> -svyset- command.
>
> I'm guessing that you also (or only) want to know how different the
> means in the categories of var2 are.   Confidence intervals will
> provide the answer, and you can keep the finite population correction
> in your -svyset- statement if appropriate.
>
> It is poor practice (and cumbersome) to label categories with strings
> like "var2a" "var2b". These are  unnecessary as "a", "b", .. have no
> descriptive value.  Just make var2 a numeric variable with values 1 2
> 3 4 5.  Use -label define- and -label values-  to associate the
> numeric values with descriptive text.
>
> Assuming that you do that, the easiest way to get confidence
> intervals for all pairwise differences after -svy: mean- is to write
> out the 10 statements
>
> lincom _b[1] - _b[2]
> lincom _b[1] - _b[3]
> ...
> lincom _b[4] - _b[5]
>
> For plotting continuous outcomes by group, I recommend -dotplot-,
> although it will not take weights.
>
> -Steve
>
> On Mon, Jun 22, 2009 at 4:31 PM, Jean-Gael Collomb <jg@ufl.edu> wrote:
>
> Hello -
> I have been struggling to find a way to compare the means of different
> categories of one of my variables. I think I have found a way, but I wonder
> if there would be a more efficient way to do it. In the following example,
> var2 has five categories (var2a-var2e).
> Here are the commands I type (after survey setting the data):
>
> svy linearized : mean var1, over(var2)
> test [var1]var2a = [var1]var2b = [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2b = [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2d = [var1]var2e, mtest(b)
>
> Is there a better way to do this?
>
> Thanks!
>
> Jean-Gael "JG" Collomb
> PhD candidate
> School of Natural Resources and Environment / School of Forest Resources
> and Conservation
> University of Florida
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
```
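
The approach discussed in the thread (declare the design with -svyset-, restrict with -subpop()- rather than -if-, then test whether three subgroup means are equal) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Stata's bundled NHANES II example data rather than the data from the thread, with bmi, race, and region standing in for VARIABLE_A, VARIABLE_B, and VARIABLE_C. It also swaps the -test- statement on the -svy: mean- coefficients for a regression on group indicators followed by -testparm-, because the coefficient names left behind by -svy: mean, over()- differ across Stata versions. The factor-variable notation (i.race) assumes Stata 11 or later.

```stata
* Sketch only: NHANES II example data, not the data from this thread.
* bmi, race, and region are stand-ins for VARIABLE_A, VARIABLE_B,
* and VARIABLE_C.
webuse nhanes2, clear

* Declare the design: PSUs, sampling weights, and strata
svyset psu [pweight=finalwgt], strata(strata)

* Subgroup means with design-based standard errors; -subpop()- keeps
* the full design in the variance calculation, unlike -if-
svy, subpop(if region==1): mean bmi, over(race)

* The same three means written as a regression on indicators for race;
* -testparm- then gives a joint adjusted Wald test that they are equal
svy, subpop(if region==1): regress bmi i.race
testparm i.race
```

Whether such a test is appropriate at all still depends on the point raised in the thread: it makes sense when a hypothesis is in play, and less so when the table is purely descriptive of a finite population.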