Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From    Rieza Soelaeman <rsoelaeman@gmail.com>
To      statalist@hsphsun2.harvard.edu
Subject Re: st: Comparing multiple means with survey data--revisited
Date    Fri, 1 Jun 2012 11:56:42 -0500

OK, got it now. Thank you, Steve, for all your help.

Rieza

On Thu, May 31, 2012 at 5:21 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
>
> I still can't make the decision for you, because I don't know the theory and
> background of your problem.
>
> • Your post states: "I don't think we necessarily can make inferences at the
> population level, given the limited data we have."
>
> This is a misconception: with a well-executed probability sample, one can always
> make inferences about the sampled population. If the sample size is small, the
> confidence intervals might be wide and the power of tests will be poor. The CIs
> will still provide some information; the tests, none.
>
> • You want to know if the means are equal. In a finite population, this null
> hypothesis will never be true, so the answer in advance is "No". The real
> question is "How different, if at all?", and this is answered by CIs.
>
> If you wish to test a hypothesis of no difference, you need to think in terms of
> a super-population or data-generating model in which the null hypothesis
> could be true. Consider a comparison of two reported disease rates. With 100%
> reporting, the SEs for the rates and their differences would be zero (there's no
> sampling). Yet many practitioners would make inferences on the basis of a
> Poisson model in which the "true" rates do not differ. Hypothesis tests would
> also be justified if you study the causal influence of the column variable on
> your outcome, since in this case the specific population is of secondary
> interest.
>
> • If you do use the -test- command, specify the mtest(noadjust) option, then
> multiply the p-values for the pairwise tests by 15; multiply the overall 2 d.f.
> p-value in each row by 5. These are the Bonferroni corrections.
>
> • If you compute CIs only (with -lincom-), you are free to add the finite
> population correction (fpc()) option to your -svyset- statement.
>
> • Your colleague grouped the data into wide intervals. Plotting outcomes against
> individual years would be more informative.
>
> Reference: Levy, P. S., & Lemeshow, S. (2008). Sampling of Populations: Methods
> and Applications (4th ed.). Hoboken, NJ: Wiley.
>
> Steve
> sjsamuels@gmail.com
>
> On May 30, 2012, at 10:02 PM, Rieza Soelaeman wrote:
>
> Dear Steve,
> Thank you for your reply. The values are negative because they are
> standardized scores, so individuals who are below the mean have
> negative values, whereas individuals above the mean have positive
> values. Our intention was to see whether the means in each row were
> "equal," hence testing the equality of three means, rather than
> pairwise comparisons (i.e., for individuals where Var_C = 0-9 months,
> are -1.28, -0.57, and -0.36 statistically equal?). Given this
> additional information, is my original -test- statement the correct
> way to assess this?
>
> As far as using -svyset-, I wanted to analyze the data taking into
> consideration the sampling design (2-stage cluster with urban/rural
> stratification), so that the software "knows" to adjust for
> homogeneity within the clusters. I don't think we necessarily can
> make inferences at the population level, given the limited data we
> have.
>
> Thank you also for the tip about using -subpop- rather than -if- for
> subsetting the subgroup. I am re-running the means with your code.
>
> Bests,
> Rieza
>
> On Wed, May 30, 2012 at 11:25 AM, Steve Samuels <sjsamuels@gmail.com> wrote:
>
> Correction:
>
> "foreach x of local levels {"
>
> not
>
> "foreach x of local(levels) {"
>
> On May 30, 2012, at 12:05 PM, Steve Samuels wrote:
>
> Rieza:
>
> The means are negative and so don't appear to be "ordinary" descriptive
> statistics. Only you can say whether the purpose of the table is descriptive of
> a population (so that tests are not appropriate) or whether some causal
> hypothesis is in play (e.g., "that such-and-such an intervention will show
> stronger effects for higher levels of variable B and for variable A").
>
> The patterns are very clear:
> 1. Means increase with row number.
> 2. In each row, first column means are higher than third column means.
>
> Confidence intervals for differences are okay for descriptive tables, but even
> if there is a hypothesis floating around, such intervals would just confuse
> things here. There would be a minimum of 15 if you did separate tests in each
> row, and 105 if all pairwise comparisons in the table are considered. Note that
> your -test- statement tested equality of three means, not of two.
>
> I do suggest that you add standard errors to the table.
>
> Some alternative code:
>
> *************************************************************
> // convert variable names to lower case for easier typing
> rename VARIABLE_*, lower
>
> svy: mean variable_a, over(variable_b variable_c)
> *************************************************************
>
> For easier copying, you can get the columns of the table with the
> following code.
>
> *************************************************************
> levelsof variable_b, local(levels)
> foreach x of local levels {
>     di "variable_b = `x'"
>     svy, subpop(if variable_b==`x'): mean variable_a, over(variable_c)
> }
> *************************************************************
>
> For correct standard errors, use the -subpop- option to subset data, not the
> -if- qualifier.
>
> Steve
> sjsamuels@gmail.com
>
> On May 29, 2012, at 11:37 PM, Rieza Soelaeman wrote:
>
> Dear Stata-Lers,
> I need your help in clarifying an earlier point made about testing the
> difference between means in survey data (that is, you can't/shouldn't do
> this; I have copied the thread at the end of this e-mail). I am trying to
> replicate the work of a colleague who left recently. She created a table
> where the rows represent levels of one variable, columns represent the
> levels of another variable, and the cells contain the mean value of a third
> variable for that row/column combination and the number of people in that
> group.
>
> Example:
>
> In cells: Mean of Variable A (n)
>
> ---------------------------------------------------------------------------------
>                              Variable B (years)
> ---------------------------------------------------------------------------------
> Variable C
> (months)     5-10          11-15         16-20          Total          p-value
> ---------------------------------------------------------------------------------
> 0-9          -1.28 (21)    -0.57 (60)    -0.36 (75)     -0.57 (156)    0.032
> 10-18        -1.44 (30)    -0.92 (47)    -1.00 (54)     -1.07 (132)    0.15
> 19-27        -1.95 (64)    -1.68 (77)    -1.63 (126)    -1.72 (268)    0.314
> 28-36        -1.92 (51)    -1.83 (52)    -1.72 (104)    -1.80 (206)    0.652
> 37-45        -1.96 (36)    -2.01 (61)    -1.65 (54)     -1.87 (151)    0.107
> ---------------------------------------------------------------------------------
>
> Using -svyset-, I was able to get the same means and ns in each cell, but was
> not able to get the same significance level for the difference between the
> means. She used SPSS to get the p-values. I suspect this is because I
> specified the cluster, stratum, and pweights in my -svyset- command, whereas
> the software she used allowed only for the specification of weights (to
> specify a complex sampling design in SPSS requires an extension that costs
> about $600).
>
> For those who are familiar with SPSS, she used the following syntax after
> applying weights and subsetting for a specific level of VARIABLE_C:
>
> MEANS TABLES= VARIABLE_A BY VARIABLE_B
>   /CELLS MEAN COUNT STDDEV
>   /STATISTICS ANOVA.
>
> I believe the equivalent in Stata to get the means and p-values is to use
> the following code, but as Steve pointed out in the conversation copied
> below from 2009, this is not theoretically correct:
>
> . svy: mean VARIABLE_A if (VARIABLE_C==4), over(VARIABLE_B)
>
> . test [VARIABLE_A]_subpop_1 = [VARIABLE_A]_subpop_2 = [VARIABLE_A]_subpop_3
>
> My question is whether I should be attempting to compare the means using the
> -svyset-/-test- commands at all (is what I am trying to do
> legitimate?), or whether I should omit this comparison from my tables.
>
> Thanks,
> Rieza
>
> ---------------------------------------------------------------------------------
>
> Re: st: comparing multiple means with survey data
>
> ________________________________
> From sjsamuels@gmail.com
> To statalist@hsphsun2.harvard.edu
> Subject Re: st: comparing multiple means with survey data
> Date Tue, 23 Jun 2009 12:52:47 -0400
> ________________________________
>
> Your syntax is correct. You don't need the "linearized" option, as it
> is the default for -svy: mean-.
>
> However, hypothesis testing is usually not appropriate for finite
> population studies. See:
> http://www.stata.com/statalist/archive/2009-02/msg00806.html
> If hypothesis testing is appropriate for your situation, then you should
> exclude the finite population correction (fpc) option from your
> -svyset- command.
>
> I'm guessing that you also (or only) want to know how different the
> means in the categories of var2 are. Confidence intervals will
> provide the answer, and you can keep the finite population correction
> in your -svyset- statement if appropriate.
>
> It is poor practice (and cumbersome) to label categories with strings
> like "var2a" "var2b". These are unnecessary as "a", "b", .. have no
> descriptive value. Just make var2 a numeric variable with values 1 2
> 3 4 5. Use -label define- and -label values- to associate the
> numeric values with descriptive text.
>
> Assuming that you do that, the easiest way to get confidence
> intervals for all pairwise differences after -svy: mean- is to write
> out the 10 statements
>
> lincom _b[1] - _b[2]
> lincom _b[1] - _b[3]
> ...
> lincom _b[4] - _b[5]
>
> For plotting continuous outcomes with groups I recommend -dotplot-,
> although it will not take weights.
>
> -Steve
>
> On Mon, Jun 22, 2009 at 4:31 PM, Jean-Gael Collomb <jg@ufl.edu> wrote:
>
> Hello -
> I have been struggling to find a way to compare the means of the different
> categories of one of my variables. I think I have found a way but I wonder
> if there would be a more efficient way to do it. In the following example,
> var2 has five categories (var2a-var2e).
> Here are the commands I type (after survey setting the data):
>
> svy linearized : mean var1, over(var2)
> test [var1]var2a = [var1]var2b = [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2b = [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2d = [var1]var2e, mtest(b)
>
> Is there a better way to do this?
>
> Thanks!
>
> Jean-Gael "JG" Collomb
> PhD candidate
> School of Natural Resources and Environment / School of Forest Resources and
> Conservation
> University of Florida

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
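
The Bonferroni arithmetic described in Steve's May 31 reply can be sketched in a few lines of Stata. This is only an illustration under stated assumptions: the five p-values are copied from the "p-value" column of the table in the original post (weights-only SPSS results, not design-based ones) and serve purely as placeholders, the local macro name m is arbitrary, and capping the adjusted value at 1 is the usual convention rather than something stated in the thread.

```stata
* Bonferroni sketch: multiply each unadjusted p-value by the number of
* tests and cap the result at 1.
* Assumption: the p-values below are placeholders taken from the table in
* the original post; in practice they would come from -test- with the
* mtest(noadjust) option after -svy: mean-.
* m = 5 because there is one overall 2 d.f. test per row and 5 rows.
local m = 5
foreach p in 0.032 0.15 0.314 0.652 0.107 {
    display %6.3f `p' "  ->  " %6.3f min(1, `m'*`p')
}
```

For the pairwise tests the same loop would apply with m set to 15, that is, 3 pairs of columns in each of the 5 rows.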
