From: Steve Samuels <sjsamuels@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Comparing multiple means with survey data--revisited
Date: Thu, 31 May 2012 18:21:03 -0400
I still can't make the decision for you, because I don't know the theory and background of your problem.

• Your post states: "I don't think we necessarily can make inferences at the population level, given the limited data we have." This is a misconception: with a well-executed probability sample, one can always make inferences about the sampled population. If the sample size is small, confidence intervals might be wide and the power of tests will be poor; the CIs will still provide some information, the tests none.

• You want to know if the means are equal. In a finite population, this null hypothesis will never be true, so the answer in advance is "No". The real question is "How, if at all, different?", and this is answered by CIs. If you wish to test a hypothesis of no difference, you need to think in terms of a super-population or data-generating model in which the null hypothesis could be true. Consider a comparison of two reported disease rates. With 100% reporting, the SEs for the rates and their differences would be zero (there is no sampling). Yet many practitioners would make inferences on the basis of a Poisson model in which the "true" rates do not differ. Hypothesis tests would also be justified if you study the causal influence of the column variable on your outcome, since in this case the specific population is of secondary interest.

• If you do use the -test- command, specify the mtest(noadjust) option, then multiply the p-values for the pairwise tests by 15 and multiply the overall 2 d.f. p-value in each row by 5. These are the Bonferroni corrections. (A sketch follows this message.)

• If you compute CIs only (with -lincom-), you are free to add the finite population correction (fpc()) option to your -svyset- statement.

• Your colleague grouped the data into wide intervals. Plotting outcomes against individual years would be more informative.

Reference: Levy, P. S., & Lemeshow, S. (2008). Sampling of Populations: Methods and Applications (4th ed.). Hoboken, NJ: Wiley.

Steve
sjsamuels@gmail.com
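[A minimal sketch of the -test- and -lincom- suggestions above, using the lower-case variable names that appear in the code quoted below; the design variables (psu, strat, pw, fpcvar) and the level of variable_c are placeholders to be adapted, not a prescription.]

*********************
* placeholder design: PSU sample with stratification and probability weights
svyset psu [pweight = pw], strata(strat)

* one row of the table: a single (placeholder) level of variable_c,
* columns = levels of variable_b
svy, subpop(if variable_c == 1): mean variable_a, over(variable_b)

* joint 2 d.f. test plus unadjusted per-comparison p-values;
* per the advice above, multiply the joint p-value by 5 (rows) and the
* pairwise p-values by 15 (Bonferroni)
test [variable_a]_subpop_1 = [variable_a]_subpop_2 = [variable_a]_subpop_3, mtest(noadjust)

* for a descriptive CI of one pairwise difference, use -lincom-,
* optionally after re-running -svyset- with the fpc(fpcvar) option
lincom [variable_a]_subpop_1 - [variable_a]_subpop_3
*********************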
On May 30, 2012, at 10:02 PM, Rieza Soelaeman wrote:

Dear Steve,

Thank you for your reply. The values are negative because they are standardized scores, so individuals below the mean have negative values, whereas individuals above the mean have positive values. Our intention was to see whether the means in each row were "equal," hence testing the equality of three means rather than pairwise comparisons (i.e., for individuals where Var_C = 0-9 months, are -1.28, -0.57, and -0.36 statistically equal?). Given this additional information, is my original -test- statement the correct way to assess this?

As far as using -svyset-, I wanted to analyze the data taking into consideration the sampling design (2-stage cluster with urban/rural stratification), so that the software "knows" to adjust for homogeneity within the clusters. I don't think we necessarily can make inferences at the population level, given the limited data we have.

Thank you also for the tip about using -subpop- rather than -if- for subsetting the subgroup. I am re-running the means with your code.

Bests,
Rieza

On Wed, May 30, 2012 at 11:25 AM, Steve Samuels <sjsamuels@gmail.com> wrote:

Correction: "foreach x of local levels {" not "foreach x of local(levels) {"

On May 30, 2012, at 12:05 PM, Steve Samuels wrote:

Rieza:

The means are negative and so don't appear to be "ordinary" descriptive statistics. Only you can say whether the purpose of the table is descriptive of a population (so that tests are not appropriate) or whether some causal hypothesis is in play (e.g., "that such-and-such an intervention will show stronger effects for higher levels of variable B and for variable A"). The patterns are very clear:

1. The means grow more negative with row number.
2. In each row, the first-column mean is more negative than the third-column mean.

Confidence intervals for differences are okay for descriptive tables, but even if there is a hypothesis floating around, such intervals would just confuse things here: there would be a minimum of 15 if you did separate tests in each row, and 105 if all pairwise comparisons in the table are considered. Note that your -test- statement tested equality of three means, not of two. I do suggest that you add standard errors to the table. Some alternative code:

*********************
// convert variable names to lower case for easier typing
rename VARIABLE_*, lower
svy: mean variable_a, over(variable_b variable_c)
*********************

For easier copying, you can get the columns of the table with the following code:

*************************************************************
levelsof variable_b, local(levels)
foreach x of local levels {
    di "variable_b = `x'"
    svy, subpop(if variable_b==`x'): mean variable_a, over(variable_c)
}
*************************************************************

For correct standard errors, use the -subpop- option to subset the data, not the -if- qualifier.

Steve
sjsamuels@gmail.com

On May 29, 2012, at 11:37 PM, Rieza Soelaeman wrote:

Dear Statalisters,

I need your help in clarifying an earlier point made about testing the difference between means in survey data (that is, that you can't/shouldn't do this; I have copied the thread at the end of this e-mail). I am trying to replicate the work of a colleague who left recently. She created a table where the rows represent levels of one variable, the columns represent the levels of another variable, and the cells contain the mean value of a third variable for that row/column combination and the number of people in that group. Example:

In cells: Mean of Variable A (n)
-----------------------------------------------------------------------------------
                                    Variable B (years)
-----------------------------------------------------------------------------------
Variable C (months)   5-10         11-15        16-20         Total         p-value
-----------------------------------------------------------------------------------
0-9                   -1.28 (21)   -0.57 (60)   -0.36 (75)    -0.57 (156)   0.032
10-18                 -1.44 (30)   -0.92 (47)   -1.00 (54)    -1.07 (132)   0.15
19-27                 -1.95 (64)   -1.68 (77)   -1.63 (126)   -1.72 (268)   0.314
28-36                 -1.92 (51)   -1.83 (52)   -1.72 (104)   -1.80 (206)   0.652
37-45                 -1.96 (36)   -2.01 (61)   -1.65 (54)    -1.87 (151)   0.107
-----------------------------------------------------------------------------------

Using -svyset-, I was able to get the same means and ns in each cell, but was not able to get the same significance level for the difference between the means--she used SPSS to get the p-values. I suspect this is because I specified the cluster, stratum, and pweights in my -svyset- command, whereas the software she used allowed only for the specification of weights (to specify a complex sampling design in SPSS requires an extension that costs about $600).
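[A sketch of the kind of -svyset- specification described in the paragraph above, with placeholder names (psu_id, urban_rural, samp_wt) standing in for the survey's actual PSU, stratum, and weight variables.]

*************************************************
* two-stage cluster sample with urban/rural stratification;
* only the first-stage clusters and the probability weights are declared
svyset psu_id [pweight = samp_wt], strata(urban_rural)
*************************************************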
For those who are familiar with SPSS, she used the following syntax after applying weights and subsetting for a specific level of VARIABLE_C:

MEANS TABLES= VARIABLE_A BY VARIABLE_B
  /CELLS MEAN COUNT STDDEV
  /STATISTICS ANOVA.

I believe the equivalent in Stata to get the means and p-values is the following code, but as Steve pointed out in the conversation copied below from 2009, this is not theoretically correct:

. svy: mean VARIABLE_A if (VARIABLE_C==4), over(VARIABLE_B)
. test [VARIABLE_A]_subpop_1 = [VARIABLE_A]_subpop_2 = [VARIABLE_A]_subpop_3

My question is whether I should be attempting to compare the means using the -svyset-/-test- commands at all (is what I am trying to do legitimate?), or whether I should omit this comparison from my tables?

Thanks,
Rieza
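[Following Steve's point about -subpop- versus -if-, a sketch of a design-consistent version of that last block; the level VARIABLE_C==4 is kept from the original, and whether the joint test belongs in the table at all is the question raised above.]

*************************************************
* subset with -subpop()- rather than -if- so that the standard errors
* reflect the full design
svy, subpop(if VARIABLE_C==4): mean VARIABLE_A, over(VARIABLE_B)

* joint 2 d.f. test that the three subgroup means are equal;
* see the Bonferroni note in Steve's reply above
test [VARIABLE_A]_subpop_1 = [VARIABLE_A]_subpop_2 = [VARIABLE_A]_subpop_3
*************************************************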