Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Comparing multiple means with survey data--revisited

From	Rieza Soelaeman <[email protected]>
To	[email protected]
Subject	Re: st: Comparing multiple means with survey data--revisited
Date	Fri, 1 Jun 2012 11:56:42 -0500
OK, got it now.  Thank you, Steve for all your help.

Rieza

On Thu, May 31, 2012 at 5:21 PM, Steve Samuels <[email protected]> wrote:
>
> I still can't make the decision for you, because I don't know the theory and
> background of your problem.
>
> • Your post states: "I don't think we necessarily can make inferences at the
> population level, given the limited data we have."
>
> This is a misconception: with a well-executed probability sample, one can always
> make inferences about the sampled population. If the sample size is limited,
> confidence intervals may be wide and the power of tests will be poor. Even so,
> the CIs will provide some information; the tests will provide none.
>
> • You want to know if the means are equal. In a finite population, this null
> hypothesis will never be true, so the answer in advance is "No". The real
> question is "How, if at all, different", and this is answered by CIs.
>
>
> If you wish to test a hypothesis of no difference, you need to think in terms of
> a super-population or data-generating model in which the null hypothesis
> could be true. Consider a comparison of two reported disease rates. With 100%
> reporting, the SEs for the rates and their differences would be zero (there's no
> sampling). Yet many practitioners would make inferences on the basis of a
> Poisson model in which the "true" rates do not differ. Hypothesis tests would
> also be justified if you study the causal influence of the column variable on
> your outcome, since in this case the specific population is of secondary
> interest.
>
> • If you do use the -test- command, specify the mtest(noadjust) option, then
> multiply the p-values for the pairwise tests by 15; multiply the overall 2 d.f.
> p-value in each row by 5. These are the Bonferroni corrections.
>
> • If you compute CIs only (with -lincom-), you are free to add the finite
> population correction (fpc()) option to your -svyset- statement.
>
> • Your colleague grouped the data into wide intervals. Plotting outcomes against
> individual years would be more informative.
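>
> A small sketch of the Bonferroni arithmetic described above. The two
> p-values are made-up inputs, and capping the result at 1 is the usual
> convention rather than something stated in this thread.
>
> ```stata
> * overall 2 d.f. test in one row: 5 such tests, one per row
> display min(1, 5 * 0.030)
> * one pairwise test: 15 such tests in the table
> display min(1, 15 * 0.004)
> ```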
>
> Reference: Levy, P. S., & Lemeshow, S. (2008). Sampling of Populations : Methods
> and Applications (4th ed.). Hoboken, N.J: Wiley.
>
> Steve [email protected]
>
> On May 30, 2012, at 10:02 PM, Rieza Soelaeman wrote:
>
> Dear Steve,
> Thank you for your reply. The values are negative, because they are
> standardized scores, so individuals that are below the mean have
> negative values, whereas individuals above the mean have positive
> values. Our intention was to see whether the means in each row were
> "equal," hence testing the equality of three means, rather than
> pairwise comparisons (i.e., for individuals where Var_C = 0-9 months,
> are -1.28, -0.57, and -0.36 statistically equal?). Given this
> additional information, is my original -test- statement the correct
> way to assess this?
>
> As far as using -svyset-, I wanted to analyze the data taking into
> consideration the sampling design (2-stage cluster with urban/rural
> stratification), so that the software "knows" to adjust for
> homogeneity within the clusters. I don't think we necessarily can
> make inferences at the population level, given the limited data we
> have.
>
> Thank you also for the tip about using -subpop- rather than -if- for
> subsetting the subgroup. I am re-running the means with your code.
>
> Bests,
> Rieza
>
> On Wed, May 30, 2012 at 11:25 AM, Steve Samuels <[email protected]> wrote:
>
> Correction:
>
> "foreach x of local levels {"
>
> not
>
> "foreach x of local(levels) {"
>
> On May 30, 2012, at 12:05 PM, Steve Samuels wrote:
>
> Rieza:
>
> The means are negative and so don't appear to be "ordinary" descriptive
> statistics. Only you can say whether the purpose of the table is descriptive of
> a population (so that tests are not appropriate) or whether some causal
> hypothesis is in play (e.g., "that such-and-such an intervention will show
> stronger effects for higher levels of variable B and for variable A").
>
> The patterns are very clear:
> 1. Means increase with row number.
> 2. In each row, first column means are higher than third column means.
>
> Confidence intervals for differences are okay for descriptive tables, but even
> if there is a hypothesis floating around, such intervals would just confuse
> things here.
> There would be a minimum of 15 if you did separate tests in each row, and 105
> if all pairwise comparisons in the table are considered. Note that your -test-
> statement tested equality of three means, not of two.
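>
> The counts above can be checked with Stata's -comb()- function. This is only
> a sketch of the arithmetic, assuming the 5 x 3 table of cell means from the
> original post.
>
> ```stata
> * 3 column means per row give comb(3,2) = 3 pairs; 5 rows give 15 in all
> display 5 * comb(3, 2)
> * all 15 cell means compared pairwise
> display comb(15, 2)
> ```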
>
> I do suggest that you add standard errors to the table.
>
> Some alternative code:
>
> *********************
> // convert variable names to lower case for easier typing
> rename VARIABLE_*, lower
>
> svy: mean variable_a, over(variable_b variable_c)
> *********************
>
> For easier copying, you can get the columns of the table with the
> following code.
>
> *************************************************************
> levelsof variable_b, local(levels)
> foreach x of local levels {
>     di "variable_b = `x'"
>     svy, subpop(if variable_b==`x'): mean variable_a, over(variable_c)
> }
> *************************************************************
>
> For correct standard errors, use the -subpop- option to subset data, not the
> -if- qualifier.
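>
> A minimal sketch of the same idea with an indicator variable instead of an
> inline -if- expression. The name insub is hypothetical, the data are assumed
> to be -svyset- already, and level 4 of variable_c is taken from the original
> post.
>
> ```stata
> * 1 = in the subpopulation, 0 = everyone else
> generate byte insub = (variable_c == 4)
> * all observations stay in the estimation sample, so the design information is kept
> svy, subpop(insub): mean variable_a, over(variable_b)
> ```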
>
> Steve
> [email protected]
>
> On May 29, 2012, at 11:37 PM, Rieza Soelaeman wrote:
>
> Dear Stata-Lers,
> I need your help in clarifying an earlier point made about testing the
> difference between means in survey data (that is, you can't/shouldn't do
> this, I have copied the thread at the end of this e-mail). I am trying to
> replicate the work of a colleague who left recently. She created a table
> where the rows represent levels of one variable, columns represent the
> levels of another variable, and the cells contain the mean value of a third
> variable for that row/column combination and the number of people in that
> group.
>
> Example:
>
> In cells: Mean of Variable A (n)
>
> -----------------------------------------------------------------------------
>                          Variable B (years)
> -----------------------------------------------------------------------------
> Variable C
> (months)    5-10          11-15         16-20          Total          p-value
> -----------------------------------------------------------------------------
> 0-9         -1.28 (21)    -0.57 (60)    -0.36 (75)     -0.57 (156)    0.032
> 10-18       -1.44 (30)    -0.92 (47)    -1.00 (54)     -1.07 (132)    0.15
> 19-27       -1.95 (64)    -1.68 (77)    -1.63 (126)    -1.72 (268)    0.314
> 28-36       -1.92 (51)    -1.83 (52)    -1.72 (104)    -1.80 (206)    0.652
> 37-45       -1.96 (36)    -2.01 (61)    -1.65 (54)     -1.87 (151)    0.107
> -----------------------------------------------------------------------------
>
> Using -svyset-, I was able to get the same means and ns in each cell, but was
> not able to get the same significance level for the difference between the
> means--she used SPSS to get the p-values. I suspect this is because I
> specified the cluster, stratum, and pweights in my -svyset- command, whereas
> the software she used allowed only for the specification of weights (to
> specify a complex sampling design in SPSS requires an extension that costs
> about $600).
>
> For those who are familiar with SPSS, she used the following syntax after
> applying weights, and subsetting for a specific level of VARIABLE_C:
>
> MEANS TABLES= VARIABLE_A BY VARIABLE_B
> /CELLS MEAN COUNT STDDEV
> /STATISTICS ANOVA.
>
> I believe the equivalent in Stata to get the means and p-values is to use
> the following code, but as Steve pointed out in the conversation copied
> below from 2009, this is not theoretically correct:
>
> . svy: mean VARIABLE_A if (VARIABLE_C==4), over(VARIABLE_B)
>
> . test [VARIABLE_A]_subpop_1 = [VARIABLE_A]_subpop_2 = [VARIABLE_A]_subpop_3
>
> My question is whether I should be attempting to compare the means using the
> -svyset-/-test- commands at all (is what I am trying to do
> legitimate), or if I should omit this comparison from my tables?
>
> Thanks,
> Rieza
>
> -----------------------------------------------------------------------------
>
> Re: st: comparing multiple means with survey data
>
> ________________________________
> From [email protected]
> To [email protected]
> Subject Re: st: comparing multiple means with survey data
> Date Tue, 23 Jun 2009 12:52:47 -0400
> ________________________________
>
> Your syntax is correct. You don't need the "linearized" option, as it
> is the default for -svy: mean-.
>
> However, hypothesis testing is usually not appropriate for finite
> population studies. See:
> http://www.stata.com/statalist/archive/2009-02/msg00806.html If
> hypothesis testing is appropriate for your situation, then you should
> exclude the finite population correction (fpc) option from your
> -svyset- command.
>
> I'm guessing that you also (or only) want to know how different the
> means in the categories of var2 are. Confidence intervals will
> provide the answer, and you can keep the finite population correction
> in your -svyset- statement if appropriate.
>
> It is poor practice (and cumbersome) to label categories with strings
> like "var2a" "var2b". These are unnecessary as "a", "b", .. have no
> descriptive value. Just make var2 a numeric variable with values 1 2
> 3 4 5. Use -label define- and -label values- to associate the
> numeric values with descriptive text.
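>
> As a sketch of that labelling step (the label name var2lbl and the label
> text are placeholders, since the thread does not say what the categories
> mean):
>
> ```stata
> * attach descriptive text to the numeric codes 1-5 of var2
> label define var2lbl 1 "first category" 2 "second category" 3 "third category" 4 "fourth category" 5 "fifth category"
> label values var2 var2lbl
> ```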
>
> Assuming that you do that, the easiest way to get confidence
> intervals for all pairwise differences after -svy: mean- is to write
> out the 10 statements
>
> lincom _b[1] - _b[2]
> lincom _b[1] - _b[3]
> ...
> lincom _b[4] - _b[5]
>
> For plotting continuous outcomes with groups I recommend -dotplot-
> although it will not take weights.
>
> -Steve
>
> On Mon, Jun 22, 2009 at 4:31 PM, Jean-Gael Collomb <[email protected]> wrote:
>
> Hello -
> I have been struggling to find a way to compare the means of the different
> categories of one of my variables. I think I have found a way but I wonder
> if there would be a more efficient way to do it. In the following example,
>
> var2 has five categories (var2a-var2e).
> Here are the commands I type (after survey setting the data):
>
> svy linearized : mean var1, over(var2)
> test [var1]var2a = [var1]var2b = [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2b = [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2c = [var1]var2d = [var1]var2e, mtest(b)
> test [var1]var2d = [var1]var2e, mtest(b)
>
> Is there a better way to do this?
>
> Thanks!
>
> Jean-Gael "JG" Collomb
> PhD candidate
> School of Natural Resources and Environment / School of Forest Resources and
> Conservation
> University of Florida
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/