
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Bootstrapping question


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Bootstrapping question
Date   Fri, 8 Feb 2013 10:52:10 +0000

Maarten's general comments (see elsewhere) are well taken. In terms of
what you _could_ do,
the clash between prior knowledge (an outcome is possible) and sample
results (the proportion of a certain outcome is estimated at zero)
would suggest at least a dose of Bayesian thinking to some.

I'm forgetting my own work:

Nicholas J. Cox
Speaking Stata: I. J. Good and quasi–Bayes smoothing of categorical frequencies
Stata Journal 9(2): 306-314

Abstract.  I. J. Good (1916–2009) was a prolific scientist who
contributed to many fields, mostly from a Bayesian standpoint. This
column explains his idea of quasi-Bayes (a.k.a. pseudo-Bayes)
estimation or smoothing of categorical frequencies in a contingency
table, which is especially useful as a way of dealing with awkward
sampling or random zeros. It shows how the method can be implemented,
almost calculator-style, using a combination of Stata and Mata.
Convenience commands qsbayesi and qsbayes are also introduced.

.pdf at http://www.stata-journal.com/sjpdf.html?articlenum=st0168

See also the correction included within

http://www.stata-journal.com/sjpdf.html?articlenum=gr0039

You would need to combine that with -bootstrap-. It won't solve all
the problems. You'd need a fair guess at the "correct" underlying
distribution, easier said than done.
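
To see what smoothing does to zero cells, here is a minimal sketch, not
the method of the paper. It is plain additive smoothing with a fixed
pseudo-count, whereas quasi-Bayes smoothing sets the amount of smoothing
using the data and a prior guess at the distribution. The constant
a = 0.5 and the choice of the foreign cars in the auto data are
assumptions for illustration only.

```stata
* Sketch only: additive smoothing with an ASSUMED flattening constant a.
* Not Good's quasi-Bayes estimator; a is fixed here, not estimated.
sysuse auto, clear
local a = 0.5                  // assumption: pseudo-count added to each cell
local K = 5                    // categories 1..5 are known in advance
quietly count if foreign & !missing(rep78)
local N = r(N)
local total = 0
forvalues i = 1/`K' {
    quietly count if foreign & rep78 == `i'
    local raw = r(N)/`N'
    local sm`i' = (r(N) + `a')/(`N' + `K'*`a')
    local total = `total' + `sm`i''
    display "rep78 = `i'   raw " %5.3f `raw' "   smoothed " %5.3f `sm`i''
}
display "smoothed fractions sum to " %5.3f `total'
```

Categories 1 and 2, which never occur among the foreign cars, get small
positive smoothed fractions. Wrapping such a calculation in an rclass
program would let -bootstrap- resample it, but the results would still
depend on the assumed constant.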

Nick

On Fri, Feb 8, 2013 at 1:59 AM, Nick Cox <[email protected]> wrote:
> No; I suggest that would be confused thinking. With an ordinal scale,
> you should know in advance which values are possible as a matter of
> definition. If one or more values don't occur in a sample, that
> doesn't reduce the number of possibilities any more than observing
> just men at a meeting means that there is really only one gender.
>
> I was thinking vaguely of some kind of multinomial model, but I have
> no ideas on how to implement that.
>
> Here is a demonstration of my point about -bootstrap-. I wrote a
> program which is given the possible integer values of a variable and
> returns the empirical frequencies as fractions. (Note that there is
> actually no assumption of an ordinal scale, but we are estimating
> fractions simultaneously and so those fractions must sum to 1.)
>
> *! 1.0.0 NJC 8 February 2013
> program myfrac, rclass
>         version 8.2
>         syntax varname(numeric) [if] [in] , Values(numlist int min=2) ///
>         [Format(str) *]
>
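>         quietly {
>                 * observations to use; stop if there are none
>                 marksample touse
>                 count if `touse'
>                 if r(N) == 0 exit 2000
>                 local N = r(N)
>
>                 * possible values become `1', `2', ...; results are
>                 * stored in the first `nvals' observations
>                 tokenize "`values'"
>                 local nvals : word count `values'
>                 tempvar vals frac
>                 gen `frac' = 0
>                 gen `vals' = 0
>                 char `frac'[varname] "fractions"
>                 char `vals'[varname] "values"
>
>                 * frequency of each possible value, zeros included
>                 forval i = 1/`nvals' {
>                         count if `touse' & `varlist' == ``i''
>                         replace `frac' = r(N) in `i'
>                         replace `vals' = ``i'' in `i'
>                 }
>
>                 * frequencies must account for every observation used
>                 su `frac' in 1/`nvals', meanonly
>                 if r(sum) != `N' {
>                         di as err "values outside `values'?"
>                         exit 498
>                 }
>
>                 * convert frequencies to fractions
>                 replace `frac' = `frac'/`N'
>
>                 * return fractions as r(r1), r(r2), ...
>                 forval i = 1/`nvals' {
>                         return scalar r`i' = `frac'[`i']
>                 }
>         }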
>
>         if "`format'" == "" local format %5.3f
>         format `frac' `format'
>
>         list `vals' `frac' in 1/`nvals', subvarname noobs `options'
> end
>
> . sysuse auto
> (1978 Automobile Data)
>
> . tab rep78
>
>      Repair |
> Record 1978 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |          2        2.90        2.90
>           2 |          8       11.59       14.49
>           3 |         30       43.48       57.97
>           4 |         18       26.09       84.06
>           5 |         11       15.94      100.00
> ------------+-----------------------------------
>       Total |         69      100.00
>
> . myfrac rep78, values(1/5)
>
>   +-------------------+
>   | values   fraction |
>   |-------------------|
>   |      1      0.029 |
>   |      2      0.116 |
>   |      3      0.435 |
>   |      4      0.261 |
>   |      5      0.159 |
>   +-------------------+
>
> . ret li
>
> scalars:
>                  r(r5) =  .1594202965497971
>                  r(r4) =  .260869562625885
>                  r(r3) =  .4347825944423676
>                  r(r2) =  .1159420311450958
>                  r(r1) =  .028985507786274
>
> . bootstrap p1=r(r1) p2=r(r2) p3=r(r3) p4=r(r4) p5=r(r5) : myfrac rep78, values
>> (1/5)
> (running myfrac on estimation sample)
>
> [...]
>
> Bootstrap replications (50)
> ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
> ..................................................    50
>
> Bootstrap results                               Number of obs      =        74
>                                                 Replications       =        50
>
>       command:  myfrac rep78, values(1/5)
>            p1:  r(r1)
>            p2:  r(r2)
>            p3:  r(r3)
>            p4:  r(r4)
>            p5:  r(r5)
>
> ------------------------------------------------------------------------------
>              |   Observed   Bootstrap                         Normal-based
>              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>           p1 |   .0289855   .0197784     1.47   0.143    -.0097794    .0677504
>           p2 |    .115942   .0393045     2.95   0.003     .0389066    .1929775
>           p3 |   .4347826   .0647383     6.72   0.000     .3078979    .5616673
>           p4 |   .2608696    .054154     4.82   0.000     .1547297    .3670095
>           p5 |   .1594203   .0385508     4.14   0.000     .0838622    .2349784
> ------------------------------------------------------------------------------
>
> . bootstrap p1=r(r1) p2=r(r2) p3=r(r3) p4=r(r4) p5=r(r5) : myfrac rep78 if fore
>> ign, values(1/5)
> (running myfrac on estimation sample)
>
> [...]
>
> Bootstrap replications (50)
> ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
> ..................................................    50
>
> Bootstrap results                               Number of obs      =        22
>                                                 Replications       =        50
>
>       command:  myfrac rep78, values(1/5)
>            p1:  r(r1)
>            p2:  r(r2)
>            p3:  r(r3)
>            p4:  r(r4)
>            p5:  r(r5)
>
> ------------------------------------------------------------------------------
>              |   Observed   Bootstrap                         Normal-based
>              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>           p1 |  (dropped)
>           p2 |  (dropped)
>           p3 |   .1428571   .0830165     1.72   0.085    -.0198522    .3055665
>           p4 |   .4285714   .1162858     3.69   0.000     .2006555    .6564874
>           p5 |   .4285714   .1023406     4.19   0.000     .2279876    .6291553
> ------------------------------------------------------------------------------
>
> There are two major problems at least.
>
> 1. The procedure as programmed here can't give c.i.s associated with
> observed zero frequencies.
> 2. Intervals are not guaranteed to stay in [0,1].
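
On problem 2, percentile intervals are one option: they are quantiles of
the bootstrap replicates, so they cannot leave [0,1] when every
replicate is a fraction. The following is a minimal sketch, not the
program above. It uses an indicator variable and -summarize- for a
single category, and the variable name rep1, reps(1000) and seed(2013)
are arbitrary choices. It does nothing for problem 1, as a category
with observed frequency zero is zero in every replicate.

```stata
* Sketch: percentile bootstrap interval for one category fraction.
* rep1, reps(1000) and seed(2013) are illustrative assumptions.
sysuse auto, clear
generate byte rep1 = (rep78 == 1) if !missing(rep78)
bootstrap p1 = r(mean), reps(1000) seed(2013): summarize rep1, meanonly
estat bootstrap, percentile
```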
>
> Nick
>
> On Thu, Feb 7, 2013 at 10:27 PM, Ilian, Henry (ACS)
> <[email protected]> wrote:
>> Thanks. That's helpful. If there are zero occurrences of a value, doesn't it mean that the scale has one fewer values: instead of a five-point scale, a four-point scale, etc.? Of course, if the value existed in the population but didn't appear in the sample, then the sample would be off, and how much off it would be would depend on the prevalence of that value in the population, but isn't that always a possibility in sampling?
>>
>> Could you say more about a model for my generating process? I understand models as consisting of explanatory variables and covariates to predict a response variable. That doesn't seem to be the kind of model you mean.
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
>> Sent: Thursday, February 07, 2013 5:02 PM
>> To: [email protected]
>> Subject: Re: st: Bootstrapping question
>>
>> Suppose that your ordinal categories are potentially 1 ... 5 but in
>> any sample there are zero occurrences of 5. Then no bootstrapped
>> sample (= sampling with replacement) can ever contain 5 and a
>> confidence interval cannot include positive outcomes. There is likely
>> to be a device for this problem, but it suggests to me that you need a
>> model for your generating process.
>>
>> Nick
>>
>> On Thu, Feb 7, 2013 at 9:28 PM, Ilian, Henry (ACS)
>> <[email protected]> wrote:
>>> I looked at the table of contents. The book is clearly worth having, but it doesn't seem to cover the sample-size problem--which actually may not be a problem, since the sample size is what it is, and there isn't a way to make it any larger. By improved, I meant narrower, although that's such an obvious answer I don't think it was what you were asking me. If bootstrapping won't result in narrower confidence intervals, then I'll have to live with the confidence intervals as they are.
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
>>> Sent: Thursday, February 07, 2013 3:12 PM
>>> To: [email protected]
>>> Subject: Re: st: Bootstrapping question
>>>
>>> This clarifies your situation considerably (but still does not explain
>>> what you mean by "improved"!).
>>>
>>> I'd look at Alan Agresti's books e.g.
>>>
>>> http://www.amazon.com/Analysis-Ordinal-Categorical-Probability-Statistics/dp/0470082895/
>>>
>>> Nick
>>>
>>> On Thu, Feb 7, 2013 at 7:57 PM, Ilian, Henry (ACS)
>>> <[email protected]> wrote:
>>>> Nick, you're right. Some of the potential (and actual) outcomes have observed zeros. I looked everywhere I could think of for the formula to compute sample sizes for multiple categories but couldn't find it. In the process, I read somewhere that the problem of multiple categories reduces to a two-category problem. I asked one statistician about this, and he said not true and suggested I take his on-line advanced sampling class. I certainly am considering that, but for the meanwhile, I still have a sample size that results in very wide confidence intervals. Again, my understanding from reading about bootstrapping is that one of the things bootstrapping was designed to do was to improve estimates of confidence intervals in small samples. My question is, can I use it in this situation, and if I do, what do I report?
>>>>
>>>> Many thanks,
>>>>
>>>> Henry
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
>>>> Sent: Thursday, February 07, 2013 4:18 AM
>>>> To: [email protected]
>>>> Subject: Re: st: Bootstrapping question
>>>>
>>>> This focuses on intervals for a single (binomial) proportion. Henry's
>>>> question was about ordinal variables.
>>>> At a wild guess, he wants simultaneous confidence intervals. It also
>>>> sounds as if some of his potential outcomes have observed zeros.
>>>>
>>>> Nick
>>>>
>>>> On Thu, Feb 7, 2013 at 3:24 AM, Steve Samuels <[email protected]> wrote:
>>>>>
>>>>> Lenth's site states that he uses the "exact", presumably Clopper-Pearson,
>>>>> intervals, which are known to be conservative. But this is not
>>>>> what Stata computes. Since a 50% sample proportion is not possible with
>>>>> n = 27, the closest one can get to 50% is with k = 13 or 14 events. I'm
>>>>> not sure what Lenth's applet shows (I don't have Java enabled), but Stata
>>>>> does not show a 17.3% margin of error, rather a number closer to 20%.
>>>>> . cii 27 13
>>>>>                                                       -- Binomial Exact --
>>>>>  Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
>>>>> ----------+---------------------------------------------------------------
>>>>>           |         27    .4814815     .096159        .2866725    .6805035
>>>>>
>>>>> . cii 27 14
>>>>>                                                      -- Binomial Exact --
>>>>>  Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
>>>>> ----------+---------------------------------------------------------------
>>>>>           |         27    .5185185     .096159        .3194965    .7133275
>>>>>
>>>>> You could have skipped the trip to Lenth's site and used Stata's own -cii- command
>>>>> with the Wilson intervals, recommended for n < 40 (Brown et al., 2001).
>>>>>
>>>>> . cii 27 13, wilson
>>>>>                                                         ------ Wilson ------
>>>>>  Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
>>>>> ----------+---------------------------------------------------------------
>>>>>           |         27    .4814815     .096159        .3074323    .6601438
>>>>>
>>>>> . cii 27 14, wilson
>>>>>                                                         ------ Wilson ------
>>>>>  Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
>>>>> ----------+---------------------------------------------------------------
>>>>>           |         27    .5185185     .096159        .3398562    .6925677
>>>>>
>>>>> Brown, L. D., T. T. Cai, and A. DasGupta. 2001. Interval estimation for
>>>>> a binomial proportion. Statistical Science 16: 101-133
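
The same command also gives an interval when a category has an observed
count of zero, although only for one proportion at a time and not
simultaneously for all categories. As a minimal sketch, the counts 0 of
21 are purely illustrative. The immediate syntax is the one used in this
thread; Stata 14 and later expect -cii proportions 21 0, wilson- unless
the command is run under version control, as assumed here.

```stata
* Sketch: Wilson interval for an observed zero (0 events in 21 trials).
* The counts are illustrative; version control keeps the old cii syntax.
version 8.2: cii 21 0, wilson
display "upper limit = " %6.4f r(ub)
```

The lower limit is zero but the upper limit is positive, which is
exactly what resampling a sample with no occurrences cannot deliver.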
>>>>>
>>>>> Steve
>>>>>
>>>>> Steven J Samuels
>>>>> Consulting Statistician
>>>>> 18 Cantine's Island
>>>>> Saugerties NY 12477
>>>>> Voice: 845-246-0774
>>>>>
>>>>> On Feb 6, 2013, at 8:08 PM, Nick Cox wrote:
>>>>>
>>>>> I don't think I can add usefully to my previous comments. Saying that
>>>>> you used a particular program does not convey much to me. What does
>>>>> "improve the confidence intervals" mean?
>>>>>
>>>>> Nick
>>>>>
>>>>> On Wed, Feb 6, 2013 at 8:53 PM, Ilian, Henry (ACS)
>>>>> <[email protected]> wrote:
>>>>>> The sample size is 27, which is the largest number the case readers can handle in the amount of time allotted. The population the sample is drawn from is 140. To get a confidence interval for a proportion I used Lenth's on-line application, http://homepage.cs.uiowa.edu/~rlenth/Power/. Since the items have different proportions, there are several confidence intervals. Using 50% as the proportion (meaning that for a particular item, 50% of the sample were awarded the highest ordinal rating, and the other 50% were awarded other ratings), I got a margin of error of 17.3%. For a proportion of 70%, the margin of error is 16%, etc.
>>>>>>
>>>>>> I'm new to the idea of bootstrapping, but it seemed to be a way to improve the confidence intervals.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
>>>>>> Sent: Wednesday, February 06, 2013 2:04 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: st: Bootstrapping question
>>>>>>
>>>>>> For once, my line differs slightly from Maarten's.
>>>>>>
>>>>>> The crunch is that nowhere did Henry state where his confidence
>>>>>> intervals come from. If they were based on inappropriate assumptions,
>>>>>> bootstrapping may do better. But if the confidence intervals one way
>>>>>> are wide, the expectation is of a similar story from -bootstrap-.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Wed, Feb 6, 2013 at 6:55 PM, Maarten Buis <[email protected]> wrote:
>>>>>>> On Wed, Feb 6, 2013 at 5:28 PM, Ilian, Henry (ACS)  wrote:
>>>>>>>> I am working with samples that result in very large confidence intervals, and there is no way to get larger samples. Therefore bootstrapping is an appealing option.
>>>>>>>
>>>>>>> Unfortunately the bootstrap is not going to help. The large confidence
>>>>>>> intervals mean that there is very little information present in your
>>>>>>> data, and no statistical technique can add information that was not
>>>>>>> present in your data to begin with. So it seems that you will just
>>>>>>> have to live with the very large confidence intervals.
