
# Re: st: Interrater agreement: finding the problematic items

From: Nick Cox
To: "statalist@hsphsun2.harvard.edu"
Subject: Re: st: Interrater agreement: finding the problematic items
Date: Fri, 21 Jun 2013 17:06:36 +0100

```
As said earlier, there is only one rule, to use your own full name.
Nick
njcoxstata@gmail.com

On 21 June 2013 16:50, Ilian, Henry (ACS) <Henry.Ilian@dfa.state.ny.us> wrote:
> Nick,
>
> Thank you for your suggestions.
>
> Because I was under time pressure, I decided to go with your earlier suggestion and use standard deviations. It isn't an entirely satisfactory solution because, as I began to set it up, I realized that the NA choices (indicating that the rater determined that the task wasn't needed) were also important (I hadn't realized that to begin with), and they were difficult to fit into the ordinal scale. I chose to assign them a value of -1, making them one point removed from a rating that the task needed to be done but wasn't (0), two points removed from a rating that the task was partially done (1), and three points removed from a rating that the task was completely done (2). I'm not sure that this was the best place in the scale, but no other place seemed any better. For practical, as opposed to scientific, purposes, this approach is probably good enough, especially since all of the flagged items need to be discussed by all of the raters, and the time available for this discussion is limited.
>
> To find a solution with fewer possibilities for errors, I want to work through your example for the next round of interrater reliability, which will come in about three months. I haven't had the chance to do it this time.
>
> Somebody also suggested I look into the Krippendorff alpha, which is designed to assess interrater agreement for individual items, and which uses probabilities of answer choices--using the sum of squared probabilities was another of your suggestions. I think this approach is also very promising.
>
> If anyone is interested, there are a number of on-line discussions of this technique. I found a very accessible discussion by Krippendorff [Krippendorff, K. (2011). "Computing Krippendorff's Alpha-Reliability." Annenberg School for Communication Departmental Papers.] It's easy to find by typing "Computing Krippendorff's Alpha-Reliability" into a search bar.
>
> I just read your post on Statalist advice, and I hope I'm not violating any rules in this response.
>
> Again, thanks,
>
> Henry
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
> Sent: Friday, June 14, 2013 1:53 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Interrater agreement: finding the problematic items
>
> Many people seem unaware of the simplicity and generality of various measures of inequality, diversity and concentration. (There are many other names.) They may be under the impression that they are rather odd and ad hoc measures used by people in rather odd and ad hoc fields such as economics, sociology or ecology.
>
> Here are a few examples of two such measures done calculator-style.
> All we are assuming is a set of categories, not even ordered, not even numbered, just labelled.
>
> (There are many, many others, but I like these two measures.)
>
> For a change,
>
> . sysuse auto, clear
> (1978 Automobile Data)
>
> . tab rep78, matcell(f_rep)
>
>      Repair |
> Record 1978 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |          2        2.90        2.90
>           2 |          8       11.59       14.49
>           3 |         30       43.48       57.97
>           4 |         18       26.09       84.06
>           5 |         11       15.94      100.00
> ------------+-----------------------------------
>       Total |         69      100.00
>
> . tab foreign, matcell(f_for)
>
>    Car type |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>    Domestic |         52       70.27       70.27
>     Foreign |         22       29.73      100.00
> ------------+-----------------------------------
>       Total |         74      100.00
>
> The stages are
>
> 1. Copy the vectors of frequencies into vectors in Mata.
> 2. Scale to vectors of probabilities.
>
> 3. The sum of squared probabilities is a measure of agreement.
> Everyone agrees => everyone is in one category. One probability is 1 and the others are 0, so the sum is 1. With k equally common categories the sum is 1/k, so the lower limit is 0 (not reached in practice).
>
> This measure, or a relative of it, is variously named for, or attributed to, Gini, Turing, Hirschman, Simpson, Herfindahl, Good and no doubt others.
>
> 4. The reciprocal of this has a nice interpretation as "the equivalent number of equally common categories".
>
> 5. The weighted mean of the log reciprocal probabilities is often known as the entropy. It is often named for Shannon (occasionally for Weaver as well) and/or Wiener. (Weaver and Wiener were precisely two distinct people, but under conditions of lax spelling standards some students have been known to attempt to merge them retrospectively.)
>
> 6. Exponentiating that gives a number with a nice interpretation as "the equivalent number of equally common categories" (another estimate thereof).
>
> . mata
> ------------------------------------------------- mata (type end to exit) -----------
> : f1 = st_matrix("f_rep")
>
> : f1
>         1
>     +------+
>   1 |   2  |
>   2 |   8  |
>   3 |  30  |
>   4 |  18  |
>   5 |  11  |
>     +------+
>
> : p1 = f1 :/ sum(f1)
>
> : p1
>                  1
>     +---------------+
>   1 |  .0289855072  |
>   2 |   .115942029  |
>   3 |  .4347826087  |
>   4 |  .2608695652  |
>   5 |  .1594202899  |
>     +---------------+
>
> : p1:^2
>                  1
>     +---------------+
>   1 |  .0008401596  |
>   2 |  .0134425541  |
>   3 |  .1890359168  |
>   4 |  .0680529301  |
>   5 |  .0254148288  |
>     +---------------+
>
> : sum(p1:^2)
>   .2967863894
>
> : 1/sum(p1:^2)
>   3.369426752
>
> : sum(p1 :* ln(1:/p1))
>   1.357855957
>
> : exp(sum(p1 :* ln(1:/p1)))
>   3.887848644
>
> :
> : f2 = st_matrix("f_rep")
>
> : f2
>         1
>     +------+
>   1 |   2  |
>   2 |   8  |
>   3 |  30  |
>   4 |  18  |
>   5 |  11  |
>     +------+
>
> : p2 = f2 :/ sum(f2)
>
> : p2
>                  1
>     +---------------+
>   1 |  .0289855072  |
>   2 |   .115942029  |
>   3 |  .4347826087  |
>   4 |  .2608695652  |
>   5 |  .1594202899  |
>     +---------------+
>
> : p2:^2
>                  1
>     +---------------+
>   1 |  .0008401596  |
>   2 |  .0134425541  |
>   3 |  .1890359168  |
>   4 |  .0680529301  |
>   5 |  .0254148288  |
>     +---------------+
>
> : sum(p2:^2)
>   .2967863894
>
> : 1/sum(p2:^2)
>   3.369426752
>
> : sum(p2 :* ln(1:/p2))
>   1.357855957
>
> : exp(sum(p2 :* ln(1:/p2)))
>   3.887848644
>
> :
> : end
> -------------------------------------------------------------------------------------
> Nick
> njcoxstata@gmail.com
>
>
> On 14 June 2013 16:34, Nick Cox <njcoxstata@gmail.com> wrote:
>>
>> Some Statalist members are well versed in psychometrics but I see no
>> reason why more general statistical ideas should not be relevant too. The
>> standard deviation of ratings for each item would be one measure of
>> disagreement. Perhaps better ones would be the sum of squared
>> probabilities or the entropy of the probability distribution for the
>> rating.
>> Nick
>> njcoxstata@gmail.com
>>
>>
> On 14 June 2013 16:11, Ilian, Henry (ACS) <Henry.Ilian@dfa.state.ny.us> wrote:
>
>>> I'm doing an interrater agreement study on a case-reading instrument. There are five reviewers using an instrument with 120 items. The ratings scales are ordinal with either two, three or four options. I'm less interested in reviewer tendencies than I am in problematic items, those with high levels of disagreement.
>>>
>>> Most of the interrater agreement/interrater reliability statistics look at reviewer tendencies. I can see two ways of getting at agreement on items. The first is to sum all the differences between all possible pairs of reviewers, and those with the highest totals are the ones to examine. The other is Cronbach's alpha. Is there any strong argument for or against either approach, and is there a different approach that would be better than these?
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
>

```
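
The six stages in Nick's 14 June message can also be checked outside Stata. What follows is a minimal Python sketch, not code from the thread: the function name `agreement_measures` is hypothetical, and the frequencies are typed in by hand from the two `tab` tables quoted above. It drops empty categories before taking logs, on the assumption that a category with zero frequency should contribute nothing to the entropy. The second Mata block in the transcript appears to read `f_rep` again rather than `f_for`, so its output repeats the `rep78` results; the sketch runs both sets of frequencies.

```python
import math


def agreement_measures(freqs):
    """Return (sum of squared probabilities, its reciprocal,
    entropy, exp(entropy)) for a list of category frequencies.

    Hypothetical helper following stages 2 to 6 of the quoted message.
    """
    total = sum(freqs)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive number")

    # Stage 2: scale frequencies to probabilities.
    # Assumption: empty categories are dropped, since p * ln(1/p) -> 0 as p -> 0.
    probs = [f / total for f in freqs if f > 0]

    # Stage 3: sum of squared probabilities (1 means complete agreement).
    ssq = sum(p * p for p in probs)

    # Stage 5: entropy, the weighted mean of the log reciprocal probabilities.
    entropy = sum(p * math.log(1 / p) for p in probs)

    # Stages 4 and 6: the two "equivalent numbers" of equally common categories.
    return ssq, 1 / ssq, entropy, math.exp(entropy)


# Frequencies copied by hand from the tabulations of rep78 and foreign.
examples = {
    "rep78": [2, 8, 30, 18, 11],
    "foreign": [52, 22],
}

for name, freqs in examples.items():
    ssq, n_ssq, entropy, n_entropy = agreement_measures(freqs)
    print(f"{name}: sum p^2 = {ssq:.10f}, 1/sum p^2 = {n_ssq:.9f}, "
          f"entropy = {entropy:.9f}, exp(entropy) = {n_entropy:.9f}")
```

For `rep78` the four values should agree with the Mata output quoted above (.2967863894, 3.369426752, 1.357855957, 3.887848644). Applied item by item to the counts of each rating across the five reviewers, a low sum of squared probabilities or a high entropy would flag an item for discussion. Because the measures need only labelled categories, an NA choice could be kept as a category of its own rather than placed on the ordinal scale.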