


Re: st: reliability with -icc- and -estat icc-

From   "JVerkuilen (Gmail)" <>
Subject   Re: st: reliability with -icc- and -estat icc-
Date   Wed, 27 Feb 2013 15:30:22 -0500

On Wed, Feb 27, 2013 at 11:57 AM, Rebecca Pope <> wrote:
> Jay, I don't know which Rabe-Hesketh & Skrondal text you are referring
> to, but if it is MLM Using Stata (2012) you'll want Ch 9.

Yes, that's it. I'm not near my copies at the moment to look at them,
but I've read it. I'll look later.

> In the interest of full disclosure, like Nick, psychometrics is not my
> specialty.

Lots of folks compute ICCs.

> The following is as much for my edification as to add to
> the group discussion. Joseph has used two random effects rather than
> one (leaving aside the whole target/rater issue for now). This
> corresponds to crossed effects (advised by R-H & S, so I should have
> read the book yesterday instead of adapting UCLA's code to match
> -icc-) and will reduce the ICC. This differs by design from what is
> implemented with -icc-, which treats the target as fixed, as does the
> code I posted originally. In short Jay, while _all: R isn't wrong, my
> use of fixed effects for part of the model was. Does that sum it up
> appropriately?

I'm not sure but I think so.

If the apps are the ones of interest, then that's fixed effects. If,
on the other hand, you want to generalize to other apps then that's
random effects both ways. In psycholinguistics this is referred to as
the "language-as-fixed-effect" fallacy because they should be using
crossed models due to the desire to generalize to both the corpus and
the population of respondents.
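
A minimal sketch of how the fixed-versus-random choice moves the
coefficient, using the standard two-way ANOVA forms of the ICC (Shrout
and Fleiss's ICC(2,1) and ICC(3,1)). Note the assumptions: the ratings
are made up rather than taken from this thread, the function name is
hypothetical, the code is Python rather than Stata, and the facet whose
status changes here is the rater (column) facet, so this is an analogy
for the apps question and not a reproduction of -icc- or -mixed-
output.

```python
def icc_two_way(ratings):
    """Single-rating ICCs from a complete targets-by-raters table.

    ratings: list of rows, one row per target, one column per rater.
    Returns (icc_random, icc_fixed):
      icc_random: raters treated as random (absolute agreement, ICC(2,1))
      icc_fixed:  raters treated as fixed (consistency, ICC(3,1))
    """
    n = len(ratings)        # number of targets
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-target mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    icc_random = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_fixed = (msr - mse) / (msr + (k - 1) * mse)
    return icc_random, icc_fixed


# Hypothetical data: 4 targets (rows) rated by 3 raters (columns).
# The raters order the targets identically but differ in level.
ratings = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [4, 5, 6],
]
icc_random, icc_fixed = icc_two_way(ratings)
print(round(icc_random, 3), round(icc_fixed, 3))  # 0.625 1.0
```

With these made-up ratings the fixed-rater coefficient is 1 while the
random-rater coefficient drops to 0.625, because the between-rater
variance now enters the denominator.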

Anyway, I was trying to figure out why the model was specified that
way, though it turns out the two specifications were the same, which
is not unusual in mixed models.

> If dramatic disagreement is a mark of unreliable raters, what does
> that say for elections, user reviews, or for that matter faculty
> search committees? Don't take this to mean I don't understand the
> general concept that you want raters to concur. However, we're talking
> about smartphone apps, not e.g. radiology where there is a "true"
> answer (yes/no lung cancer). Hypothetically at least, you could
> legitimately have strong and divergent opinions.

Totally agree. Modeling agreement can have many motives. So for
instance, folks who study romantic couples often use ICC-type
coefficients as measures of similarity. It's not obvious that they
should be high, medium, or low.

> I would argue here that the issue of rater reliability is not an issue
> of disagreement but rather Rater 2's utter inability to distinguish
> between applications. Now, perhaps this means that the applications
> really aren't substantively different from each other and you found 3
> people who just wanted to accommodate you by completing the ranking
> task and Rater 2 happened to be honest. Who knows. I'd say it's
> unlikely, but I've seen some pretty unlikely things happen...

Yes, totally agree. The reason to model something is to determine
this. I think the general view I take is that a meaningful consensus
wasn't arrived at and so the notion of an ICC is probably not
meaningful here. Why that is true needs to be considered further, for
instance by interviewing the raters.
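
One quick check before interviewing anyone is whether each rater's
scores vary across apps at all. A sketch with made-up ratings (not the
data from this thread; rater_spread is a hypothetical helper, written
in Python rather than Stata), in which the second rater gives every
app the same score:

```python
def rater_spread(ratings):
    """Population variance of each rater's scores across targets.

    ratings: list of rows, one row per target, one column per rater.
    """
    n = len(ratings)
    k = len(ratings[0])
    spreads = []
    for j in range(k):
        col = [row[j] for row in ratings]
        mean = sum(col) / n
        spreads.append(sum((x - mean) ** 2 for x in col) / n)
    return spreads


# Hypothetical data: 4 apps (rows) scored by 3 raters (columns).
ratings = [
    [1, 3, 2],
    [2, 3, 1],
    [4, 3, 5],
    [5, 3, 4],
]
spreads = rater_spread(ratings)
# Raters (numbered from 1) whose scores never change across apps
flat = [j + 1 for j, s in enumerate(spreads) if s == 0]
print(spreads)  # [2.5, 0.0, 2.5]
print(flat)     # [2]
```

A rater with zero spread contributes nothing to the between-app
variance, which is a different problem from raters who discriminate
between apps but disagree on the ordering.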

JVVerkuilen, PhD

"It is like a finger pointing away to the moon. Do not concentrate on
the finger or you will miss all that heavenly glory." --Bruce Lee,
Enter the Dragon (1973)