Stata The Stata listserver

Re: st: Analysis of scoring data???

From   Rose Medeiros
Subject   Re: st: Analysis of scoring data???
Date   Wed, 16 Nov 2005 13:37:20 -0500


The data you have will not allow you to test hypotheses such as "Clinicians of type one score more highly than those of type two." Although each clinician has rated 100 images, you have ratings from only three clinicians of each type. That is too few clinicians to support inferences about the populations of these two types of clinician.

You should be able to address hypotheses about the relative usefulness of the various types of images. However, keep in mind that 100 is still a relatively small sample size, and this small sample is further divided among four types of images. To some extent, what you should do with the multiple ratings depends on how you want to think about these ratings. If you believe that each clinician's rating of each image is independent, that is, that how useful one clinician finds an image is unrelated to how useful another clinician finds it, then you can treat each clinician's rating of each image as a case (n = 600, as opposed to n = 100) and perform a chi-squared test. Logically, however, this assumption doesn't make sense: one would expect a diagnostic image that is useful to one clinician to also be useful to another.
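For concreteness, here is a rough sketch of what that chi-squared test could look like in Stata. It assumes the data are in long format (one row per image-clinician rating, 600 rows), that the variables are named type and score as in the excerpt in the original question but lowercased, and that scores are coded 0-3 with 2 or 3 counting as "useful". All of those are assumptions on my part, so adjust as needed.

```stata
* Assumed: long-format data, 600 rows, variables type and score (score coded 0-3)
* Dichotomize: 0 or 1 = not useful, 2 or 3 = useful (assumed cutoff)
generate byte useful = score >= 2 if !missing(score)
label define useful_lbl 0 "not useful" 1 "useful"
label values useful useful_lbl

* Pearson chi-squared test treating all 600 ratings as independent cases
tabulate type useful, chi2 row
```

Bear in mind that the p-value from this table is only trustworthy if the independence assumption holds, which, as noted, is doubtful here.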
If the purpose of having multiple clinicians rate each image is to get a more accurate rating of the "true" utility of each image (akin to the old adage, measure twice, cut once), then you could combine the scores in some way, by taking the sum or the average of the six ratings (the two are equivalent here, since every image has exactly six ratings). You could then compare the means of these "true" utility scores across image types (using a t-test or ANOVA). This procedure ignores the fact that the level of agreement among clinicians may vary across images: clinicians may tend to agree (give similar scores) about the utility of some images and disagree about others. You could examine this by looking at the standard deviation of the six ratings for each image, or by using Cronbach's alpha (-alpha- in Stata).
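A minimal sketch of this second approach follows. It again assumes long-format data with variables named image_id, type, clinician, and score (lowercased versions of the column names in the original question), and that type and clinician are stored as strings. The new variable names (mean_score, sd_score, type_n, rater) are just ones I made up for illustration.

```stata
* Assumed: long-format data with image_id, type, clinician, score
* (type and clinician assumed to be string variables)

* --- Per-image mean and SD of the six ratings, then compare types ---
preserve
collapse (mean) mean_score = score (sd) sd_score = score, by(image_id type)

* -oneway- needs a numeric grouping variable
encode type, generate(type_n)
oneway mean_score type_n, tabulate

* How much clinicians disagree within images, summarized by image type
tabstat sd_score, by(type_n) statistics(mean min max)
restore

* --- Cronbach's alpha, treating the six clinicians as "items" ---
preserve
encode clinician, generate(rater)
keep image_id rater score
* One row per image, one column per clinician: score1 ... score6
reshape wide score, i(image_id) j(rater)
alpha score1-score6
restore
```

After the -collapse- there are 100 observations, one per image, so the ANOVA is based on the number of images rather than the number of ratings.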
There are other approaches you could take, but in my opinion, the simplest approach is usually better, especially given that you don't have all that much data.


P.S. Since your question was mostly about which analysis is appropriate, I didn't include a lot of information on how to actually do this in Stata.

K Jensen wrote:

I have data involving assessment of the results of different medical imaging techniques, by different specialists.

For a hundred images, each has been assessed on a four point quality scale by six specialists. All the assessors have scored all the images and there are no missing values. The images fall into four types (there are different numbers of each type) and there are two types of clinician (three of each).

So the data looks like:-
Image_ID  Type  Clinician   C_type  Score
1         A     Clinican_1  Radiol  0
1         A     Clinican_2  Radiol  3
1         A     Clinican_3  Radiol  1
1         A     Clinican_4  Radiog  2
1         A     Clinican_5  Radiog  2
1         A     Clinican_6  Radiog  1
100       D     Clinican_6  Radiog  3

We are particularly interested in making inferences about the utility of the different types of image. One distinction is between images scored at 0 or 1 (not useful in practice) v. 2 or 3 (useful).

So a summary could look like:-
Image  |       Score       |
type   |  0 or 1   2 or 3  |
A      |    Na       Ya    |
B      |    Nb       Yb    |
C      |    Nc       Yc    |
D      |    Nd       Yd    |
Total  |    N        Y     |  N+Y=600

Or we could consider using the average scores across clinicians for each image.

Types A and B use different variants of one imaging method, types C and D another.

We would like to test a priori hypotheses, such as "A and B are better than C and D" or "C is better than D" or "Clinicians of type one score more highly than those of type two".

I was tempted to do simple chi-squared tests based on the rows from the "tabulate" command, but have realised that that would in a sense be overestimating the sample size by a factor of six, as we have 100 different images assessed by six clinicians, not 600 different images.

I thought about logistic regression (xi:logit command) on the "score 0 or 1" v. "2 or 3" outcome, but the results (either as beta coefficients or odds ratios) would be less easy to interpret than simple probabilities of falling into different categories.

I also thought about using the glm command and assuming binomial family data (for the dichotomous outcome).

As you will have guessed I am no statistician. How would a professional statistician like to see these data analyzed? I have come to realise as I write that this is a general question rather than a specifically Stata one, so I am sorry if this is an inappropriate query for this list.

Thank you in advance and in hope,

