st: Transforming response scales

From   "Clyde Schechter"
Subject   st: Transforming response scales
Date   Wed, 17 Mar 2010 20:14:35 -0700


This is a statistics/questionnaire design problem.  In a study, one of our
measures is a 17-item questionnaire.  Each item has a discrete 4-point
response scale, with response 1 anchored at "Not at all" and 4 anchored at
"A great deal."  No descriptors were provided for the intermediate responses.

Unfortunately, the first 70 respondents received an incorrect version of
the questionnaire where the response scale went from 1 to 6, with the same
anchors at the extremes, and no descriptors for responses 2 through 5. 
After the error was noted, the questionnaire was fixed, and the correct
version has been administered to just under 400 additional participants. 
The study is ongoing, and we hope to obtain data from an additional 250 or
so respondents by the time we're done.

The question is whether we can salvage the data from those first 70
respondents by transforming the 6-point response set onto the 4-point
version.  There are only 10 monotone (nondecreasing) functions from
{1,2,3,4,5,6} onto {1,2,3,4}, so I tried applying each of them to see how
the overall distribution of transformed responses would compare to the
observed distribution of responses from those who were given the correct
version of the questionnaire.  One transformation:

recode resp6 (1/2=1) (3=2) (4=3) (5/6=4), gen(resp6_4)

produces a qualitatively decent match of response frequencies:

  response |     resp4    resp6_4 |     Total
         1 |     1,028        787 |     1,815
           |     64.41      67.26 |     65.62
         2 |       169         86 |       255
           |     10.59       7.35 |      9.22
         3 |       158         81 |       239
           |      9.90       6.92 |      8.64
         4 |       241        216 |       457
           |     15.10      18.46 |     16.52
     Total |     1,596      1,170 |     2,766
           |    100.00     100.00 |    100.00
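
For anyone who wants to repeat the comparison: the 10 candidate maps
correspond to the 10 ways of choosing three cut points among the five gaps
between adjacent 6-point responses.  A minimal sketch of how they could be
generated follows.  It assumes the 6-point responses are in resp6, as in the
recode command above; the names map1 through map10 are made up for
illustration.

```stata
* Sketch: enumerate all monotone maps from {1,...,6} onto {1,2,3,4}.
* Each map is defined by three cut points 1 <= a < b < c <= 5:
* responses <= a go to 1, a+1..b to 2, b+1..c to 3, and c+1..6 to 4.
* Assumes resp6 holds the 6-point responses; map1-map10 are hypothetical names.
local k = 0
forvalues a = 1/3 {
    forvalues b = `=`a'+1'/4 {
        forvalues c = `=`b'+1'/5 {
            local ++k
            * Missing values compare as larger than any number, so exclude them.
            generate byte map`k' = 1 + (resp6 > `a') + (resp6 > `b') + (resp6 > `c') if !missing(resp6)
            label variable map`k' "cuts after `a', `b', `c'"
        }
    }
}
display "`k' candidate maps generated"
```

The recode shown above corresponds to cuts after 2, 3, and 4.  Each
generated variable can then be tabulated against the 4-point responses in
the same way as resp6_4.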

While these distributions are "statistically significantly" different by
an ordinary chi square test, that overstates the difference because these
individual responses are nested within items and respondents.
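
For reference, the ordinary chi-square test can be reproduced directly from
the cell counts in the table with tabi.  The second part is only a sketch of
one way to allow for clustering, and it handles the respondent level only,
not the item level.  The variable names id, response, and version6 are
assumptions, not names from the actual dataset.

```stata
* Ordinary Pearson chi-square from the cell counts in the table above
* (rows = response 1-4; columns = resp4, resp6_4).
tabi 1028 787 \ 169 86 \ 158 81 \ 241 216, chi2 column

* Sketch of a respondent-clustered alternative.  Assumes long-format data
* with hypothetical variables id (respondent), response (1-4), and
* version6 (1 = received the 6-point version, 0 = received the 4-point one).
* Declaring id as the sampling unit gives a design-based test in place of
* the ordinary Pearson test.
svyset id
svy: tabulate response version6, column
```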

Since these responses are ordinal in nature, and will be later analyzed by
calculating mean item response for each respondent anyway, I also looked
at the difference in mean response.  In a mixed model, with item as a
fixed effect and a random effect for respondent, the adjusted mean
difference in response between the group given the correct 4-point
response set and the transformed responses from those given the 6-point
response set is only 0.01, 95% CI -0.24 to +0.26.  The estimated
difference of 0.01 strikes me as a small enough bias to ignore, though if
the true bias is near the extremes of that 95% CI, I would be concerned.
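
A model of the kind described above might be specified roughly as follows.
This is a sketch under assumed variable names (response, version6, item,
id), not the exact command that produced the estimate.  In Stata 13 and
later, mixed takes the place of xtmixed.

```stata
* Sketch only.  Assumes long-format data, one row per respondent-item,
* with hypothetical variable names:
*   response  observed 4-point response, or the transformed 6-point response
*   version6  1 if the respondent received the 6-point version, 0 otherwise
*   item      item number, 1-17
*   id        respondent identifier
* Item enters as a fixed effect; respondent gets a random intercept.
xtmixed response i.version6 i.item || id:
* The coefficient on 1.version6 is the adjusted mean difference between
* the two groups, reported with its 95% confidence interval.
```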

I'm operating on the assumption that the group who received the 6-point
response set are something like a random subset of the respondents.  They
were the first 70 enrolled in the study.  There is no reason to expect any
secular trend in what we are measuring during the course of the study,
though I suppose one could think about things like learning curves on the
part of study personnel somehow influencing these responses (obtained over
the telephone--respondents received the questionnaires in the mail ahead
of time).

It dawns on me that if I were to discard the data from the first 70, treat
them as missing, and (multiply) impute them from other data we are
gathering from our participants using, say, a multivariate normal
imputation model, nobody would object.  It feels to me as if the kind of
deterministic transformation I'm looking at in this message would be a
more valid kind of imputation, based as it is on response to an identical
question with a related response set.  But it is, of course, ad hoc, and
I do not have the theoretical knowledge to figure out what kind of
statistical properties inferences based on it would have.

Wondering if any of the Statalisters have a reaction to this idea.

Thanks for your consideration.

Clyde Schechter, MA MD
Associate Professor of Family & Social Medicine
Albert Einstein College of Medicine, Bronx, NY, USA
