# st: Transforming response scales

 From "Clyde Schechter" To statalist@hsphsun2.harvard.edu Subject st: Transforming response scales Date Wed, 17 Mar 2010 20:14:35 -0700

```<>

This is a statistics/questionnaire design problem.  In a study, one of our
measures is a 17-item questionnaire.  Each item has a discrete 4-point
response scale, with response 1 anchored at "Not at all" and 4 anchored at
"A great deal."  No descriptors were provided for the intermediate
responses.

Unfortunately, the first 70 respondents received an incorrect version of
the questionnaire where the response scale went from 1 to 6, with the same
anchors at the extremes, and no descriptors for responses 2 through 5.
After the error was noted, the questionnaire was fixed, and the correct
version has been used since.  The study is ongoing, and we hope to obtain
data from an additional 250 or so respondents by the time we're done.

The question is whether we can salvage the data from those first 70
respondents by transforming the 6-point response set onto the 4-point
version.  There are only 10 monotone increasing functions from
{1,2,3,4,5,6} onto {1,2,3,4}, so I tried applying each of them to see how
the overall distribution of transformed responses would compare to the
observed distribution of responses from those who were given the correct
version of the questionnaire.  One transformation:

recode resp6 (1/2=1) (3=2) (4=3) (5/6=4), gen(resp6_4)

produces a qualitatively decent match of response frequencies:

  response |     resp4    resp6_4 |     Total
-----------+----------------------+----------
         1 |     1,028        787 |     1,815
           |     64.41      67.26 |     65.62
-----------+----------------------+----------
         2 |       169         86 |       255
           |     10.59       7.35 |      9.22
-----------+----------------------+----------
         3 |       158         81 |       239
           |      9.90       6.92 |      8.64
-----------+----------------------+----------
         4 |       241        216 |       457
           |     15.10      18.46 |     16.52
-----------+----------------------+----------
     Total |     1,596      1,170 |     2,766
           |    100.00     100.00 |    100.00

While these distributions are "statistically significantly" different by
an ordinary chi square test, that overstates the difference because these
individual responses are nested within items and respondents.
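
* A minimal sketch of how the 10 candidate maps could be generated, assuming
* the 6-point responses are in a variable named resp6.  Each monotone map
* from {1,...,6} onto {1,2,3,4} corresponds to choosing 3 of the 5 gaps
* between adjacent values as cut points a < b < c.  The names map_1 ...
* map_10 are illustrative only.
local k = 0
forvalues a = 1/3 {
    local bmin = `a' + 1
    forvalues b = `bmin'/4 {
        local cmin = `b' + 1
        forvalues c = `cmin'/5 {
            local ++k
            * new value = 1 + number of cut points lying below the response
            generate byte map_`k' = 1 + (resp6 > `a') + (resp6 > `b') + (resp6 > `c') if !missing(resp6)
            label variable map_`k' "cuts after `a' `b' `c'"
        }
    }
}
* The map labelled "cuts after 2 3 4" is the recode shown above.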

Since these responses are ordinal in nature, and will be later analyzed by
calculating mean item response for each respondent anyway, I also looked
at the difference in mean response.  In a mixed model, with item as a
fixed effect and a random effect for respondent, the adjusted mean
difference in response between the group given the correct 4-point
response set and the transformed responses from those given the 6-point
response set is only 0.01, 95% CI -0.24 to +0.26.  The estimated
difference of 0.01 strikes me as a small enough bias to ignore, though if
the true bias is near the extremes of that 95% CI, I would be concerned.
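
* A minimal sketch of a model of this kind, assuming long-format data with
* one row per respondent-item.  The variable names (resp, item, version, id)
* are assumptions for illustration; on older Stata releases the command is
* xtmixed rather than mixed.
*   resp    = response on the 4-point scale (resp4 or resp6_4)
*   version = 0 if given the 4-point form, 1 if given the 6-point form
*   id      = respondent identifier
mixed resp i.item i.version || id:
* The coefficient on 1.version estimates the adjusted mean difference
* between the two groups, with its 95% CI.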

I'm operating on the assumption that the group who received the 6-point
response set are something like a random subset of the respondents.  They
were the first 70 enrolled in the study.  There is no reason to expect any
secular trend in what we are measuring during the course of the study,
though I suppose one could think about things like learning curves on the
part of study personnel somehow influencing these responses (obtained over
a period of time).

It dawns on me that if I were to discard the data from the first 70, treat
them as missing, and (multiply) impute them from other data we are
gathering from our participants using, say, a multivariate normal
imputation model, nobody would object.  It feels to me as if the kind of
deterministic transformation I'm looking at in this message would be a
more valid kind of imputation, based as it is on response to an identical
question with a related response set.  But it is, of course, ad hoc, and
I do not have the theoretical knowledge to figure out what kind of
statistical properties inferences based on it would have.
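
* For comparison, a rough sketch of the multiple-imputation alternative,
* assuming one row per respondent, with the mean item response (meanscore)
* set to missing for the first 70.  The names meanscore, x1, x2 and outcome
* are hypothetical, as are the number of imputations and the seed.
mi set wide
mi register imputed meanscore
mi impute mvn meanscore = x1 x2, add(20) rseed(12345)
mi estimate: regress outcome meanscore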

Wondering if any of the Statalisters have a reaction to this idea.

Clyde Schechter, MA MD
Associate Professor of Family & Social Medicine
Albert Einstein College of Medicine, Bronx, NY, USA
