How do I calculate measures such as percent improved minus percent deteriorated?

Title   Creating variables that are “plurality” measures
Author Nicholas J. Cox, Durham University, UK

When reporting ordered or graded scales, working with simple descriptive summaries like

% improved − % deteriorated

or

% ranking as good − % ranking as bad

is sometimes helpful. In such summaries, omitting any neutral or middle category is common (but not essential). Such a measure shows the balance between the two tails: if everybody improved, we get 100; if everybody got worse, we get −100; and if the two tails are equal in size, we get 0.

In political terms, an election could be imagined in which there are votes “for” and “against” from these two categories, and from that context, these measures may be described as plurality measures. (Is there a better general term, or any term that is standard in some field, for particular examples of such measures?) Whatever the terminology, such measures are discussed in Tukey (1977, 498–502), Zeisel (1985, 75–77), Wilkinson (2005, 57–58), and Wexler, Shaffer, and Cotgreave (2017, 186–200).

Naturally, the percent formulation is not compulsory, and you could just as easily—in fact, a little more easily—work with proportions or fractions with results ranging from 1 to −1. In either case, using a difference is natural whenever thinking is in terms of the percent or proportion scale being used. Also, a ratio such as

% ranking as good / % ranking as bad

may be less desirable with small denominators. Either the result may be unstable, or, if the denominators are ever 0, it may be indeterminate.

Consider a three-grade coding, say, 1 = improved, 2 = unchanged, and 3 = deteriorated. To get this summary, we need to generate a score

        (*) gen score = (code == 1) - (code == 3)

or

        gen score = 100 * ((code == 1) - (code == 3))

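and that’s essentially it: the mean of score is the proportion (or, with the factor of 100, the percent) improved minus the proportion (or percent) deteriorated. We just follow generate with summarize; tabulate, summarize( ); tabstat; or whatever we need.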
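As a minimal sketch of the whole sequence, here is a made-up dataset (the ten observations are invented purely for illustration) in which five people improved, three were unchanged, and two deteriorated. The mean of score should then be 50 − 20 = 30.

```stata
* Hypothetical toy data: 5 improved (1), 3 unchanged (2), 2 deteriorated (3)
clear
input byte code
1
1
1
1
1
2
2
2
3
3
end

* Score each observation as 100, 0, or -100
generate score = 100 * ((code == 1) - (code == 3))

* The mean of score is % improved minus % deteriorated
summarize score
display r(mean)
```

Any other command that reports means, such as tabstat or tabulate with the summarize() option, could follow the generate statement instead.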

Taking (*) piece by piece: if code is 1,

        code == 1

is true and is evaluated as 1, and

        code == 3

is false and evaluated as 0, so

        (code == 1) - (code == 3)

evaluates as 1. The principle of true-or-false logical expressions being evaluated as 1 or 0 is discussed at [U] 13.2.3 Relational operators. If code is 2, then score is 0, and if code is 3, then score is −1. (If we multiply by 100, then score is 100, 0, and −100.)
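If you want to see the evaluation for yourself, display works as a quick calculator. This small check substitutes the literal values 1, 2, and 3 for code; no dataset is needed.

```stata
* Substitute each possible code into (code == 1) - (code == 3)
display (1 == 1) - (1 == 3)    // code 1: 1 - 0
display (2 == 1) - (2 == 3)    // code 2: 0 - 0
display (3 == 1) - (3 == 3)    // code 3: 0 - 1
```

The three lines display 1, 0, and −1, matching the walk-through above.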

With this coding, we could also get the same result by

        generate score = 2 - code

If the coding had been reversed, from 1 = deteriorated to 3 = improved, then code - 2 would have worked. For other simple coding sequences, some other linear transformation would have worked. So, why place so much stress on the earlier formulation? It generalizes much more easily to messier examples. Take a five-point scale such as rep78 in the auto data or 1 = strongly agree to 5 = strongly disagree. We might decide to omit the 3s, lump together two codes in each tail, and

        gen score = (code >= 4) - (code <= 2)

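Just as before, the true-or-false expressions evaluate as 1 if true and 0 if false. Here codes 4 and 5 score 1 and codes 1 and 2 score −1; if the low codes are the favorable end of your scale, as with 1 = strongly agree, swap the two expressions.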

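A pitfall to be pointed out immediately is that missing values count as higher than any other numeric value. Without a guard, code >= 4 is true whenever code is missing, so observations with missing codes would be scored as 1. Hence, you will be safer with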

        gen score = (code >= 4) - (code <= 2) if code < .
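The pitfall can be seen with rep78 in the auto data, which is missing for some cars. The sketch below uses rep78 in place of code; the variable names naive and safe are made up for illustration.

```stata
sysuse auto, clear

* Without the guard, missing rep78 satisfies rep78 >= 4 and is scored 1
generate naive = (rep78 >= 4) - (rep78 <= 2)

* With the guard, score is left missing wherever rep78 is missing
generate safe = (rep78 >= 4) - (rep78 <= 2) if rep78 < .

* Compare the two versions, showing missing values explicitly
tabulate naive safe, missing

* The means differ because naive counts missing values in the upper tail
summarize naive safe
```

The cross-tabulation shows the cars with missing rep78 sitting in the naive == 1 row but in the missing column of safe.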

Similar ideas may be useful in situations with just two categories. Also, they may arise with different data structures. Let us illustrate both points with the idea of looking at gender roles across a set of activities, and

        % who are female − % who are male

as a way of summarizing data on who does what. If, in a village, 21 women and zero men do laundry, four men and 11 women fetch water, and 14 men and zero women take care of cows, then neither the male–female ratio nor the female–male ratio can be used throughout to summarize the balance of the sexes. Whenever zero is a denominator, the ratio is indeterminate. Even if no zeros are present, we should worry about sensitivity to small denominators. The difference measure above, however, can be calculated whenever at least one person does the activity, that is, whenever f + m > 0.

If the data come as three variables (activity, f for the number of females, and m for the number of males), then no logical expressions are needed. Simply type

        gen balance = 100 * ((f/(f + m)) - (m/(f + m)))
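Here is a sketch using the village counts quoted above. The variable names fm_ratio and balance2 are made up for illustration; balance2 uses the algebraically equivalent form 100 * (f − m)/(f + m), which should agree with balance apart from rounding.

```stata
* Village example: counts of females (f) and males (m) by activity
clear
input str12 activity f m
"laundry"     21  0
"fetch water" 11  4
"cows"         0 14
end

* A ratio breaks down when the denominator is 0: Stata returns missing
generate fm_ratio = f / m

* The difference of percents is defined for every activity here
generate balance = 100 * ((f/(f + m)) - (m/(f + m)))

* Equivalent form with a single division
generate balance2 = 100 * (f - m) / (f + m)

list activity f m fm_ratio balance balance2
```

The listing should show balance of 100 for laundry and −100 for cows, with fetching water in between, while fm_ratio is missing for laundry because no men do it.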

References

Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley.
Wexler, S., J. Shaffer, and A. Cotgreave. 2017. The Big Book of Dashboards: Visualizing Your Data Using Real-World Business Scenarios. Hoboken, NJ: John Wiley.
Wilkinson, L. 2005. The Grammar of Graphics. 2nd ed. New York: Springer.
Zeisel, H. 1985. Say It with Figures. 6th ed. New York: Harper & Row.