Title | Creating variables that are “plurality” measures |

Author | Nicholas J. Cox, Durham University, UK |

When reporting ordered or graded scales, working with simple descriptive summaries like

% improved − % deteriorated

or

% ranking as good − % ranking as bad

is sometimes helpful. In such summaries, omitting any neutral or middle category is common (but not essential). Clearly, such a measure gives the preponderance of two tails: if everybody improved, we get 100, and, if everybody got worse, we get −100.

In political terms, an election could be imagined in which there are votes “for” and “against” from these two categories, and from that context, these measures may be described as plurality measures. (Is there a better general term, or any term that is standard in some field, for particular examples of such measures?) Whatever the terminology, such measures are discussed in Tukey (1977, 498–502), Zeisel (1985, 75–77), Wilkinson (2005, 57–58), and Wexler, Shaffer, and Cotgreave (2017, 186–200).

Naturally, the percent formulation is not compulsory, and you could just as easily (in fact, a little more easily) work with proportions or fractions, with results ranging from 1 to −1. In either case, a difference is the natural summary whenever you are thinking on the percent or proportion scale. By contrast, a ratio such as

% ranking as good / % ranking as bad

may be less desirable with small denominators: the result may be unstable or, if a denominator is ever 0, undefined (Stata returns missing).

Consider a three-grade coding, say, 1 = improved, 2 = unchanged, and 3 = deteriorated. To get this summary, we need to generate a score

(*) gen score = (code == 1) - (code == 3)

or

gen score = 100 * ((code == 1) - (code == 3))

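and that’s essentially it: the mean of **score** is the proportion (or percent) improved minus the proportion (or percent) deteriorated. We just follow the generation by
**summarize**; **tabulate, summarize( )**; **tabstat**; or whatever
we need.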
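A minimal sketch of that workflow, assuming a made-up dataset of 10 observations (5 improved, 3 unchanged, 2 deteriorated). The data are purely illustrative; the point is only to check that the mean of the score matches the difference in percents.

```stata
* Hypothetical data: 5 improved (1), 3 unchanged (2), 2 deteriorated (3)
clear
set obs 10
generate byte code = cond(_n <= 5, 1, cond(_n <= 8, 2, 3))

* Score each observation as 100, 0, or -100
generate score = 100 * ((code == 1) - (code == 3))

* The mean of score is % improved - % deteriorated = 50 - 20 = 30
summarize score
display r(mean)
```

With a grouping variable in the data, something like **tabstat score, by(group)** would give the same measure for each group; **group** here is a hypothetical variable name.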

Taking **(*)** piece by piece: if **code** is 1,

code == 1

is true and is evaluated as 1, and

code == 3

is false and evaluated as 0, so

(code == 1) - (code == 3)

evaluates as 1. The principle of true-or-false logical expressions being
evaluated as 1 or 0 is discussed at [U] **13.2.3 Relational operators**.
If **code** is 2, then **score** is 0, and if **code** is 3, then
**score** is −1. (If we multiply by 100, then **score** is 100,
0, and −100.)

With this coding, we could also get the same result by

generate score = 2 - code
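As a quick check, the two formulations can be compared directly. This is a small sketch on a made-up dataset with one observation per code; **score1** and **score2** are names chosen here for illustration.

```stata
* Hypothetical data: one observation for each code, 1 to 3
clear
set obs 3
generate byte code = _n

* Logical-expression version and linear-transformation version
generate score1 = (code == 1) - (code == 3)
generate score2 = 2 - code

* assert exits with an error if the two scores ever differ
assert score1 == score2
list code score1 score2
```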

If the coding had been reversed, from 1 = deteriorated to 3 = improved, then
**code - 2** would have worked. For other simple coding sequences, some
other linear transformation would have worked. So, why place so much stress
on the earlier formulation? It generalizes much more
easily to messier examples. Take a five-point scale such as **rep78** in the
auto data or 1 = strongly agree to 5 = strongly disagree. We might decide to
omit the 3s, lump together two codes in each tail, and

gen score = (code >= 4) - (code <= 2)

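Just as before, the true-or-false expressions evaluate as 1 if true and 0 if false. Here codes 4 and 5 count positively and codes 1 and 2 negatively; swap the two expressions if low codes are the favorable end, as with 1 = strongly agree.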

A pitfall to be pointed out immediately is that missing values count as higher than any other numeric value, so a missing **code** would satisfy **code >= 4** and be scored as 1. Hence, you will be safer with

gen score = (code >= 4) - (code <= 2) if code < .
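The pitfall can be seen with **rep78** in the auto data, which has some missing values. The sketch below assumes the auto dataset shipped with Stata is available through **sysuse**; **naive** is a name introduced here only for the unguarded version.

```stata
* rep78 in the auto data has some missing values
sysuse auto, clear

* Unsafe: missing rep78 satisfies rep78 >= 4, so it is scored as 1
generate naive = (rep78 >= 4) - (rep78 <= 2)

* Safer: score is left missing wherever rep78 is missing
generate score = (rep78 >= 4) - (rep78 <= 2) if rep78 < .

* Compare the two scores for observations with rep78 missing
tabulate naive score if missing(rep78), missing

* The means differ because naive counts missing values as "good"
summarize naive score
```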

Similar ideas may be useful in situations with just two categories. Also, they may arise with different data structures. Let us illustrate both points with the idea of looking at gender roles across a set of activities, and

% who are female − % who are male

as a way of summarizing data on who does what. If, in a village, 21 women and
zero men do laundry, four men and 11 women fetch water, and 14 men and zero women
take care of cows, then neither the male–female ratio nor the
female–male ratio can be used throughout to summarize the balance of
the sexes. Whenever zero is a denominator, the ratio is undefined. Even
if no zeros are present, we should worry about sensitivity to small denominators. However, the
measure above can be calculated whenever at least one person does the activity. If the data come as three
variables, one for activity, **f** for females, and **m** for males,
then no logical expressions are needed. Simply type

gen balance = 100 * ((f/(f + m)) - (m/(f + m)))
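The village example can be worked through as a sketch, with the counts typed in directly from the text. The activity labels and the name **balance2** are chosen here for illustration; **balance2** is just the same measure simplified algebraically.

```stata
* Counts taken from the village example in the text
clear
input str8 activity f m
"laundry" 21  0
"water"   11  4
"cows"     0 14
end

generate balance = 100 * ((f/(f + m)) - (m/(f + m)))

* Algebraically the same measure, written more compactly
generate balance2 = 100 * (f - m) / (f + m)

* balance is 100 for laundry, about 46.7 for water, and -100 for cows
list activity f m balance balance2
```

If nobody at all did an activity, so that **f + m** is 0, both versions would return missing.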

References

- Tukey, J. W. 1977. *Exploratory Data Analysis*. Reading, MA: Addison–Wesley.

- Wexler, S., J. Shaffer, and A. Cotgreave. 2017. *The Big Book of Dashboards: Visualizing Your Data Using Real-World Business Scenarios*. Hoboken, NJ: John Wiley.

- Wilkinson, L. 2005. *The Grammar of Graphics*. 2nd ed. New York: Springer.

- Zeisel, H. 1985. *Say It with Figures*. 6th ed. New York: Harper & Row.