Title: Dealing with multiple responses
Authors: Nicholas J. Cox, Durham University, UK; Ulrich Kohler, University of Mannheim, Germany
Multiple responses—in the sense used here—are defined by a degree of open-endedness. In particular, a question in a survey may receive zero or more positive answers depending on the characteristics or behavior of the respondent. For example, respondents might be asked: Have you experienced any of the following symptoms or received information on a subject from any of the following media? Do you ever drink tea, coffee, wine, beer, or water? Do you travel to work by foot, bicycle, motorcycle, car, bus, tram, train, boat, ski, skates, sledge, horse, camel, yak, ...? (This may seem like a simple question to you, but consider commuters who cycle or drive to catch a train and then end their journey to work with a walk.)
We are not here to discuss multivariate responses in general, nor repeated measures, nor panel data or longitudinal data, etc.
In statistical computing terms, such multiple responses may pose difficulties both for data structure and for data analysis. Most commonly, they are held as a set of variables, but sometimes it can be useful to hold them as a single variable. No structure is ideal for all purposes, and often you may want to convert from one structure to another. Similarly, you may want to look at results for individual variables or at results calculated from one or more of these variables. The subject is large and one FAQ cannot cover all possibilities. You may be able to add suggestions to those here, so that users may be advised of helpful tips or of pitfalls to be avoided. In particular, we would welcome literature references.
Reference is made below to various user-written programs on SSC or in the Stata Journal (SJ) or the Stata Technical Bulletin (STB). If you need explanation of the SJ or STB, look at [R] sj or help stb. When a version of Stata is specified, it indicates the earliest version on which that program will run.
As this FAQ is fairly long, and not all readers may want to read all the way through, some repetition is built in.
Let us look first at a relatively simple example. This crucially important question might appear in a questionnaire:
Which of the following software packages do you use for data analysis? | |
---|---|
1 | R |
2 | S-Plus |
3 | SAS |
4 | SPSS |
5 | Stata |
6 | others |
In this question, respondents are asked to mark the name of each package they use. Respondents may mark any number of packages. The number before each package name is used as a code in some coding schemes discussed below.
For many statistical analyses, the answers of the respondents are best coded as a set of indicator or dummy variables, something like this:
        q1_R  q1_SPlus  q1_SAS  q1_SPSS  q1_Stata  q1_others
    1.     1         0       0        0         0          1
    2.     1         1       0        0         1          0
    3.     0         0       0        0         1          0
    4.     0         0       1        0         0          0
    5.     0         0       1        0         0          1
That is, there should be a variable for each possible answer, with value 1 if a respondent uses a specific package and 0 otherwise. The first respondent in this example uses R and some other package; the second respondent uses R, S-Plus, and Stata; and so on. We use names for the variables that have a common prefix. This is a small detail, but it makes it easier to refer to the variables collectively using a wildcard, such as q1_*.
Data on multiple responses with this coding scheme can be used immediately for many analyses. For example, you might want to know how many respondents use Stata. Type
. count if q1_Stata == 1
or type
. tabulate q1_Stata
You might want to see the distribution of the number of packages used by the respondents. This is just the row sum of the variables, most easily calculated by egen.
    . egen npkg = rowtotal(q1_*)
    . tabulate npkg
You might want to know the distribution of users of software packages. One method is to summarize the variables and compare their means, but a better method is through tabstat.
. tabstat q1_*, s(sum) c(s)
The use of row sums and of variable sums across 1s and 0s underlines the value of holding data in indicator variable form.
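The commands in this subsection can be tried on the five example respondents. The following is only a minimal sketch: it types in the indicator data shown above with input and then repeats the row and column sums.

```stata
* Enter the example indicator data (five respondents, six packages)
clear
input byte(q1_R q1_SPlus q1_SAS q1_SPSS q1_Stata q1_others)
1 0 0 0 0 1
1 1 0 0 1 0
0 0 0 0 1 0
0 0 1 0 0 0
0 0 1 0 0 1
end

* Row sums: number of packages used by each respondent
egen npkg = rowtotal(q1_*)
tabulate npkg

* Column sums: number of users of each package
tabstat q1_*, statistics(sum) columns(statistics)
```

Because npkg does not start with q1_, the wildcard q1_* in the tabstat call still refers only to the six indicators.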
A common variant is that the question asks you to rank choices, say, from most common to least common use, or in some other way.
        q1_1  q1_2  q1_3  q1_4  q1_5  q1_6
    1.     1     6     0     0     0     0
    2.     5     2     1     0     0     0
    3.     5     0     0     0     0     0
    4.     3     0     0     0     0     0
    5.     6     3     0     0     0     0
Thus using the coding scheme indicated previously, person 1 uses R most and some other package next. There is more information recorded in this variant form, as the first data structure can be obtained from this one, but not conversely. Two common variations on this scheme are to use numeric missing rather than 0 and to use string variables including names rather than numeric codes.
This structure evidently makes it easy to focus on which package is most commonly used. It makes it difficult to focus on which packages are used at all, and so forth.
We mention here the possibility that researchers may encounter or produce tied ranks in some projects. How best to handle tied ranks is not considered here.
Yet another situation is that answers have been coded in the order in which the respondent mentioned them. Such data may look like ranked multiple responses, but their interpretation may or may not be similar. In some fields, it appears common to take the order in which responses were mentioned as a tacit indication of an underlying order. For example, suppose you were asked to state brands of some item you purchase or you know about. Marketing people could be interested in what springs most readily to your mind. Whether “order of mention” is tantamount to ranking is a substantive matter for you to consider.
Sometimes the answers to multiple responses are put into one string variable. Commonly, this is the concatenation of the codes of possible (positive) answers. For our example data, such a variable could look like the following (using numeric codes):
        spkg1
    1.  16
    2.  521
    3.  5
    4.  3
    5.  63
Or it could look like the following:
        spkg2
    1.  R others
    2.  Stata S-Plus R
    3.  Stata
    4.  SAS
    5.  SAS others
The variable spkg1 states for the first observation that this respondent uses the software packages 1 and 6, which means R (1) and some other package (6). This may look like a numeric variable, but it should be a string variable. In our experience, both producing such a variable from other variables and working with such a variable are much easier when it is a string variable than when it is an integer-valued numeric variable. In any case, as soon as the number of possibilities exceeds 10, you will need to punctuate to avoid ambiguity. Otherwise, someone mentioning symptoms 1 and 3 from a list would be treated the same as someone mentioning symptom 13: both would be represented by "13". Similarly, as in the example of spkg2, once nonnumeric characters are used then there is no reason not to include punctuation to make elements clearer, unless you are near the limiting size of string variables.
There are various issues that can arise in practice. If packages are being ranked, then "Stata others" has a different meaning from "others Stata" but not otherwise. In particular, with unranked data, be warned: values that to you are identical but nevertheless differ literally will be tabulated or counted separately. Similar comments apply to leading and trailing spaces, accidental misspellings, or inconsistencies in upper- and lowercase. In the latter situation, problems may be solved by working consistently, say, in lowercase with the aid of the lower() function. (See [D] functions.)
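As an illustration of the cleaning just described, here is a minimal sketch with made-up values that a reader would regard as the same answer. It assumes Stata 14 or later for strtrim() and stritrim(). It does not deal with misspellings or with differing word order.

```stata
* Hypothetical data: three literal variants of one intended answer
clear
input str20 spkg2
"Stata others"
" stata  others"
"STATA OTHERS"
end

* Trim both ends, collapse runs of internal spaces, and work in lowercase
generate clean = lower(stritrim(strtrim(spkg2)))

* The raw variable shows three categories; the cleaned one shows one
tabulate spkg2
tabulate clean
```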
This structure is particularly useful for showing combinations of choices, say, in tables of the composite variable. As the number of possible answers grows, the number of possible combinations also grows rapidly. Even setting aside the possibility of ranking, k choices mean 2^k possible combinations. However, this is a fact whatever the data structure.
An important detail here is whether the variable really is a string variable or (despite our general advice) a numeric variable. When tabulating a string variable, Stata will sort "12" before "2"; when tabulating a numeric variable, Stata will sort 2 before 12. The convention that is better for you will depend on your purpose. Thus, with a string representation, all choices with 1 as first character will be tabulated adjacently, whereas with a numeric representation all choices coded by one digit will be tabulated adjacently. Either could be useful.
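The difference in sort order is easy to check on a toy example. The sketch below uses three made-up codes and holds them both ways, with real() giving the numeric equivalent.

```stata
* Hypothetical codes held as a string variable
clear
input str2 code
"2"
"12"
"1"
end

* Numeric equivalent of the same codes
generate num = real(code)

* String order puts "12" between "1" and "2"
sort code
list code

* Numeric order puts 12 last
sort num
list num
```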
Another data structure holds all information in a single variable with repeated observations for each individual in the dataset. An example might be the following:
        id  q1
    1.   1  R
    2.   1  others
    3.   2  R
    4.   2  S-Plus
    5.   2  Stata
    6.   3  Stata
    7.   4  SAS
    8.   5  SAS
    9.   5  Stata
In the jargon associated especially with the reshape command, this example is of a long data structure.
The answers, here in q1, could be held in a string variable or in a numeric variable with value labels attached. To make full use of the information in such data, an identifier variable, here id, is essential. An identifier variable was not needed for any of our earlier examples. There is no requirement to show zero or missing responses; that is, to make explicit the fact that the person with id 1 does not use programs other than those mentioned. Thus this data structure is economical as a way of holding multiple response data, but it is correspondingly awkward as a way of holding other data on the same individuals. Suppose, for example, that we were also holding data on individuals’ age, sex, and field of study. This information would be best held repeated for each observation, which is inefficient (but otherwise not especially problematic).
Data on multiple responses in this structure can be used immediately for many analyses. For example, you might want to know how many respondents use Stata. If q1 is a string variable, type
. count if q1 == "Stata"
or if q1 is a numeric variable in which Stata is represented by 5, type
. count if q1 == 5
Data in this structure may be used easily for analyses of subsets defined by separate answers, either a particular subset or several subsets. The information yielded by count, and more, is available by typing
. tabulate q1
which shows the distribution of users of software packages.
You might want to see the distribution of the number of packages used by the respondents. This is just the number of observations for each individual (distinct id) for which q1 is not missing. If q1 is never missing, this is yielded by typing
. by id, sort: generate npkg = _N
Irrespective of whether q1 is ever missing, this is yielded by typing
. by id, sort: egen npkg = count(q1)
as count() counts how often its argument is not missing; see [D] egen.
However, if this were followed by
. tabulate npkg
the individual with id 1 would be shown twice, that with id 2 three times, and so on. We need a way of selecting each id just once. An egen function is dedicated to this task, tag(). This function tags just one observation in each group of identical values with value 1 and any other observations in the same group with value 0.
    . egen tag = tag(id)
    . tabulate npkg if tag
The idiom if tag as a contraction of if tag == 1 is always safe, as tag() never produces missing values. This device has many other uses whenever we wish to relate multiple response data to other data for each individual.
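Putting these steps together for the long example data gives something like the sketch below. To stay neutral about whether q1 is string or numeric, it counts nonmissing answers with total(!missing(q1)), which is our substitution for the count() call shown above and not a requirement.

```stata
* Enter the long example data
clear
input id str6 q1
1 "R"
1 "others"
2 "R"
2 "S-Plus"
2 "Stata"
3 "Stata"
4 "SAS"
5 "SAS"
5 "Stata"
end

* Number of nonmissing answers for each respondent
by id, sort: egen npkg = total(!missing(q1))

* Tag one observation per respondent, then tabulate respondents, not answers
egen tag = tag(id)
tabulate npkg if tag
```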
A final advantage of this structure is that it is also applicable to ranked multiple variables, given an extra variable holding ranks. It is then easy using, for example, generate, egen, tabulate, by:, and if to produce many basic analyses.
Despite some major advantages, this data structure is awkward for working with conditions specifying more than one answer. There are some ways to approach this, but they are not attractive. We can tag those who use both R and Stata in this way, illustrated by the case of string variables:
    . by id, sort: egen R_and_Stata = total(q1 == "R" | q1 == "Stata")
    . replace R_and_Stata = R_and_Stata == 2

One part of the argument of total(), that is, q1 == "R", will pick up any observation for which this is true. The other part of the argument, that is, q1 == "Stata", will pick up any observation for which this is true. Each observation satisfying either condition contributes 1 to the total, so if each condition is satisfied just once for an individual, the total is 2. Naturally, that sum is not affected by any number of results of 0 arising whenever both conditions are false. However, although we can make some progress with such questions and this data structure, other data structures are far superior whenever examining two or more answers simultaneously.
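The total() device relies on each answer appearing at most once per individual. A variation that does not depend on this, offered only as a sketch, takes the maximum of each condition within id and then combines the two results. The names uses_R and uses_Stata are our own.

```stata
* Enter the long example data
clear
input id str6 q1
1 "R"
1 "others"
2 "R"
2 "S-Plus"
2 "Stata"
3 "Stata"
4 "SAS"
5 "SAS"
5 "Stata"
end

* 1 for every observation of a respondent who mentions the package at all
by id, sort: egen byte uses_R = max(q1 == "R")
by id: egen byte uses_Stata = max(q1 == "Stata")

* Both conditions must hold; duplicated answers cannot inflate the result
generate byte R_and_Stata = uses_R & uses_Stata
list id q1 R_and_Stata, sepby(id)
```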
Missing values are likely to be common with multiple response data. Even if everybody answered the question—which is unusual in many surveys—everyone may not give the same number of responses. Even when asked to rank a fixed number of specified items, respondents often stop ranking when they are indifferent to items, perhaps through lack of experience or knowledge.
A related issue is the appropriate denominator in calculating proportions or percents. Again, there will almost always be a difference between “number of respondents” and “number of responses”. Either or both may be of substantive interest.
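The two denominators can be compared directly. This sketch uses the example indicator data and takes "respondents" to mean all observations, which is itself an assumption. You might prefer only those with at least one response.

```stata
* Enter the example indicator data
clear
input byte(q1_R q1_SPlus q1_SAS q1_SPSS q1_Stata q1_others)
1 0 0 0 0 1
1 1 0 0 1 0
0 0 0 0 1 0
0 0 1 0 0 0
0 0 1 0 0 1
end

* Total number of responses and number of respondents
egen npkg = rowtotal(q1_*)
quietly summarize npkg
local responses = r(sum)
local respondents = r(N)

* Stata: 2 of 5 respondents, but 2 of 9 responses
quietly summarize q1_Stata
display "Percent of respondents: " 100 * r(sum) / `respondents'
display "Percent of responses:   " 100 * r(sum) / `responses'
```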
Here we flag two pertinent details specific to Stata.
First, remember when working with integer variables that numeric missing counts as nonzero and therefore as true. For background, see the FAQ: “What is true and false in Stata?”. This can be especially important when trying to produce, or when working with, indicator variables for which the possible nonmissing values are just 1 and 0.
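A tiny made-up example shows the pitfall. The variable name uses_stata is hypothetical.

```stata
* Hypothetical indicator with one missing value
clear
input byte uses_stata
1
0
.
end

* Counts 2: numeric missing is nonzero and so is treated as true
count if uses_stata

* Counts 1: only the observation that is literally 1
count if uses_stata == 1

* Also counts 1: the shortcut made safe by excluding missings explicitly
count if uses_stata & !missing(uses_stata)
```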
Second, egen, anymatch() and egen, anycount() never return missing results. We say more on this in 3.6.1 Many-to-many mappings: using egen.
You can concatenate variables by adding them as string variables or as the string equivalent of numeric variables. A tool specifically for this purpose is egen, concat(). See [D] egen for a more detailed discussion and examples. For example, given
        q1_1  q1_2  q1_3  q1_4  q1_5  q1_6
    1.     1     6     0     0     0     0
    2.     5     2     1     0     0     0
    3.     5     0     0     0     0     0
    4.     3     0     0     0     0     0
    5.     6     3     0     0     0     0
you can type
. egen response = concat(q1_*)
without worrying about whether the variables are numeric or string, as egen, concat() automatically converts to string equivalent. You might want to remove the zeros that pad out the result
        response
    1.  160000
    2.  521000
    3.  500000
    4.  300000
    5.  630000
which is easy with one of Stata's built-in string functions:
. replace response = subinstr(response,"0","",.)
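Removing every "0" works here because all codes are single digits. With codes of 10 or more, it would also strip the zero from "10". One possible workaround, sketched on made-up data, is to separate the codes with the punct() option of egen, concat() and then delete only zeros that stand as whole words.

```stata
* Hypothetical ranked data in which codes run past 9
clear
input byte(q1_1 q1_2 q1_3 q1_4)
 1 10  0  0
12  2  1  0
 5  0  0  0
end

* Separate the codes with spaces so that 1 and 10 stay distinguishable
egen response = concat(q1_*), punct(" ")

* Pad with spaces, delete each " 0" (a zero that starts a word), then trim
replace response = strtrim(subinstr(" " + response + " ", " 0", "", .))
list response
```

The search string " 0" cannot match inside "10", because there the zero follows a digit and not a space.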
Given a structure of indicator variables
        q1_R  q1_SPlus  q1_SAS  q1_SPSS  q1_Stata  q1_others
    1.     1         0       0        0         0          1
    2.     1         1       0        0         1          0
    3.     0         0       0        0         1          0
    4.     0         0       1        0         0          0
    5.     0         0       1        0         0          1

you might prefer a concatenation that is more obviously interpretable than "100001", "110010", etc., namely one that yields values like "R others":

    . gen str1 q1 = ""
    . qui foreach p in R SPlus SAS SPSS Stata others {
    .     replace q1 = q1 + "`p' " if q1_`p' == 1
    . }
    . replace q1 = strtrim(q1)
For more detail on foreach, see foreach or a tutorial in Cox (2002).
First, let us suppose our data are
        id  q1_R  q1_SAS  q1_SPlus  q1_Stata  q1_others  sex
    1.   1     1       0         0         0          1  male
    2.   2     1       0         1         1          0  female
    3.   3     0       0         0         1          0  male
    4.   4     0       1         0         0          0  female
    5.   5     0       1         0         1          0  female
which is an example of what in reshape jargon is described as a wide data structure. The q1_* are numeric indicator variables. Later, we will comment on data in which ranks are given.
To convert this structure to a long data structure in which program choice is represented by a single variable, we need to use reshape. In addition to [D] reshape, also see the FAQ: "I am having problems with the reshape command. Can you give further guidance?".
The key to such reshape questions is to think in terms of a data matrix in which data are ordered by rows and columns, indexed conventionally in matrix algebra by i and j, respectively. The rows we have are defined by the distinct values of id and the columns we have are the variables q1_*. The variable names have in common a stub q1_, and they differ in the suffixes following the stub, R, SAS, etc. If the variable names do not have this stub plus suffix form, you will need to apply rename before you can apply reshape. For further discussion, see the FAQ just mentioned.
Our reshaping will be mapping the columns of the data matrix (variables q1_*) into one column, with other variables being rearranged to match. We specify the stub, and we also need to spell out that the data variable to be created will be string.
. reshape long q1_ , i(id) string
The result is
         id  _j      q1_  sex
     1.   1  R         1  male
     2.   1  SAS       0  male
     3.   1  SPlus     0  male
     4.   1  Stata     0  male
     5.   1  others    1  male
     6.   2  R         1  female
     7.   2  SAS       0  female
     8.   2  SPlus     1  female
     9.   2  Stata     1  female
    10.   2  others    0  female
    11.   3  R         0  male
    12.   3  SAS       0  male
    13.   3  SPlus     0  male
    14.   3  Stata     1  male
    15.   3  others    0  male
    16.   4  R         0  female
    17.   4  SAS       1  female
    18.   4  SPlus     0  female
    19.   4  Stata     0  female
    20.   4  others    0  female
    21.   5  R         0  female
    22.   5  SAS       1  female
    23.   5  SPlus     0  female
    24.   5  Stata     1  female
    25.   5  others    0  female
which is almost where we want to be. There is no point in being explicit about programs not used, so we
. drop if q1_ == 0
and follow by dropping that variable altogether and by using a more intuitive name:
    . drop q1_
    . rename _j q1
Here is the result:
        id  q1      sex
    1.   1  R       male
    2.   1  others  male
    3.   2  R       female
    4.   2  SPlus   female
    5.   2  Stata   female
    6.   3  Stata   male
    7.   4  SAS     female
    8.   5  SAS     female
    9.   5  Stata   female
As seen, we need not worry about variables such as sex that are constant within id. They will get carried along automatically.
We promised to look at data in which ranks were given.
        id  q1_1   q1_2    q1_3   sex
    1.   1  R      others         male
    2.   2  R      S-Plus  Stata  female
    3.   3  Stata                 male
    4.   4  SAS                   female
    5.   5  Stata  SAS            female
The data matrix we have has rows defined by the distinct values of id and columns, which are the variables q1_*. The new data structure will have a single variable indicating software rank, which can be done directly:
    . reshape long q1_ , i(id) j(rank)
    (output omitted)
    . list

         +-----------------------------+
         | id   rank      q1_      sex |
         |-----------------------------|
      1. |  1      1        R     male |
      2. |  1      2   others     male |
      3. |  1      3              male |
      4. |  2      1        R   female |
      5. |  2      2   S-Plus   female |
         |-----------------------------|
      6. |  2      3    Stata   female |
      7. |  3      1    Stata     male |
      8. |  3      2              male |
      9. |  3      3              male |
     10. |  4      1      SAS   female |
         |-----------------------------|
     11. |  4      2            female |
     12. |  4      3            female |
     13. |  5      1    Stata   female |
     14. |  5      2      SAS   female |
     15. |  5      3            female |
         +-----------------------------+
We do not need observations with missing q1, and we can clean up the variable name,
    . drop if missing(q1_)
    . rename q1_ q1
resulting in
         +-----------------------------+
         | id   rank       q1      sex |
         |-----------------------------|
      1. |  1      1        R     male |
      2. |  1      2   others     male |
      3. |  2      1        R   female |
      4. |  2      2   S-Plus   female |
      5. |  2      3    Stata   female |
         |-----------------------------|
      6. |  3      1    Stata     male |
      7. |  4      1      SAS   female |
      8. |  5      1    Stata   female |
      9. |  5      2      SAS   female |
         +-----------------------------+
This example was of a string variable. Any value labels attached to a numeric variable survive the reshape, so it appears immaterial whether q1 is string or numeric with labels. (In practice, it is a good idea to ensure that the numeric variables in the data matrix have the same value labels.)
Given a composite variable, with values such as "125" or "Stata R", how can it be converted to a set of indicator variables? One answer lies in the strpos() function, one of Stata's string functions, which we will document at some length, partly because it is often useful for other problems as well. We assume here that you are following our advice and holding the codes as a composite string variable. If not, then in the examples below, use, e.g., strpos(string(varname), "1") rather than strpos(varname, "1").
strpos() is used to find the position of one string within another. To find the position of the string "I" in the string "Where am I?", you can type
. display strpos("Where am I?", "I")
and Stata will return 10, meaning that the string "I" is found starting at the 10th position. What happens if you ask for the position of the string "you" in "Where am I?"? Since "you" is not included in the longer string, strpos() returns 0. More generally, a positive result from strpos() means that one string is included within another and a zero result means that it is not.
We can also feed to strpos() any expression that evaluates to a string, such as the name of a string variable, so that a new variable can be generated as follows:
. generate byte q1_1 = strpos(spkg1, "1") > 0
strpos(spkg1, "1") will return a positive number if "1" is included in a value of spkg1 and 0 otherwise. strpos(spkg1, "1") > 0 will in turn evaluate to 1 if true and to 0 if false, thus yielding an indicator variable. For background, see the FAQ "What is true and false in Stata?".
Note in passing the specification of a byte variable type, which is possible here because we know that the possible values are well within the limits for that data type; for more information, see data types. Using an economical data type for an indicator variable can be helpful whenever space is short.
We will want to generate similar variables for other answers. Doing this variable by variable can be avoided, for example, by using forvalues:
    . forvalues i = 1/6 {
    .     generate byte q1_`i' = strpos(spkg1, "`i'") > 0
    . }
For more detail on forvalues, see forvalues or a tutorial in Cox (2002). A further extension would be something like
    . forvalues i = 1/6 {
    .     capture assert strpos(spkg1, "`i'") == 0
    .     if _rc {
    .         generate q1_`i' = strpos(spkg1, "`i'") > 0
    .     }
    . }
What is going on here? Any statement tested by assert will yield a so-called return code that is zero if the statement is true for all observations examined and a return code that is nonzero (in fact, 9) if it is false. We test to see if any observations contain values other than zero before we generate a new variable. The capture ensures that everything continues smoothly, whatever the outcome.
In particular, in our dataset nobody uses SPSS, so, arguably, we could dispense with an indicator variable for that choice. When we get to
assert strpos(spkg1, "4") == 0
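this assertion will be true of all the data, so the return code from assert will be 0. As 0 counts as false, the condition if _rc is not satisfied and no variable is generated for SPSS. For every other package, the assertion is false for at least one observation, so the return code, which is accessible in _rc, will be nonzero and thus true, and the variable is generated. More generally, this approach will avoid creation of variables for any choices that were possible but happen to have been chosen by none of the sample.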
This approach will work well with choices coded by one-digit characters, numeric or otherwise. You need to be more careful, however, when the choices include say "1", "10", "11", as a search for the character "1" will then find it whenever it occurs as part of "10", "11", and so forth. Given space separation, as "1 10 11", one possibility is to search for " 1 " within the string expression " " + string_variable + " ". Another possibility is to split the variable into "words" and then work from the resulting variables. This possibility is explained in more detail in the next subsection. Typically easier, however, are unambiguous strings, as exemplified by
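The space-padding device can be sketched as follows. The variable symptoms and its values are made up for illustration. The point is that "1 3" and "13" must be told apart.

```stata
* Hypothetical space-separated codes, some with two digits
clear
input str10 symptoms
"1 3"
"13"
"10 11"
"1 10 11"
end

* Pad both ends with a space so that every code is surrounded by spaces,
* then search for the code together with its surrounding spaces
generate byte s1  = strpos(" " + symptoms + " ", " 1 ") > 0
generate byte s13 = strpos(" " + symptoms + " ", " 13 ") > 0
list
```

A naive strpos(symptoms, "1") > 0 would be 1 in all four observations here.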
    . foreach p in R S-Plus SAS SPSS Stata others {
    .     local P : subinstr local p "-" ""
    .     gen byte q1_`P' = strpos(spkg2, "`p'") > 0
    . }
which generates the variables q1_R, q1_SPlus and so forth, with values 1 and 0 just like in the example before. (Incidentally, for S-Plus we need to catch the hyphen, which may not appear as a character in a variable name.) Again this is all totally literal and thus dependent on consistent spelling, use of spaces, and use of upper- and lowercase. On that last point alone, we can be more broad-minded in this way,
    . foreach p in S-Plus SAS SPSS Stata others {
    .     local P : subinstr local p "-" ""
    .     gen byte q1_`P' = strpos(lower(spkg2), lower("`p'")) > 0
    . }
but we need a separate approach for R, given that "r" is evidently part of “others”.
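One possible separate approach for R, sketched here on the assumption that package names are separated by single spaces, is to search for "r" as a whole word by padding with spaces. The inconsistent capitalization in the data below is made up for illustration.

```stata
* Hypothetical data with inconsistent use of upper- and lowercase
clear
input str20 spkg2
"R others"
"stata s-plus r"
"Stata"
"SAS"
"SAS Others"
end

* " r " can match only a stand-alone r, not the r inside "others"
generate byte q1_R = strpos(" " + lower(spkg2) + " ", " r ") > 0
list
```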
Finally, you may catch choices never made, just as before. R is left out of this loop because, as just noted, a case-insensitive search for it needs a separate approach:

    . foreach p in S-Plus SAS SPSS Stata others {
    .     local P : subinstr local p "-" ""
    .     capture assert strpos(lower(spkg2), lower("`p'")) == 0
    .     if _rc {
    .         gen byte q1_`P' = strpos(lower(spkg2), lower("`p'")) > 0
    .     }
    . }
A composite string variable with values such as "125" or "43" can be split into individual str1 variables by a simple loop. You just need to find out the length of the composite, say, from describe. Suppose that you want to split a str7 variable:
    . forvalues i = 1/7 {
    .     gen str1 r`i' = substr(response,`i',1)
    . }
A composite string variable with values such as "Stata R" or “coffee,beer”, in which words or phrases or other elements are separated by some punctuation, say, a space or a comma, is best handled by another approach. In Stata 8 or later versions, this can be done with the split command. In Stata 7, you can use the predecessor of that command, split by Nicholas J. Cox from SSC. In Stata 6, you can use the predecessor of that command, strparse by Michael Blasnik and Nicholas J. Cox from SSC.
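As a sketch of the split route in Stata 8 or later, the following splits the example variable spkg2 on spaces (the default) and then builds one indicator from the resulting words. The stub word and the loop are our own illustration, not the only way to proceed.

```stata
* Enter the example composite variable
clear
input str20 spkg2
"R others"
"Stata S-Plus R"
"Stata"
"SAS"
"SAS others"
end

* One new string variable per word: word1, word2, word3
split spkg2, generate(word)
list

* Exact matching on whole words avoids finding "R" inside longer names
generate byte q1_R = 0
foreach v of varlist word* {
    quietly replace q1_R = 1 if `v' == "R"
}
```

For values separated by commas, split's parse() option would be needed in place of the default.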
First, let us suppose that our data are like
        id  q1      sex
    1.   1  R       male
    2.   1  others  male
    3.   2  R       female
    4.   2  S-Plus  female
    5.   2  Stata   female
    6.   3  Stata   male
    7.   4  SAS     female
    8.   5  SAS     female
    9.   5  Stata   female
which is an example of what was earlier described as a long data structure. To resolve an ambiguity, let us specify that q1 is a string variable. Later, we will comment on the case of a numeric variable with value labels. Finally, we will comment on data in which ranks are given.
To convert this structure to a wide data structure in which each distinct answer in q1 is represented by a single variable, we need to use reshape. In addition to [D] reshape, also see the FAQ: "I am having problems with the reshape command. Can you give further guidance?".
The key to such reshape questions is to think in terms of a data matrix in which data are ordered by rows and columns, indexed conventionally in matrix algebra by i and j, respectively. The rows we desire are defined by the distinct values of id and the columns we desire are defined by the distinct values of q1. Those values will be used as the suffixes of a set of variables. If q1 is a string variable, we immediately have a small problem: the "-" within S-Plus is not acceptable within a variable name. We could fix this by
. replace q1 = subinstr(q1,"-","_",.)
or in more difficult situations, we could encode a string variable into a numeric variable. In the matrix itself, we want indicator variables in which 1 represents yes and 0 no. All our observations at present are in effect instances of 1, but we need to make that explicit:
. gen byte one = 1
That creates a variable that is 1 in every observation. In most circumstances, such a variable would be pointless, but here it is essential. The variable is created as a byte variable to economize on storage. You can dispense with this detail if you have plenty of memory to spare.
Now we can reshape the data:
. reshape wide one, i(id) j(q1) string
We need not worry about variables such as sex that are constant within id. They will get carried along automatically. (If, contrary to assumption, they are not constant within id, then you will get an error message and no reshape, as something that should be true of your data is in fact false.) Here is the result of the reshape:
        id  oneR  oneSAS  oneS_Plus  oneStata  oneothers  sex
    1.   1     1       .          .         .          1  male
    2.   2     1       .          1         1          .  female
    3.   3     .       .          .         1          .  male
    4.   4     .       1          .         .          .  female
    5.   5     .       1          .         1          .  female
We are almost done, but, depending on taste, there may be some cleaning up to do. First, we have a stub for the new variables that may not be to our liking. One specific way to fix that is with rename:
. rename one* q1_*
Second, we may wish to change all the missings in q1_* to 0. Once again, a specific command can do this, mvencode:
. mvencode q1_*, mv(0)
We promised to comment on the case in which the argument of j(), here q1, is a numeric variable with value labels attached. The code is similar to the previous commands:
    . gen byte one = 1
    . reshape wide one, i(id) j(q1)
    . rename one* q1_*
    . mvencode q1_*, mv(0)
However, a side effect of reshape here is that the value labels associated with q1 get dropped. For this reason, using a string variable is attractive whenever practicable, bearing in mind that the values of the string variable are destined to be variable name suffixes; hence, only alphabetical, numeric, and underscore characters are allowed.
We also promised to look at data in which ranks were given, which is even easier.
        id  q1      sex     rank
    1.   1  R       male       1
    2.   1  others  male       2
    3.   2  R       female     1
    4.   2  S-Plus  female     2
    5.   2  Stata   female     3
    6.   3  Stata   male       1
    7.   4  SAS     female     1
    8.   5  SAS     female     2
    9.   5  Stata   female     1
The data matrix we seek has rows defined by the distinct values of id and columns defined by the distinct values of rank. In the matrix itself, we want variables indicating software, which can be done directly:
    . reshape wide q1, i(id) j(rank)
    . rename q1* q1_*
In this problem, any value labels attached to a numeric variable q1 do survive the reshape, so it appears immaterial whether q1 is string or numeric with labels.
The most common problem here seems to be the creation of indicator variables from variables indicating successive choices. One pertinent tool in official Stata for the case of integer codes held in numeric variables is egen, anycount(). The result can be thought of as the number of variables equal to any of the values specified. A sibling is egen, anymatch(). The result can be thought of as indicating whether any of the variables is equal to any of the values specified.
For example, given the ranked responses
        q1_1  q1_2  q1_3  q1_4  q1_5  q1_6
    1.     1     6     0     0     0     0
    2.     5     2     1     0     0     0
    3.     5     0     0     0     0     0
    4.     3     0     0     0     0     0
    5.     6     3     0     0     0     0
we can generate the corresponding variables:
    . forvalues i = 1/6 {
    .     egen Q1_`i' = anycount(q1_*), val(`i')
    . }
First, we loop over the possible answers (the values of the data), here the integers 1/6. More complicated sets of answers might be better handled using foreach. For each possible answer, in turn 1 2 3 4 5 6, we count how many of the variables—here the q1_*—are equal to any of the values specified—here just a single value in each case. We also use uppercase Q1 as a prefix for the new variables. Above all, appreciate that the new variables do not retain all the information in the originals, as we are ignoring the information on rank order.
Using anycount() rather than anymatch() is a small wrinkle. With this example, we expect that each package will be mentioned at most once, but counting with anycount() allows a data check. Any multiple count will show up as a value of 2 or more, and we can identify any respondent trying to subvert the questionnaire by repeatedly mentioning their favorite software—or, if it seems appropriate, treat that as a measure of strength of interest.
Naturally, if you prefer, you can use anymatch(). This function is guaranteed to produce an indicator variable with values 1 or 0.
As mentioned earlier, neither of these functions ever produces missing values as a result. This may be surprising, but it was intended as a feature, given what was seen as the most likely uses of the generated new variables and how they might appear within Stata commands. In particular, if all the variables supplied as arguments are missing in an observation, then the result of anymatch() or anycount() will be 0 for that observation. If you want to recode such 0s to numeric missing, here is one way to do it. We exploit the fact that the observation-wise (rowwise) maximum will be returned as missing by egen, rowmax() if and only if all values examined in an observation are missing.

    . egen rmax = rowmax(q1_*)
    . forvalues i = 1/6 {
    .     replace Q1_`i' = . if rmax == .
    . }
A crucial limitation is that both functions anymatch() and anycount() apply only to integer codes. With arbitrary string codes, say,
        q1_1    q1_2    q1_3  q1_4  q1_5  q1_6
    1.  R       others
    2.  Stata   S-Plus
    3.  Stata
    4.  SAS
    5.  others  SAS
we need to create our own numeric measures from first principles; for example,
    . foreach p in R S-Plus SAS SPSS Stata others {   /* loop over responses */
    .     local P : subinstr local p "-" ""
    .     gen byte q1_`P' = 0
    .     forval i = 1/6 {                             /* loop over existing variables */
    .         qui replace q1_`P' = q1_`P' + (strpos(q1_`i',"`p'") > 0)
    .     }
    . }

Here we need a double loop, one over possible responses, initializing a variable to 0, and one over existing variables, adding 1 each time we find the package name inside. Testing whether strpos() returns a positive result is here more general than testing for equality, as it guards against the possibility that leading and/or trailing spaces have somehow been added to the variable. Nothing is done here directly about consistency of case (we have already seen how to tackle that) or about catching misspellings.
The code example here uses addition to produce an analog of egen, anycount(). One way of producing an analog of egen, anymatch() is to use the or operator |, as 0 | 1 and 1 | 1 both yield 1. For context, see operators.
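A sketch of that anymatch() analog follows, using a cut-down version of the string data with only three response variables to keep the example short. The only change from the counting loop is that | replaces +.

```stata
* Hypothetical string responses in three variables
clear
input str8(q1_1 q1_2 q1_3)
"R"      "others" ""
"Stata"  "S-Plus" "R"
"Stata"  ""       ""
"SAS"    ""       ""
"others" "SAS"    ""
end

foreach p in R S-Plus SAS SPSS Stata others {
    * Strip the hyphen, which is not allowed in a variable name
    local P : subinstr local p "-" ""
    generate byte q1_`P' = 0
    forvalues i = 1/3 {
        * Logical or: the result stays 0 or 1 however often a name recurs
        quietly replace q1_`P' = q1_`P' | (strpos(q1_`i', "`p'") > 0)
    }
}
list q1_R-q1_others
```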
A program zb_qrm by Eric Zbinden (SSC; Stata 5) maps from a set of numeric variables with codes 1 upwards to a set of indicator variables for those codes. It also displays information on the occurrence pattern of indicators.
A program mrdum by Lee Sieswerda (SSC; Stata 7) is similar but is on the whole more general.
These programs differ over what is an appropriate denominator, all observations or all observations containing at least one response. As flagged previously, various choices may be sensible depending on the problem being tackled.
Tabulation itself is a large and complex subject. Our aim in this section is just to give some pointers to commands that may be of use.
Stata’s official commands do not give much support to multiple response variables, although we gave an example earlier of the application of tabstat. One general strategy is to use an egen function to calculate something, (possibly) egen, tag() to tag just one observation in each of several groups, and then list to show the results. Using collapse or contract followed by list is more drastic.
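As one illustration of the more drastic route, here is a minimal sketch that applies contract and list to the long example data. Note that contract replaces the data in memory, so you would normally preserve first or work on a copy.

```stata
* Enter the long example data
clear
input id str6 q1
1 "R"
1 "others"
2 "R"
2 "S-Plus"
2 "Stata"
3 "Stata"
4 "SAS"
5 "SAS"
5 "Stata"
end

* Reduce to one observation per distinct answer, with its frequency in _freq
contract q1

* Show the most frequently mentioned packages first
gsort -_freq q1
list, noobs
```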
Alternatively, user-written commands in this territory include
Lee Sieswerda made several helpful comments on a draft.