Sometimes, we wish to create categorical variables from continuous variables. For example, we may wish to create a variable for high versus low systolic blood pressure from a continuous variable for systolic blood pressure.
Let's begin by opening and describing an example dataset from the Stata website.
. use https://www.stata.com/users/youtube/rawdata.dta, clear
(Fictitious data based on the National Health and Nutrition Examination Survey)

. describe

Contains data from https://www.stata.com/users/youtube/rawdata.dta
 Observations:         1,268                  Fictitious data based on the National Health and Nutrition Examination Survey
    Variables:            10                  6 Jul 2016 11:17
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
id              str6    %9s                   Identification Number
age             byte    %9.0g
sex             byte    %9.0g                 Sex
race            str5    %9s                   Race
height          float   %9.0g                 height (cm)
weight          float   %9.0g                 weight (kg)
sbp             int     %9.0g                 Systolic blood pressure (mm/Hg)
dbp             int     %9.0g                 Diastolic blood pressure (mm/Hg)
chol            str3    %9s                   serum cholesterol (mg/dL)
dob             str18   %18s
-------------------------------------------------------------------------------
The description tells us that the variable sbp contains systolic blood pressure measured in millimeters of mercury (mmHg). Let's summarize sbp.
. summarize sbp

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         sbp |      1,268    131.1554    29.43287         65        720
The minimum value of sbp is 65 and the maximum is 720. sbp is a continuous variable, but sometimes researchers dichotomize systolic blood pressure into the categories "less than or equal to 120 mmHg" and "greater than 120 mmHg". We can create this dichotomous variable using Stata's recode command.
. recode sbp (min/120 = 0) (120/max = 1), gen(hisbp)
(1,268 differences between sbp and hisbp)
The recode command creates a new variable named hisbp from sbp. It maps values of sbp from the minimum through 120 to category "0" in the variable hisbp, and it maps values of sbp greater than 120 through the maximum to category "1". The value 120 appears in both rules; recode applies the first rule that matches, so an sbp of exactly 120 is assigned to category "0". Let's use Stata's summarize command with bysort to check our work. You can type help bysort if you are not familiar with the bysort command.
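. bysort hisbp: summarize sbp

-------------------------------------------------------------------------------
-> hisbp = 0

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         sbp |        509    109.0216    9.226669         65        120

-------------------------------------------------------------------------------
-> hisbp = 1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         sbp |        759    145.9987    29.00644        122        720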
The minimum and maximum values of sbp are 65 and 120, respectively, for category "0" of hisbp. And the minimum and maximum values of sbp are 122 and 720, respectively, for category "1" of hisbp.
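As a further cross-check, the same split can be sketched with generate and a logical expression. This is only an illustration, not part of the original example: the variable name hisbp_alt is hypothetical, and the if !missing(sbp) qualifier is included because Stata treats missing values as larger than any number, so sbp > 120 on its own would code a missing reading as 1.

```stata
* Assumes rawdata.dta is in memory and hisbp was created by recode as above.
* hisbp_alt is a hypothetical name used only for this illustration.
generate byte hisbp_alt = sbp > 120 if !missing(sbp)

* assert exits with an error if the two variables differ in any observation
assert hisbp_alt == hisbp
```

If assert returns silently, the two variables agree in every observation, including those where sbp is exactly 120.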
We could have labeled the categories of hisbp in our recode command. Let's use recode again to create a new variable named hisbp2. This time type labels with double quotes inside the parentheses after each category number. Then type label list to view the label definition.
. recode sbp (min/120 = 0 "<=120") (120/max = 1 ">120"), gen(hisbp2)
(1,268 differences between sbp and hisbp2)

. label list
hisbp2:
           0 <=120
           1 >120
Category "0" is labeled "<=120" and category "1" is labeled ">120". Note that recode created the label definition hisbp2 and attached it to the variable hisbp2. It has done the work of label define and label values for us (type help label if you are not familiar with the label commands).
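For comparison, here is a minimal sketch of the manual steps that recode saved us. The names hisbp_manual and hisbp_lbl are hypothetical and chosen only for this illustration; the sketch labels a copy of hisbp so that hisbp itself stays unlabeled.

```stata
* Assumes hisbp (the unlabeled 0/1 variable created earlier) is in memory.
* hisbp_manual and hisbp_lbl are hypothetical names for this illustration.
generate byte hisbp_manual = hisbp

* Step 1: define a value label that maps each category number to text
label define hisbp_lbl 0 "<=120" 1 ">120"

* Step 2: attach that value label to the variable
label values hisbp_manual hisbp_lbl

* Show only the label definition we just created
label list hisbp_lbl
```

Doing both steps inside recode is shorter and keeps the labels next to the rules that define the categories.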
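Let's list observations 507 through 511 to see the labels in our dataset. Because bysort sorted the data by hisbp and category "0" contains 509 observations, this range spans the boundary between the two categories.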
. list sbp hisbp hisbp2 in 507/511

     +----------------------+
     | sbp   hisbp   hisbp2 |
     |----------------------|
507. | 106       0    <=120 |
508. |  98       0    <=120 |
509. | 100       0    <=120 |
510. | 146       1     >120 |
511. | 128       1     >120 |
     +----------------------+
We used recode to create two categories of sbp, but we could have created three or more. Here's a quick example that uses triple slashes (///) to continue the command across multiple lines, as you would in a do-file.

. recode sbp (min/100 = 1 "<=100")    ///
             (100/120 = 2 "100-120")  ///
             (120/max = 3 ">120")     ///
             , gen(hisbp3)
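To check the three-category variable the same way we checked hisbp, one option is tabstat, which reports the chosen statistics for each category in a single table. This is a sketch that assumes hisbp3 was created by the command above.

```stata
* Assumes hisbp3 was created by the recode command above.
* One row per category with the count, minimum, and maximum of sbp
tabstat sbp, by(hisbp3) statistics(n min max)

* Frequency of each labeled category
tabulate hisbp3
```

The minimum and maximum in each row should fall inside the range named by that category's label; if they do not, the recode rules and the labels disagree.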
You can watch a demonstration of these commands by clicking on the link to the YouTube video below. You can read more about these commands by clicking on the links to the Stata manual entries below.
Watch Data management: How to create a categorical variable from a continuous variable.
Read more in the Stata Data Management Reference Manual; see [D] by, [D] describe, [D] label, [D] list, and [D] recode. In the Stata Programming Reference Manual, see [P] comments. And in the Stata Base Reference Manual, see [R] summarize.