The Stata News, Vol 38 No 4

In the spotlight: Creating color-coded twoway graphs

Do you want to identify age groups in your scatterplot by using different colors for the plotted points? Or identify income levels in your bar graph by using different bar colors?

When we create a twoway plot with numeric variables y and x, it is often useful to color-code the plot based on values of another variable. This allows us to see how the relationship between y and x differs for each level of the third variable. Before Stata 18, you could create these types of graphs with twoway contour or twoway contourline or by overlaying twoway plots for each level of the third variable. Now you can easily create these graphs with the colorvar options introduced in Stata 18. You can color-code a wide variety of twoway plots, including scatterplots, bar graphs, dot plots, dropped-line plots, connected plots, spike plots, and several range and paired-coordinate plots. The colorvar options allow you to customize the number of levels, the colors that are used to represent each level, and how they are presented in the legend.
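For comparison, the overlay approach mentioned above might look like the following sketch. The two-group split at a BMI of 25, the legend text, and the graph name overlay_bmi are illustrative assumptions, not taken from the article.

```stata
* Sketch of the pre-Stata 18 overlay approach: one scatter per BMI group.
* The cutoff of 25 and the legend labels are illustrative assumptions.
webuse nhanes2, clear

twoway (scatter height weight if bmi < 25)                    ///
       (scatter height weight if bmi >= 25 & !missing(bmi)),  ///
       legend(order(1 "BMI < 25" 2 "BMI >= 25"))              ///
       name(overlay_bmi, replace)
```

Each additional level needs its own plot and its own legend entry, which is the bookkeeping that colorvar() removes.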

How does it work?

Let’s plot some measures of health using data from the Second National Health and Nutrition Examination Survey (NHANES II) (McDowell et al. 1981). First, we’ll create a scatterplot of height and weight and color-code the points based on the body mass index (bmi):

. webuse nhanes2, clear

. scatter height weight, colorvar(bmi)

ex_1.svg

BMI is defined based on weight and height, and color-coding the points helps us visualize how BMI changes with the values of these variables. For any given height, we see how BMI increases with weight.

The range of values for bmi is divided into six equally spaced right-inclusive intervals: (10, 20], (20, 30], (30, 40], (40, 50], (50, 60], and everything greater than 60. Each interval is assigned a color, and each point is colored based on the interval it belongs to. A contour plot legend (clegend) is used, which means a z axis is used to display the values of bmi.
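To see how many observations land in each of these default intervals, one option is to reproduce the binning by hand. This is only a sketch: the cutpoints are copied from the intervals listed above, and bmi_bin is a hypothetical variable name.

```stata
* Count observations per default interval, using the cutpoints listed above.
* irecode() returns 0 for bmi <= 20, 1 for (20, 30], ..., 5 for bmi > 60.
webuse nhanes2, clear

summarize bmi
generate byte bmi_bin = irecode(bmi, 20, 30, 40, 50, 60)
tabulate bmi_bin, missing
```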

We have a lot of flexibility in how the intervals of BMI are formed and what colors are used. For example, we would like to categorize bmi based on the cutoffs provided by the World Health Organization (WHO):

Category       BMI (kg/m²)
Underweight    ≤ 18.4
Normal         18.5 – 24.9
Overweight     25.0 – 29.9
Obese I        30.0 – 34.9
Obese II       35.0 – 39.9
Obese III      ≥ 40.0

We can use the colorcuts() option to obtain this grouping; we'll simply specify the right endpoint for each interval, and these values will be used as the cutpoints. Note that the intervals will be right inclusive.

. scatter height weight, colorvar(bmi) colorcuts(18.4 24.9 29.9 34.9 39.9)

ex_2.svg

If you already have a categorical variable in your dataset, you can skip the step of specifying cutpoints and simply color-code your twoway graphs using those categories. For example, below we create the categorical variable bmicat and its value label bmicategory.

. generate bmicat = irecode(bmi, 18.4, 24.9, 29.9, 34.9, 39.9) + 1

. label define bmicategory 1 "Underweight (<18.5)"
     2 "Normal (18.5-24.9)" 3 "Overweight (25.0-29.9)" 
     4 "Obese I (30.0 - 34.9)" 5 "Obese II (35.0 - 39.9)"
     6 "Obese III (>=40.0)"

. label values bmicat bmicategory
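Before plotting, it can be worth confirming that the recoded categories line up with the intended cutoffs. The sketch below rebuilds bmicat from scratch so that it runs on its own; the checks themselves are an addition, not part of the original article.

```stata
* Rebuild bmicat and confirm that each category spans the intended BMI range.
webuse nhanes2, clear
generate bmicat = irecode(bmi, 18.4, 24.9, 29.9, 34.9, 39.9) + 1

* Minimum, maximum, and count of bmi within each category
tabstat bmi, by(bmicat) statistics(min max n)

* The endpoints are right inclusive: 18.4 itself belongs to category 1
assert bmicat == 1 if bmi <= 18.4
assert bmicat == 6 if bmi > 39.9 & !missing(bmi)
```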

We can now use this categorical variable to color-code our scatterplot and use the value label in our legend.

. scatter height weight, colorvar(bmicat)
     colordiscrete  coloruseplegend
     zlabel(, valuelabel)
     colorlist(gold%40 blue*0.5%40
     blue%40 orange%40
     red*0.5%40 red%40)

ex_3.svg

We are now working with discrete values (1, 2, 3, 4, 5, and 6) rather than intervals of BMI. Therefore, we specify the colordiscrete option so that a color is assigned to each distinct value.

Additionally, we use the coloruseplegend option to opt for a contour-line plot legend (plegend) instead of the default contour plot legend. The zlabel(, valuelabel) option specifies that the value label for bmicat be used to label the keys in the legend.

Finally, we use the colorlist() option to specify the list of colors to be used for the categories rather than using the default colors.
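The commands in this article are shown as typed at the prompt, with long commands wrapped for display. In a do-file, each wrapped line needs the /// continuation marker. A sketch of the scatterplot above in do-file form follows; the graph name bmi_scatter and the replace option on label define are additions made here so that the block can be rerun.

```stata
* Do-file version of the color-coded scatterplot.
* /// continues a command onto the next line.
webuse nhanes2, clear
generate bmicat = irecode(bmi, 18.4, 24.9, 29.9, 34.9, 39.9) + 1

label define bmicategory 1 "Underweight (<18.5)"              ///
    2 "Normal (18.5-24.9)" 3 "Overweight (25.0-29.9)"         ///
    4 "Obese I (30.0 - 34.9)" 5 "Obese II (35.0 - 39.9)"      ///
    6 "Obese III (>=40.0)", replace
label values bmicat bmicategory

scatter height weight, colorvar(bmicat)                       ///
    colordiscrete coloruseplegend                             ///
    zlabel(, valuelabel)                                      ///
    colorlist(gold%40 blue*0.5%40 blue%40                     ///
              orange%40 red*0.5%40 red%40)                    ///
    name(bmi_scatter, replace)
```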

The coloruseplegend option is particularly useful if you would like to reorder the legend keys. For example, below we reverse the order of the items in the legend:

. scatter height weight, colorvar(bmicat)     
     colordiscrete  coloruseplegend
     zlabel(, valuelabel)
     colorlist(gold%40 blue*0.5%40
     blue%40 orange%40
     red*0.5%40 red%40)
     plegend(order(6 5 4 3 2 1))

ex_4.svg

The coloruseplegend option also makes it easy to collapse categories in the legend. For example, below we use one color for “Obese I”, “Obese II”, and “Obese III” and a single legend key to represent all three levels.

. scatter height weight, colorvar(bmicat)
     colordiscrete  coloruseplegend
     zlabel(, valuelabel)
     colorlist(gold%40 blue*0.5%40
     blue%40 red%40
     red%40 red%40)
     plegend(order(4 "Obese (>=30.0)" 3 2 1))

ex_5.svg
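An alternative to collapsing the legend keys is to collapse the variable itself and let the legend follow. This is a sketch of a different route, not the method used above; bmicat4 is a hypothetical variable name.

```stata
* Alternative sketch: merge the three obese categories into one level with
* recode, then color-code by the four-level variable.
webuse nhanes2, clear
generate bmicat = irecode(bmi, 18.4, 24.9, 29.9, 34.9, 39.9) + 1

* recode with generate() also creates a value label named bmicat4
recode bmicat (1 = 1 "Underweight (<18.5)")                   ///
              (2 = 2 "Normal (18.5-24.9)")                    ///
              (3 = 3 "Overweight (25.0-29.9)")                ///
              (4/6 = 4 "Obese (>=30.0)"),                     ///
              generate(bmicat4)

scatter height weight, colorvar(bmicat4)                      ///
    colordiscrete coloruseplegend                             ///
    zlabel(, valuelabel)                                      ///
    colorlist(gold%40 blue*0.5%40 blue%40 red%40)
```

With this version, only four colors are needed in colorlist(), and the legend needs no relabeling.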

For the last example, we revisit Chuck Huber's Stata News article “Visualizing continuous-by-continuous interactions with margins and twoway contour”.

In this article, Chuck fit a logistic regression model for high blood pressure, highbp, with continuous covariates age and weight and their interaction. Then he used margins to estimate the predicted probability of hypertension for combinations of age and weight, with values of age ranging from 20 to 80 years in increments of 5 and values of weight ranging from 40 to 180 kilograms in increments of 5. Then he used twoway contour to plot the resulting predictions:

. svy: logistic highbp age weight c.age#c.weight
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata = 31                            Number of obs   =      10,351
Number of PSUs   = 62                            Population size = 117,157,513
                                                 Design df       =          31
                                                 F(3, 29)        =      418.97
                                                 Prob > F        =      0.0000

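------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds ratio   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   1.100678   .0088786    11.89   0.000     1.082718    1.118935
      weight |    1.07534   .0063892    12.23   0.000     1.062388     1.08845
             |
       c.age#|
    c.weight |   .9993975   .0001138    -5.29   0.000     .9991655    .9996296
             |
       _cons |   .0002925   .0001194   -19.94   0.000     .0001273    .0006724
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
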
. quietly margins, at(age=(20(5)80) weight=(40(5)180))
     vce(unconditional) saving(predictions, replace)

. use predictions, clear
(Created by command margins; also see char list)

. rename _at1 age

. rename _at2 weight

. rename _margin pr_highbp

. twoway contour pr_highbp weight age

ex_6.svg

Now with the colorvar options, we have another way to visualize these predictions. Below, we plot the predicted probabilities for each value of weight, using different colors to represent the values of age.

. scatter pr_highbp weight, colorvar(age)
     colorlist(blue*0.5 blue orange red)
     title("Probability of Hypertension by Weight and Age")

ex_7.svg

In fact, we can modify this further. Because we have predicted probabilities for only 13 values of age, we can specify that the z axis contain 13 labels, one for each value we specified with margins.

. scatter pr_highbp weight, colorvar(age)
     colordiscrete zlabel(#13)
     title("Probability of Hypertension by Weight and Age")

ex_8.svg
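The count of 13 age values follows directly from the numlist passed to margins. A quick sketch that expands both numlists and counts their elements (the local macro names are illustrative):

```stata
* Expand the numlists used in margins, at() and count their elements.
numlist "20(5)80"
local n_age : word count `r(numlist)'

numlist "40(5)180"
local n_weight : word count `r(numlist)'

* Prints: age values: 13, weight values: 29, grid points: 377
display "age values: `n_age', weight values: `n_weight', grid points: " ///
    `n_age' * `n_weight'
```

So the predictions dataset holds one row for each of the 377 age-weight combinations, and each of the 13 colors in the plot corresponds to one value of age.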

Beyond the examples shown here, the colorvar options also let you choose how many levels are created, whether the levels are distinguished by hue or by intensity, and which color represents missing values.

Reference

McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and Nutrition Examination Survey, 1976–1980. Vital and Health Statistics 1(15): 1–144.

You can learn more in the Graphics Reference Manual.

— Hua Peng
Executive Director, Software Engineering and Data Science

— Gabriela Ortiz
Senior Applied Econometrician
