How to convert numbers to missing values

Sometimes, missing data are stored as impossible values such as "age = -99". We may recognize these as missing data, but we need to tell Stata to treat them as missing values.

Let's begin by opening, describing, and summarizing an example dataset from the Stata website.

. use https://www.stata.com/users/youtube/rawdata.dta, clear
(Fictitious data based on the National Health and Nutrition Examination Survey)

. describe

Contains data from https://www.stata.com/users/youtube/rawdata.dta
 Observations:         1,268                  Fictitious data based on the
                                                National Health and Nutrition
                                                Examination Survey
    Variables:            10                  6 Jul 2016 11:17
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
id              str6    %9s                   Identification Number
age             byte    %9.0g
sex             byte    %9.0g                 Sex
race            str5    %9s                   Race
height          float   %9.0g                 height (cm)
weight          float   %9.0g                 weight (kg)
sbp             int     %9.0g                 Systolic blood pressure (mm/Hg)
dbp             int     %9.0g                 Diastolic blood pressure (mm/Hg)
chol            str3    %9s                   serum cholesterol (mg/dL)
dob             str18   %18s
-------------------------------------------------------------------------------
Sorted by: id

. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |          0
         age |      1,268    48.44795    16.97613         20         74
         sex |      1,268     .466877    .4990985          0          1
        race |          0
      height |      1,268    167.1711    9.607279        144        193
      weight |      1,268    72.17593    16.27347      39.12     175.88
         sbp |      1,268    131.1554    29.43287         65        720
         dbp |      1,268    80.26104    15.69713        -99        150
        chol |          0
         dob |          0

The description of the variable dbp tells us that it contains data for diastolic blood pressure. The summary for dbp tells us that the minimum value for dbp is -99.

Let's list the first five observations of dbp to investigate further.

. list dbp in 1/5

dbp
1. -99
2. -99
3. 76
4. 94
5. 74

A dbp of -99 is biologically impossible, and perhaps we know from the data documentation that "-99" represents a missing value. We can tell Stata to treat these observations as missing data using mvdecode.

. mvdecode dbp, mv(-99)
         dbp: 2 missing values generated
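
Before recoding, it can help to count how many observations carry the placeholder value and to confirm afterward that none remain. The following is a minimal sketch on a tiny made-up dataset: the five values are copied from the listing above, but the dataset itself is a hypothetical stand-in for rawdata.dta. The commented-out replace line is an equivalent manual approach shown for comparison, not a step from the original walkthrough.

```stata
* Toy data: the first five dbp values from the listing above
* (hypothetical stand-in for rawdata.dta)
clear
input dbp
-99
-99
76
94
74
end

* How many observations carry the placeholder value?
count if dbp == -99

* Recode the placeholder to the system missing value
mvdecode dbp, mv(-99)

* Equivalent manual alternative to mvdecode:
* replace dbp = . if dbp == -99

* Check the result: no -99 values remain, and two values are now missing
count if dbp == -99
count if missing(dbp)
```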

Note that you can include multiple variables after mvdecode if -99 represents missing values for multiple variables. Let's list the first five observations and summarize dbp to check our work.
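
As a sketch of the multiple-variable case, the example below applies mvdecode to two variables at once and then maps a second placeholder to an extended missing value. The toy dataset and the code -98 are assumptions made up for illustration; they are not part of rawdata.dta.

```stata
* Toy data (hypothetical): -99 and -98 are both placeholder codes
clear
input sbp dbp
-99 -99
120 -98
130 76
end

* One placeholder shared by several variables
mvdecode sbp dbp, mv(-99)

* A second placeholder mapped to the extended missing value .a
mvdecode dbp, mv(-98=.a)

list
```

Using a distinct code such as .a keeps track of why a value is missing while still excluding it from calculations.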

. list dbp in 1/5

dbp
1. .
2. .
3. 76
4. 94
5. 74

. summarize dbp

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         dbp |      1,266    80.54423    13.99656         35        150

The first two observations for dbp were "-99", and now they are displayed as ., which is known as a "system missing value". Stata treats . as larger than any nonmissing number, so it can be thought of as "positive infinity". You can also specify extended missing values, .a through .z, to distinguish different kinds of missing data, and you can learn more by typing help missing.
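
The ordering of missing values can be checked directly with display. This is a small sketch, not part of the original walkthrough; each expression evaluates to 1 (true).

```stata
* Any nonmissing number, however large, is less than system missing
display 1e300 < .

* System missing is less than the extended missing values .a through .z
display . < .a
display .a < .z

* missing() is true for system and extended missing values alike
display missing(.)
display missing(.z)
```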

There is a function named missing() that can help you identify missing values in a variable. For example, you could type list dbp if missing(dbp) to see a list of all missing values of dbp.

. list dbp if missing(dbp)

dbp
1. .
2. .
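
The missing() function also accepts several arguments and returns true if any of them is missing, which is handy for finding incomplete observations. The sketch below uses a small made-up dataset (an assumption for illustration, not rawdata.dta).

```stata
* Toy data (hypothetical): each variable has one missing value
clear
input sbp dbp
120 .
.   76
130 94
end

* Observations where dbp is missing
count if missing(dbp)

* Observations where sbp or dbp (or both) is missing
count if missing(sbp, dbp)

* Observations with both variables present
list if !missing(sbp, dbp)
```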

You can also type !missing() to identify observations that are not missing. For example, we could list the nonmissing values of dbp among the first five observations.

. list dbp in 1/5 if !missing(dbp)

dbp
3. 76
4. 94
5. 74

You will need to be careful to exclude missing values when using conditions that include "greater than". For example, our results will differ if we count the number of observations where dbp is greater than 100 and forget to exclude missing values. Stata treats missing values as larger than any number, like "infinity", so they satisfy dbp > 100. Because dbp includes two missing values, the first count below is too large by two.

. count if dbp > 100
  88

. count if dbp > 100 & !missing(dbp)
  86
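
The same care applies when creating new variables from a "greater than" condition. The sketch below, on a made-up five-observation dataset (an assumption for illustration), shows the equivalent dbp < . idiom and an indicator variable that stays missing where dbp is missing. The variable name high is hypothetical.

```stata
* Toy data (hypothetical): two missing values and two values above 100
clear
input dbp
.
.
76
104
110
end

* Counts the two missing values as well as the two real values above 100
count if dbp > 100

* dbp < . excludes system and extended missing values, like !missing(dbp)
count if dbp > 100 & dbp < .

* Indicator that is 1 or 0 for observed values and missing otherwise
generate byte high = dbp > 100 if !missing(dbp)
tabulate high, missing
```

Without the if !missing(dbp) qualifier, the two missing observations would be coded as 1 in high.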

You can watch a demonstration of these commands by clicking on the link to the YouTube video below. You can read more about these commands by clicking on the links to the Stata manual entries below.