Sometimes, missing data are stored as impossible values such as "age = -99". We may recognize these as missing data, but we need to tell Stata to treat them as missing values.
Let's begin by opening, describing, and summarizing an example dataset from the Stata website.
. use https://www.stata.com/users/youtube/rawdata.dta, clear
(Fictitious data based on the National Health and Nutrition Examination Survey)

. describe

Contains data from https://www.stata.com/users/youtube/rawdata.dta
 Observations:         1,268   Fictitious data based on the National Health and Nutrition Examination Survey
    Variables:            10   6 Jul 2016 11:17
                               (_dta has notes)

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------
id              str6    %9s                   Identification Number
age             byte    %9.0g
sex             byte    %9.0g                 Sex
race            str5    %9s                   Race
height          float   %9.0g                 height (cm)
weight          float   %9.0g                 weight (kg)
sbp             int     %9.0g                 Systolic blood pressure (mm/Hg)
dbp             int     %9.0g                 Diastolic blood pressure (mm/Hg)
chol            str3    %9s                   serum cholesterol (mg/dL)
dob             str18   %18s
. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |          0
         age |      1,268    48.44795    16.97613         20         74
         sex |      1,268     .466877    .4990985          0          1
        race |          0
      height |      1,268    167.1711    9.607279        144        193
      weight |      1,268    72.17593    16.27347      39.12     175.88
         sbp |      1,268    131.1554    29.43287         65        720
         dbp |      1,268    80.26104    15.69713        -99        150
        chol |          0
         dob |          0
The description of the variable dbp tells us that it contains data for diastolic blood pressure. The summary for dbp tells us that the minimum value for dbp is -99.
Let's list the first five observations of dbp to investigate further.
. list dbp in 1/5

     +-----+
     | dbp |
     |-----|
  1. | -99 |
  2. | -99 |
  3. |  76 |
  4. |  94 |
  5. |  74 |
     +-----+
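Before recoding anything, it can help to see how widespread the suspicious code is. The following is a minimal sketch, assuming the example dataset from the Stata website used above; it reloads the data so that it runs on its own.

```stata
* Reload the example dataset so this block is self-contained
use https://www.stata.com/users/youtube/rawdata.dta, clear

* How many observations carry the suspected missing-value code?
count if dbp == -99

* Tabulate any negative values, in case more than one code is in use
tabulate dbp if dbp < 0
```

If the count matches what the data documentation leads you to expect, recoding the value is a reasonable next step.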
A dbp of -99 is biologically impossible, and perhaps we know from the data documentation that "-99" represents a missing value. We can tell Stata to treat these observations as missing data using mvdecode.
. mvdecode dbp, mv(-99)
         dbp: 2 missing values generated
Note that you can include multiple variables after mvdecode if -99 represents missing values for multiple variables. Let's list the first five observations and summarize dbp to check our work.
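. list dbp in 1/5

     +-----+
     | dbp |
     |-----|
  1. |   . |
  2. |   . |
  3. |  76 |
  4. |  94 |
  5. |  74 |
     +-----+

. summarize dbp

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         dbp |      1,266    80.54423    13.99656         35        150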
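The earlier note about passing several variables to mvdecode can be sketched as follows. In the example dataset only dbp contains -99, so this sketch uses a small toy dataset whose values are hypothetical and chosen only for illustration. The preserve and restore commands keep the data currently in memory intact.

```stata
* Set aside the data currently in memory
preserve

* Toy data: hypothetical values, not taken from the example dataset
clear
input sbp dbp
120 -99
-99  80
130  85
end

* Recode -99 to system missing (.) in both variables at once
mvdecode sbp dbp, mv(-99)
list

* Bring back the original data
restore
```

mvdecode reports the number of missing values generated separately for each variable, which makes it easy to check the result against your expectations.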
The first two observations for dbp were "-99", and now they are displayed as ., which is known as a "system missing value". Stata treats . as larger than any nonmissing number, so it can be thought of as "positive infinity". You can also specify other kinds of missing values (.a through .z), and you can learn more by typing help missing.
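As an illustration of the other kinds of missing values, the mv() option of mvdecode can map a code to an extended missing value such as .a instead of the system missing value. This is a minimal sketch, assuming the same example dataset; it works on a fresh copy of the data inside preserve and restore so that the data in memory are left unchanged.

```stata
* Work on a fresh copy of the example data
preserve
use https://www.stata.com/users/youtube/rawdata.dta, clear

* Recode -99 to the extended missing value .a rather than .
mvdecode dbp, mv(-99=.a)
list dbp in 1/5

* Missing values sort above all numbers, and . < .a < .b < ... < .z
count if dbp == .a        // observations recoded to .a
count if missing(dbp)     // missing() is true for ., .a, ..., .z

* Bring back the original data
restore
```

Using distinct codes such as .a and .b lets you record why a value is missing (for example, "refused" versus "not measured") while Stata still excludes all of them from calculations.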
There is a function named missing() that can help you identify missing values in a variable. For example, you could type list dbp if missing(dbp) to see a list of all missing values of dbp.
. list dbp if missing(dbp)

     +-----+
     | dbp |
     |-----|
  1. |   . |
  2. |   . |
     +-----+
You can also type !missing() to identify observations that are not missing values. For example, we could list the nonmissing values of dbp among the first five observations.
. list dbp in 1/5 if !missing(dbp)

     +-----+
     | dbp |
     |-----|
  3. |  76 |
  4. |  94 |
  5. |  74 |
     +-----+
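The missing() function also accepts several arguments and returns 1 if any of them is missing; for string variables, the empty string "" counts as missing. The sketch below assumes the example dataset with dbp recoded by mvdecode as above. The variable name dbp_missing is hypothetical and introduced only for illustration.

```stata
* Work on a fresh copy of the example data
preserve
use https://www.stata.com/users/youtube/rawdata.dta, clear
mvdecode dbp, mv(-99)

* Observations where either blood pressure reading is missing
count if missing(sbp, dbp)

* String variables: missing() is true when the value is ""
count if missing(chol)

* Flag missing readings (dbp_missing is a hypothetical variable name)
generate byte dbp_missing = missing(dbp)
tabulate dbp_missing

* Bring back the original data
restore
```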
You will need to be careful to exclude missing values when using conditions that include "greater than". For example, our results will differ if we count the number of observations where dbp is greater than 100 and forget to exclude missing values. Stata treats missing values as larger than any number, like "infinity", so they satisfy the condition dbp > 100. Because dbp includes two missing values, the first count below is too large by two.
. count if dbp > 100
  88

. count if dbp > 100 & !missing(dbp)
  86
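Because every missing value is larger than any number, and . is the smallest of the missing values, the condition dbp < . is a common alternative to !missing(dbp) for numeric variables. A minimal sketch, assuming the same example dataset and recoding as above:

```stata
* Work on a fresh copy of the example data
preserve
use https://www.stata.com/users/youtube/rawdata.dta, clear
mvdecode dbp, mv(-99)

* dbp < . is false for ., .a, ..., .z, so missing values are excluded
count if dbp > 100 & dbp < .

* Bring back the original data
restore
```

Both forms exclude the same observations here; missing() has the advantage of also working with string variables.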
You can watch a demonstration of these commands by clicking on the link to the YouTube video below. You can read more about these commands by clicking on the links to the Stata manual entries below.
Watch Data management: How to convert missing value codes to missing values.
Read more in the Stata Data Management Reference Manual; see [D] describe, [D] mvencode, and [D] save. In the Stata Base Reference Manual, see [R] summarize.