Commands for working with ICD codes


  • Designed for use with

    • The US National Center for Health Statistics (NCHS) ICD-10-CM diagnosis codes for healthcare encounter and claims data

    • The US Centers for Medicare and Medicaid Services (CMS) ICD-10-PCS procedure codes for healthcare claims data

    • The World Health Organization's ICD-10 codes for morbidity and mortality reporting

    • NCHS ICD-9-CM diagnosis codes for healthcare encounter and claims data

    • CMS ICD-9-CM procedure codes for healthcare claims data

  • Suite of data-management commands lets you

    • Easily generate new variables based on codes

      • Indicators for different conditions

      • Short descriptions

      • Category codes from billable codes

      • And more

    • Verify that a variable contains valid codes and flag invalid codes

    • Standardize the format of codes

  • Interactive utilities let you

    • Look up descriptions for codes

    • Search for codes from keywords

  • ICD-10 and ICD-10-CM/PCS commands let you indicate the version of the codes in your dataset

Information about diagnoses and procedures in administrative healthcare data is often encoded using one of the ICD coding systems. For example, the standard system for mortality reporting has been the World Health Organization's ICD-10 system since 1999. Since October 2015, the US has used ICD-10-CM to encode diagnoses and ICD-10-PCS to encode procedures.

When administrative data are gathered from multiple sources, the format of the codes may not be fully standardized. There may also be reporting errors. Finally, the sheer number of codes available in these encoding systems means that analyzing the data in a meaningful way is often impossible without summarizing information.

Stata has a suite of commands for working with ICD codes, known collectively as the icd commands. Whether you want to add text to codes or create indicator variables, want to verify that the codes in your data are valid, or are using the codes as a step in a larger project, the icd commands provide valuable tools for reporting and research.

Let's see it work

Suppose we are conducting a study of mortality in the United States in 2010. We have vital statistics data from the CDC that contain records on more than 2.4 million deaths.

. use female agerc cause place using vital10.dta, clear
(US mortality data, 2010 -- CDC Vital Statistics)

. describe 

Contains data from vital10.dta
  Observations:     2,472,542                  US mortality data, 2010 -- CDC Vital Statistics
     Variables:             4                  6 Apr 2015 10:22
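--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
female          float   %9.0g      female     Decedent is female, female=1, male=0
place           byte    %8.0g      pod        Place of death and status
cause           str4    %9s                   Cause of death (ICD-10 code)
agerc           float   %14.0g     agerc      Age, Census recode
--------------------------------------------------------------------------------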
Sorted by:

We want to identify all deaths that are due to respiratory illnesses. Any of 275 codes can currently be used to define a respiratory illness, far more than we would ever want to type! A plausible alternative is to use a lookup table, but definitions are often provided in terms of a range of codes, leaving you to type the codes at least once to create the lookup table anyway.

Because the CDC reported mortality using ICD-10 codes in 2010, we can use the icd10 commands to make our work easier.

We might want to start by verifying that all the codes in our data are valid and stored in the same format. By default, icd10 uses the 2019 version of the codes, so we need to specify the version that applies to our data.

We start by using icd10 check with version(2010).

. icd10 check cause, version(2010) 
(cause contains no missing values)

cause contains undefined codes:

    1.  Invalid placement of period                       0
    2.  Too many periods                                  0
    3.  Code too short                                    0
    4.  Code too long                                     0
    5.  Invalid 1st char (not A-Z)                        0
    6.  Invalid 2nd char (not 0-9)                        0
    7.  Invalid 3rd char (not 0-9)                        0
    8.  Invalid 4th char (not 0-9)                        0
   77.  Valid only for previous versions             15,177
   88.  Valid only for later versions                     0
   99.  Code not defined                                  0
        Total                                        15,177

We discover that more than 15,000 records use codes that were valid only in previous versions. Out of 2.4 million records, that isn't such a bad error rate, but could we do better with the 2009 codes? Let's specify version(2009) and add the summary option to list the codes with problems. We'll also use generate() to create a variable that indicates the type of problem that icd10 check finds.

. icd10 check cause, version(2009) generate(problem09) summary
(cause contains no missing values)

cause contains undefined codes:

    1.  Invalid placement of period                       0
    2.  Too many periods                                  0
    3.  Code too short                                    0
    4.  Code too long                                     0
    5.  Invalid 1st char (not A-Z)                        0
    6.  Invalid 2nd char (not 0-9)                        0
    7.  Invalid 3rd char (not 0-9)                        0
    8.  Invalid 4th char (not 0-9)                        0
   77.  Valid only for previous versions                  0
   88.  Valid only for later versions                 2,188
   99.  Code not defined                                  0
        Total                                         2,188

Summary of invalid and undefined codes

    cause      Count   Problem
    -----------------------------------------------
    A099        2135   Valid only for later versions
    R636          43   Valid only for later versions
    R263          10   Valid only for later versions
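The format checks (items 1 through 8) all report zero here, so the codes already look consistently stored. When codes come from several sources and differ in presentation, icd10 clean is the command in the suite for standardizing format. The following is a minimal sketch rather than a required step; the variable name cause_std is ours, chosen for illustration, and we assume the command's default formatting is acceptable.

```stata
* Sketch: write a standardized copy of the codes to a new variable,
* leaving the original cause variable untouched.
* cause_std is a hypothetical name, not a variable in vital10.dta.
icd10 clean cause, generate(cause_std)

* Rerun the validity checks on the standardized copy.
icd10 check cause_std, version(2009)
```

Writing to a new variable instead of replacing the original keeps the raw codes available if the standardized form needs to be audited later.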

We think all respiratory diagnoses fall in the range of J10 to J98.9, but we might want to look up these three codes quickly to set our minds at ease.

. icd10 lookup A09.9 R63.6 R26.3

    A09.9 Gastroenteritis and colitis of unspecified origin
    R26.3 Immobility
    R63.6 Insufficient intake of food and water due to self neglect 

We're fairly confident that we can ignore these codes. They don't apply to our study of respiratory illness. We'll use 2009 as our reference year for the rest of our analysis. To generate our respiratory cause of death indicator (resp), we type

. icd10 generate resp = cause, range(J10/J989)
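Before building on resp, a quick sanity check can confirm that the indicator behaves as expected. The sketch below uses only general-purpose Stata commands and assumes resp was created by the command above and that the codes are stored in uppercase, as they appear in the check summary.

```stata
* resp should be a 0/1 indicator wherever it is nonmissing.
assert inlist(resp, 0, 1) if !missing(resp)

* Number of deaths flagged as respiratory.
count if resp == 1

* Every flagged code should begin with J, the chapter letter used in the range.
* substr() extracts the first character of the code.
assert substr(cause, 1, 1) == "J" if resp == 1
```

If either assert fails, Stata stops with an error, which makes these lines convenient to leave in a do-file as guards.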

We may wish to further examine deaths from pneumonia. For example, if we want to add an indicator for a pneumonia cause of death only to those decedents that we already know have a respiratory diagnosis, we can type

. icd10 generate pneumonia = cause if resp==1, range(J12/J189)

. tabulate pneumonia

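  pneumonia |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    187,594       79.07       79.07
          1 |     49,660       20.93      100.00
------------+-----------------------------------
      Total |    237,254      100.00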

We see that about 21% of all deaths from respiratory illnesses in the US in 2010 were from pneumonia.
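To cross-check that the range J12 to J18.9 covers what we mean by pneumonia, one option is a keyword search with icd10 search, the search utility in the suite. This is a sketch; we assume version(2009) is accepted here in the same way it is by icd10 check.

```stata
* List codes whose description mentions "pneumonia",
* using the 2009 code set (assumed option, see lead-in).
icd10 search pneumonia, version(2009)
```

Codes outside J12 to J18.9 may appear in the results. Whether they belong in the definition is a decision for the study, not something the search settles.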

Now suppose we were giving a presentation and wanted to show a graph of common pneumonia diagnoses among decedents with a pneumonia-related cause of death. For this, we'll want to combine the icd10 generate command with a few other data management commands and Stata's graph hbar command.

First let's speed things up a bit by using contract to make a dataset of frequencies. Because we only want to graph the relative frequency of each pneumonia code within all pneumonia diagnoses, we can discard the other records. We'll also extract the category code from each cause of death.

. contract cause if pneumonia == 1, percent(percent) freq(deaths)

. icd10 generate catcode=cause, category

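. list cause deaths percent, sepby(catcode) noobs

  +--------------------------+
  | cause   deaths   percent |
  |--------------------------|
  |  J120       14      0.03 |
  |  J121       18      0.04 |
  |  J122        3      0.01 |
  |  J128        3      0.01 |
  |  J129      133      0.27 |
  |--------------------------|
  |   J13      277      0.56 |
  |--------------------------|
  |   J14       31      0.06 |
  |--------------------------|
  |  J150       88      0.18 |
  |  J151      231      0.47 |
  |  J152      625      1.26 |
  |  J153        2      0.00 |
  |  J154      189      0.38 |
  |  J155       25      0.05 |
  |  J156       24      0.05 |
  |  J157       24      0.05 |
  |  J158       39      0.08 |
  |  J159     1264      2.55 |
  |--------------------------|
  |  J180      955      1.92 |
  |  J181     1775      3.57 |
  |  J182       64      0.13 |
  |  J188       12      0.02 |
  |  J189    43864     88.33 |
  +--------------------------+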

There are several codes with few deaths, so we'll collapse them all into a single “All other pneumonia diagnoses” category and have icd10 generate add WHO's description of the code to our dataset for the others. Let's keep all codes with at least 1% of the total.

. icd10 generate descr = cause if percent >= 1, description addcode(begin) version(2009) 

. replace descr = "All other pneumonia diagnoses" if percent < 1
(17 real changes made)

Note that with this icd10 generate command, we needed to specify version(2009) to ensure that icd10 added the descriptions from 2009. We need one more trick to make our graph. We're going to contract our data again, this time by our newly created description variable, and then indicate whether the description is grouped or not.

. contract descr [fw=deaths], freq(tdeaths) percent(tpercent)

. generate grouped = descr=="All other pneumonia diagnoses"

. graph hbar (asis) tpercent, over(descr, sort(tpercent) descending
>      axis(outergap(-5))) over(grouped, axis(off)) nofill
>      blabel(bar, format(%3.1f))
>      title("2010 US Mortality", span)
>      subtitle("Top 5 Pneumonia Diagnoses", span) name(pctg)

Finally, we have our pneumonia mortality graph, showing nicely formatted ICD-10 codes labeled with the official WHO descriptions.

The ICD-10 codes used in Stata are copyrighted to WHO. The copyright information can be found in the ICD-10 copyright notification.

Tell me more

You can read more about ICD coding, including tips for working with records with multiple diagnosis codes, in the Introduction to ICD commands.
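As a rough illustration of the multiple-diagnosis case, suppose each record stored up to three codes in variables dx1, dx2, and dx3. These variable names are hypothetical and do not exist in vital10.dta. One way to flag records with a pneumonia code in any position is to loop over the variables and combine the per-variable indicators:

```stata
* Hypothetical layout: diagnosis codes stored in dx1, dx2, and dx3.
generate byte any_pneumonia = 0

foreach v of varlist dx1 dx2 dx3 {
    * Indicator for a pneumonia code in this position.
    * tmp_ind is a scratch variable, dropped after each pass.
    icd10 generate tmp_ind = `v', range(J12/J189)
    replace any_pneumonia = 1 if tmp_ind == 1
    drop tmp_ind
}

tabulate any_pneumonia
```

Testing tmp_ind == 1 explicitly means that positions with no code recorded never switch the flag on, whatever value the indicator takes for them.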

Also, see worked examples for the individual coding systems

For more about the other commands used above, see