
## Commands for working with ICD codes

### Highlights

• Designed for use with
  • The US National Center for Health Statistics (NCHS) ICD-10-CM diagnosis codes for healthcare encounter and claims data
  • The US Centers for Medicare and Medicaid Services (CMS) ICD-10-PCS procedure codes for healthcare claims data
  • The World Health Organization's ICD-10 codes for morbidity and mortality reporting
  • NCHS ICD-9-CM diagnosis codes for healthcare encounter and claims data
  • CMS ICD-9-CM procedure codes for healthcare claims data
• Suite of data-management commands lets you
  • Easily generate new variables based on codes
    • Indicators for different conditions
    • Short descriptions
    • Category codes from billable codes
    • And more
  • Verify that a variable contains valid codes and flag invalid codes
  • Standardize the format of codes
• Interactive utilities let you
  • Look up descriptions for codes
  • Search for codes from keywords
• ICD-10 and ICD-10-CM/PCS commands let you indicate the version of the codes in your dataset

Information about diagnoses and procedures in administrative healthcare data is often encoded using one of the ICD coding systems. For example, the standard system for mortality reporting has been the World Health Organization's ICD-10 system since 1999. Since October 2015, the US has used ICD-10-CM to encode diagnoses and ICD-10-PCS to encode procedures.

When administrative data are gathered from multiple sources, the format of the codes may not be fully standardized. There may also be reporting errors. Finally, the sheer number of codes available in these encoding systems means that analyzing the data in a meaningful way is often impossible without summarizing information.

Stata has a suite of commands for working with ICD codes, known collectively as the icd commands. Whether you want to add text to codes or create indicator variables, want to verify that the codes in your data are valid, or are using the codes as a step in a larger project, the icd commands provide valuable tools for reporting and research.
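As a minimal sketch of the first two problems (inconsistent formats and reporting errors), suppose a dataset holds ICD-10 codes in a string variable called dx. The names dx, dx_problem, and dx_std are hypothetical, and the format options accepted by icd10 clean differ across Stata releases, so none are specified here.

```stata
* Hypothetical string variable dx holds ICD-10 codes from several sources

* Flag malformed or undefined codes; dx_problem records the type of problem
icd10 check dx, generate(dx_problem)

* Store a copy of the codes in one consistent format, leaving dx untouched
* (clean may refuse to run if dx still contains malformed codes)
icd10 clean dx, generate(dx_std)
```

Writing the cleaned codes to a new variable rather than using replace keeps the original values available for auditing.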

### Show me

Suppose we are conducting a study of mortality in the United States in 2010. We have vital statistics data from the CDC that contain records on more than 2.4 million deaths.

. use female agerc cause place using vital10.dta, clear
(US mortality data, 2010 -- CDC Vital Statistics)

. describe

Contains data from vital10.dta
  obs:     2,472,542                          US mortality data, 2010 -- CDC Vital Statistics
 vars:             4                          31 Mar 2015 13:46
 size:    32,143,046

              storage   display    value
variable name   type    format     label      variable label

female          float   %9.0g      female     Decedent is female, female=1, male=0
place           byte    %8.0g      pod        Place of death and status
cause           str4    %9s                   Cause of death (ICD-10 code)
agerc           float   %14.0g     agerc      Age, Census recode

Sorted by:


We want to identify all deaths that are due to respiratory illnesses. Any of 275 codes can currently be used to define a respiratory illness, far more than we would ever want to type! A plausible alternative is to use a lookup table, but definitions are often provided in terms of a range of codes, leaving you to type the codes at least once to create the lookup table anyway.

Because the CDC reported mortality using ICD-10 codes in 2010, we can use the icd10 commands to make our work easier.

We might want to start by verifying that all of the codes in our data are indeed valid codes and use the same format for storage. The default version for icd10 is codes from 2016, but we need to make sure we specify the version that applies for our data.
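When deciding which version to specify, it may help to review what the installed code tables contain. A minimal sketch, assuming the icd10 query subcommand is available in your release; it displays information about the source of the codes and the edits made to them, and its output is not reproduced here.

```stata
* Show the source of the installed ICD-10 codes and the edits made to them
icd10 query
```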

We start by using icd10 check with version(2010).

. icd10 check cause, version(2010)
(cause contains no missing values)

cause contains undefined codes:

1.  Invalid placement of period                       0
2.  Too many periods                                  0
3.  Code too short                                    0
4.  Code too long                                     0
5.  Invalid 1st char (not A-Z)                        0
6.  Invalid 2nd char (not 0-9)                        0
7.  Invalid 3rd char (not 0-9)                        0
8.  Invalid 4th char (not 0-9)                        0
77.  Valid only for previous versions             15,177
88.  Valid only for later versions                     0
99.  Code not defined                                  0
___________
Total                                        15,177



We discover that more than 15,000 records use codes that were valid only in earlier versions. Out of 2.4 million, that isn't such a bad error rate, but could we do better with the 2009 version? Let's specify version(2009) and add the summary option to list the codes with any problems. We'll also use generate() to create a variable that indicates the type of problem that icd10 check finds.

. icd10 check cause, version(2009) generate(problem09) summary
(cause contains no missing values)

cause contains undefined codes:

1.  Invalid placement of period                       0
2.  Too many periods                                  0
3.  Code too short                                    0
4.  Code too long                                     0
5.  Invalid 1st char (not A-Z)                        0
6.  Invalid 2nd char (not 0-9)                        0
7.  Invalid 3rd char (not 0-9)                        0
8.  Invalid 4th char (not 0-9)                        0
77.  Valid only for previous versions                  0
88.  Valid only for later versions                 2,188
99.  Code not defined                                  0
___________
Total                                         2,188

Summary of invalid and undefined codes

cause   Count   Problem

A099    2135   Valid only for later versions
R636      43   Valid only for later versions
R263      10   Valid only for later versions



We think all respiratory diagnoses fall in the range of J10 to J98.9, but we might want to look up these three codes real quick just to set our minds at ease.

. icd10 lookup A09.9 R63.6 R26.3

A09.9 Gastroenteritis and colitis of unspecified origin
R26.3 Immobility
R63.6 Insufficient intake of food and water due to self neglect
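
The lookup above goes from codes to descriptions. The search utility mentioned in the highlights goes the other way, from keywords to codes, and lookup also accepts a range of codes. A brief sketch; the keyword and range are chosen only for illustration, and the output is omitted because it depends on the version requested.

```stata
* Find codes whose description contains a keyword
icd10 search pneumonia, version(2009)

* Look up every code in a range instead of naming codes one by one
icd10 lookup J12/J18, version(2009)
```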



We're fairly confident that we can ignore these codes. They don't apply to our study of respiratory illness. We'll use 2009 as our reference year for the rest of our analysis. To generate our respiratory cause of death indicator (resp), we type

. icd10 generate resp = cause, range(J10/J989)
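
One way to sanity-check the new indicator is to compare it with a plain string comparison over the same bounds. This is only a sketch: resp_chk is a hypothetical name, and the comparison assumes cause is stored in uppercase without periods, as it appears in this dataset. The two approaches need not agree for codes that are undefined in the version in use.

```stata
* String comparison over the same bounds used in range(J10/J989)
generate byte resp_chk = inrange(cause, "J10", "J989")

* Off-diagonal cells would identify codes the two approaches treat differently
tabulate resp resp_chk, missing
```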


We may wish to further examine deaths from pneumonia. For example, if we want to add an indicator for a pneumonia cause of death only to those decedents that we already know have a respiratory diagnosis, we can type

. icd10 generate pneumonia = cause if resp==1, range(J12/J189)

. tabulate pneumonia

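  pneumonia |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    187,594       79.07       79.07
          1 |     49,660       20.93      100.00
------------+-----------------------------------
      Total |    237,254      100.00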



We see that about 21% of all deaths from respiratory illnesses in the US in 2010 were from pneumonia.
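
Because the indicator was created with an if qualifier, it is missing for every record without a respiratory cause of death, which is why the table total is 237,254 rather than 2.4 million. A short sketch of how we might confirm that and recover the same percentage from the mean of the 0/1 indicator:

```stata
* Records outside the respiratory group have pneumonia set to missing
count if missing(pneumonia)

* The mean of a 0/1 indicator is the proportion of ones
summarize pneumonia if resp == 1
display %4.1f 100*r(mean) "%"
```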

Now suppose we were giving a presentation and wanted to show a graph of common pneumonia diagnoses among decedents with a pneumonia-related cause of death. For this, we'll want to combine the icd10 generate command with a few other data management commands and Stata's graph hbar command.

First let's speed things up a bit by using contract to make a dataset of frequencies. Because we only want to graph the relative frequency of each pneumonia code within all pneumonia diagnoses, we can discard the other records. We'll also extract the category code from each cause of death.

. contract cause if pneumonia == 1, percent(percent) freq(deaths)

. icd10 generate catcode=cause, category

. list cause deaths percent, sepby(catcode) noobs

cause   deaths   percent

J120       14      0.03
J121       18      0.04
J122        3      0.01
J128        3      0.01
J129      133      0.27

J13       277      0.56

J14        31      0.06

J150       88      0.18
J151      231      0.47
J152      625      1.26
J153        2      0.00
J154      189      0.38
J155       25      0.05
J156       24      0.05
J157       24      0.05
J158       39      0.08
J159     1264      2.55

J180      955      1.92
J181     1775      3.57
J182       64      0.13
J188       12      0.02
J189    43864     88.33



There are several codes with few deaths, so we'll collapse them all into a single “All other pneumonia diagnoses” category and have icd10 generate add WHO's description of the code to our dataset for the others. Let's keep all codes with at least 1% of the total.

. icd10 generate descr = cause if percent >= 1, description addcode(begin) version(2009)

. replace descr = "All other pneumonia diagnoses" if percent < 1


Note that with this icd10 generate, we needed to specify version(2009) to ensure icd10 added the descriptions from 2009. We need one more trick to make our graph. We're going to contract our data again, this time by our newly-created description variable, and then indicate whether the description is grouped or not.

. contract descr [fw=deaths], freq(tdeaths) percent(tpercent)

. generate grouped = descr=="All other pneumonia diagnoses"

. graph hbar (asis) tpercent, over(descr, sort(tpercent) descending
>     axis(outergap(-5))) over(grouped, axis(off)) nofill
>     blabel(bar, format(%3.1f))
>     title("2010 US Mortality", span)
>     subtitle("Top 5 Pneumonia Diagnoses", span) name(pctg)


Finally, we have our pneumonia mortality graph, showing nicely-formatted ICD-10 codes, labeled with the official WHO descriptions.

### Show me more

You can read more about ICD coding, including tips for working with records with multiple diagnosis codes, in the Introduction to ICD commands.
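
As one illustration of the multiple-diagnosis case, here is a minimal sketch that assumes a wide layout with hypothetical variables dx1, dx2, and dx3, each holding one ICD-10 code per record. It is not taken from the manual; it simply repeats the range() approach used above for each variable and combines the results.

```stata
* Hypothetical wide layout: dx1, dx2, and dx3 each hold one ICD-10 code
generate byte anyresp = 0

foreach v of varlist dx1 dx2 dx3 {
    * Indicator for a respiratory code in this diagnosis field
    icd10 generate tmp_`v' = `v', range(J10/J989)

    * Flag the record if any field holds a respiratory code
    replace anyresp = 1 if tmp_`v' == 1
    drop tmp_`v'
}
```

Records with no respiratory code in any field keep anyresp equal to 0, including records whose diagnosis fields are empty.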

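Also, see the worked examples for the individual coding systems in the manual entries for icd9, icd9p, icd10, icd10cm, and icd10pcs.

For more about the other commands used above, see the manual entries for contract, list, tabulate, and graph bar.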