Home  /  Resources & support  /  FAQs  /  Converting string ICD-9 codes to numeric

This FAQ is an expanded version of an earlier reply to a question on Statalist. This question, or a some variation of it, is one of the most common ICD-related topics that we see in Technical Support.

How do I convert my ICD-9 codes from a string type to a numeric type?

Title   Converting string ICD-9 codes to numeric
Author Rebecca Pope, StataCorp

The short answer is

you can use the destring command or the real() function. But, in most cases, it is a bad idea to convert codes from string to numeric because you can lose information. For details, see the long answer below.

If you are asking this question because you want to check for a range of codes, see

How do I check a variable for a range of diagnosis or procedure codes?

Or, if you want to label the codes with their descriptions, see

How do I label my diagnosis or procedure codes with their descriptions?

Note that you will not save space by storing full ICD-9 codes as numeric variables. Because you will likely use a dot with a numeric variable, Stata will store these values as doubles to retain the correct numeric precision, requiring 8 bytes of storage space. By contrast, storing a string variable with an implied dot needs only 5 characters, or 5 bytes.

Long answer

Mechanically, the answer to this question is simple. To convert from string to numeric data in Stata, you have your choice between

. destring stringvar, generate(numvar)


. generate numvar = real(stringvar)

Which you choose is mostly a matter of personal preference. destring returns an error if it encounters nonnumeric values. generate with the real() function bypasses the error, returning missing values instead.

However, ICD-9 and ICD-9-CM codes are not ordinary numerical strings. ICD-9 codes are diagnosis codes that are of the form ###, ###.#, E###, or E###.#, where # indicates a digit from 0 to 9. In the U.S., ICD-9 is commonly used as a shorthand for referring to ICD-9-CM codes. These are diagnosis codes of the form ###, ###.#, ###.##, E###, E###.#, V##, V##.#, or V##.## and procedure codes of the form ##, ##.#, or ##.##.

Obviously, the codes that start with E and V have no numeric representation. We can see this interactively by typing

. display real("E800.1")

You have to confront the issue of leading and trailing zeros even if you do not have any of these E-codes (external cause of injury) and V-codes (nondisease encounter with the system) in your data.

Consider the following data:

        dx                          descr 
    1.	25013      dmi ketoacd uncontrold 
    2.	0389               septicemia nos 
    3.	3890     conductive hearing loss* 
    4.	0030         salmonella enteritis 
    5.	030                      leprosy* 
    6.  410     acute myocardial infarct* 
    7.  4100      ami anterolateral wall* 
    8.  41000    ami anterolateral,unspec 

Preserving information by including dots

Many administrative systems store codes with what is called an "implied dot", meaning that the code 250.13, for example, is stored as "25013". Whether a dot is present matters not at all when you convert between string and numeric for a code like 250.13. However, it matters quite a bit for a code like 038.9, because if you convert your string codes to a numeric storage type, they will not retain the leading zeros.

To see this, create two new variables called num_dx and num_dot. num_dx is a straight numeric conversion of the string in dx.

. destring dx, generate(num_dx)

Now, create a version of dx that includes the dot, and then destring that variable.

. clonevar dx_dot = dx
. icd9 clean dx_dot, dots
. destring dx_dot, generate(num_dot)

	dx      num_dx   dx_dot   num_dot 
    1.	25013    25013   250.13    250.13 
    2.	0389       389   038.9       38.9 
    3.	3890      3890   389.0        389 
    4.	0030        30   003.0          3 
    5.	030         30   030           30 

Looking at rows 4 and 5, we see that the leading zeros are stripped from the string during the string-to-numeric conversion of dx, meaning that we lost the distinction between a 3-digit and a 4-digit code. Thus, when the dot is omitted, 003.0 and 030 both map to the numeric value 30. However, if we destring dx_dot instead, 003.0 maps to 3 and 030 maps to 30, not an altogether bad approximation of the values. It is tempting to conclude that as long as we only destring codes with dots, we are protected against errors. But this is not the case.

Losing information by including dots

Observe the results for the last three rows of the data:

	dx      num_dx   dx_dot   num_dot 
    6.	410        410   410          410 
    7.	4100      4100   410.0        410 
    8.	41000    41000   410.00       410 

In this case, num_dx, the variable created by converting our original string diagnosis, provides the best results. num_dot, the variable created by converting the string diagnosis code that includes a dot, results in all values being equal to 410. 410.00 is a billable code, the other two codes are not.

If you work with prevalidated data, you may never encounter this situation. However, if you are working with data that have not been cleaned by a third party to verify that all codes are complete (known as "terminal codes" in coding jargon), you are likely to encounter the problem of category codes and subcategory codes mapping to the same numeric value.

A hybrid solution

What about combining the two approaches? We could convert some codes from dot form and some from implied-dot form. Here is one way to do the prep work:

. icd9 clean dx
. clonevar dx_dot = dx
. icd9 clean dx_dot, dots
. generate toconvert = dx if substr(dx,-2,2)=="00"
. replace toconvert = dx_dot if missing(toconvert)
. destring toconvert, generate(hybrid)

	dx      hybrid 
    1.	25013   250.13 
    2.	0389      38.9 
    3.	3890       389 
    4.	0030         3 
    5.	030         30 
    6.  410        410 
    7.  4100      4100 
    8.  41000    41000 

All appears well until two patients show up, one with a neurotic disorder (perhaps the result of data management for ICD-9 codes) and one with lepromatus leprosy.

	dx     hybrid 
    9.	300       300 
    10.	0300      300 

The challenge of recovering original codes

The above examples are clearly contrived to highlight problems you might encounter. But how does this conversion affect real datasets? To get a sense of the scope of the problem, we can use the nhds2010.dta sample dataset. This dataset is an extract of all same-day discharges from U.S. hospitals for adults in 2010 taken from the National Hospital Discharge Survey (NHDS).

The diagnosis codes in the NHDS are prevalidated, so we do not have to worry about category codes appearing alongside terminal codes. In a way, this is an ideal case.

In our example dataset, we have 779 unique string diagnosis codes. We destring the variable containing the primary diagnosis code, formatted with a dot.

. icd9 clean dx1, dots
(2210 changes made)

. destring dx1, generate(num_dx1) force
dx1 contains nonnumeric characters; num_dx1 generated as double
(66 missing values generated)

We are left with 749 distinct codes, but this is because we lost 30 distinct V-codes in 66 records. So in this prevalidated dataset, we do not have the problem of multiple string codes mapping to the same numeric value.

What happens when we try to recover our string codes using tostring? Obviously, we lost the V-code information, but let's assume for now that it has been stored in another variable and focus on the string codes that have number values only. We can convert the numeric codes back to string using tostring.

. tostring num_dx1, generate(str_dx1) format(%06.2f)
str_dx1 generated as str6

. replace str_dx1="" if missing(num_dx1)
(66 real changes made)

Now, we can check how often these codes match the original.

. count if dx1 != str_dx1 & !missing(str_dx1)

We get back accurate information for less than half of our 2,210 observations.

In the original primary diagnosis variable (dx1), all codes are defined and valid. However, even though we maintained a unique mapping for our numeric values, we do not recover accurate codes.

. icd9 check str_dx1, generate(prob)
(str_dx1 contains 66 missing values)

str_dx1 contains undefined codes:

    1.  Invalid placement of period                       0
    2.  Too many periods                                  0
    3.  Code too short                                    0
    4.  Code too long                                     0
    5.  Invalid 1st char (not 0-9, E, or V)               0
    6.  Invalid 2nd char (not 0-9)			  0
    7.  Invalid 3rd char (not 0-9)			  0
    8.  Invalid 4th char (not 0-9)			  0
    9.  Invalid 5th char (not 0-9)			  0
   10.  Code not defined			      1,046
        Total					      1,046

The string-numeric-string conversion resulted in missing information for 66 observations. The conversion also resulted in 1,048 incorrect codes, of which 1,046 are undefined codes. However, two records have incorrect diagnoses that appear to be legitimate.

We can generate descriptions for the original and recovered ICD-9 code and then compare them. For our data, they are

        dx1                right_descr   str_dx1                wrong_descr 
174.    173.1   malig neo skin eyelid#    173.10   mal neo eyelid/canth nos 
175.    173.5    malig neo skin trunk#    173.50   malig neo skin trunk nos 

In this case, we've replaced the code that was correct and current in 2010 with one that did not go into effect until 2011 (you can find this information by typing icd9 query).

In conclusion, there is no "lossless" way to convert string ICD-9 codes to a numeric storage type.