How to convert categorical string data to numeric data

Sometimes, categorical data are stored as strings. For example, the variable race may be stored with the words "Black", "Other", and "White". We will need to convert these variables to numeric data before we can use them with Stata's statistical features.

Let's begin by opening and describing an example dataset from the Stata website.

. use https://www.stata.com/users/youtube/rawdata.dta, clear
(Fictitious data based on the National Health and Nutrition Examination Survey)

. describe

Contains data from https://www.stata.com/users/youtube/rawdata.dta
 Observations:         1,268                  Fictitious data based on the
                                                National Health and Nutrition
                                                Examination Survey
    Variables:            10                  6 Jul 2016 11:17
                                              (_dta has notes)
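--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
id              str6    %9s                   Identification Number
age             byte    %9.0g
sex             byte    %9.0g                 Sex
race            str5    %9s                   Race
height          float   %9.0g                 height (cm)
weight          float   %9.0g                 weight (kg)
sbp             int     %9.0g                 Systolic blood pressure (mm/Hg)
dbp             int     %9.0g                 Diastolic blood pressure (mm/Hg)
chol            str3    %9s                   serum cholesterol (mg/dL)
dob             str18   %18s
--------------------------------------------------------------------------------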
Sorted by: id
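
When a dataset has many variables, scanning the describe output for string storage types gets tedious. As a small sketch, the ds command can list only the string variables. For this dataset, the result should be id, race, chol, and dob, the four variables with a str storage type in the output above. Only race is treated as a categorical variable in this example.

. * list the names of all string variables in the dataset
. ds, has(type string)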

The storage type for the variable race is str5, a string that holds up to 5 characters. Let's tabulate race to view the categories.

. tabulate race

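       Race |      Freq.     Percent        Cum.
------------+-----------------------------------
      Black |        176       13.88       13.88
      Other |         22        1.74       15.62
      White |      1,070       84.38      100.00
------------+-----------------------------------
      Total |      1,268      100.00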

There are three categories, stored as the strings Black, Other, and White. We can use Stata's encode command to generate a new numeric variable named racen.

. encode race, gen(racen)

Let's type tabulate race racen to view a cross-tabulation of the two variables and list race racen in 1/5 to view some raw data.

. tabulate race racen

. list race racen in 1/5

     +---------------+
     |  race   racen |
     |---------------|
  1. | White   White |
  2. | White   White |
  3. | White   White |
  4. | Black   Black |
  5. | White   White |
     +---------------+

The two variables appear to be identical in the listing. Next, let's describe both variables.

. describe race racen

Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
race            str5    %9s                   Race
racen           long    %8.0g      racen      Race

The storage type for race is "str5", and the storage type for racen is "long", which is a type of numeric variable. You can type help data_types to learn more about different types of numeric data. Notice that the Value label for racen is "racen". Let's type label list racen to view the labels.

. label list racen
racen:
           1 Black
           2 Other
           3 White
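
To see these codes in the data themselves, one option is to repeat the earlier listing with the nolabel option, which asks list to display the stored numbers rather than the value labels. Given the labels above, racen should show 3 where race is White and 1 where race is Black.

. * show the underlying numeric codes instead of the value labels
. list race racen in 1/5, nolabel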

The variable racen is a numeric variable where 1 represents Black, 2 represents Other, and 3 represents White. This will allow us to use racen with Stata's statistical features such as regression modeling.
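
As an illustration of that last point, the sketch below uses racen as a categorical predictor of systolic blood pressure (sbp, from the describe output above). The choice of outcome and model is an assumption made for this example, not an analysis from this tutorial. The i. prefix tells Stata to treat racen as categorical, and ib3. selects code 3 (White) as the base category in place of the default, which is the lowest code (Black).

. * racen as a categorical predictor; the base category is code 1 (Black)
. regress sbp i.racen

. * same model with code 3 (White) as the base category
. regress sbp ib3.racen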

Note that there is a decode command that will do the reverse of encode: it will convert labeled numeric categorical variables to string variables.

. decode racen, gen(races)

We can use describe and list to verify that it worked.

. describe race racen races

. list race racen races in 1/5

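     +-----------------------+
     |  race   racen   races |
     |-----------------------|
  1. | White   White   White |
  2. | White   White   White |
  3. | White   White   White |
  4. | Black   Black   Black |
  5. | White   White   White |
     +-----------------------+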
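
The listing shows only the first five observations. To check the round trip for all 1,268 observations, a minimal sketch is to assert that the decoded strings equal the originals. assert is silent when the condition holds for every observation and stops with an error otherwise.

. * verify that decode reproduced the original strings in every observation
. assert race == races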

The raw data look the same for all three variables, but, as we have learned, the storage type is important. And now we know how to convert between types when necessary.
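
One refinement: encode numbers the categories in alphabetical order of the strings, which is why Black is 1 and White is 3. If a different coding is wanted, one approach is to define the value label first and pass it to encode with the label() option. The names racelbl and racen2 below are hypothetical, chosen only for this sketch. The noextend option makes encode refuse to run if race contains a string that is not already in racelbl.

. * define the desired coding first (racelbl is a hypothetical label name)
. label define racelbl 1 "White" 2 "Black" 3 "Other"

. * encode using that label (racen2 is a hypothetical variable name)
. encode race, gen(racen2) label(racelbl) noextend

. * confirm the mapping that was applied
. label list racelbl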
