Sometimes, data that look like numbers are actually stored as strings. We will need to convert these variables to numeric data before we can use them with Stata's statistical features.
Let's begin by opening an example dataset from the Stata website and listing the first five observations for the variable chol.
. use https://www.stata.com/users/youtube/rawdata.dta, clear
(Fictitious data based on the National Health and Nutrition Examination Survey)

. list chol in 1/5

     +------+
     | chol |
     |------|
  1. |  280 |
  2. |  280 |
  3. |  219 |
  4. |  198 |
  5. |  231 |
     +------+
The data for chol appear to be numbers. Let's type summarize chol to estimate some descriptive statistics.
. summarize chol

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        chol |          0
The output shows 0 observations, and the mean, standard deviation, minimum, and maximum are empty. This is our first clue that chol may be stored as a string variable. We can verify this by describing the data.
. describe chol

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------
chol            str3    %9s                   serum cholesterol (mg/dL)
The Storage type for the variable chol is "str3". This means that chol is stored as a string variable that holds up to three characters. We can create a numeric variable named choln from chol using destring.
. destring chol, gen(choln)
chol: all characters numeric; choln generated as int
Now type list chol choln in 1/5.
. list chol choln in 1/5

     +--------------+
     | chol   choln |
     |--------------|
  1. |  280     280 |
  2. |  280     280 |
  3. |  219     219 |
  4. |  198     198 |
  5. |  231     231 |
     +--------------+
The data look the same, but we can use describe to verify that choln is stored as an "int" numeric variable. You can type help data_types to learn more about different types of numeric data.
. describe chol choln

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------
chol            str3    %9s                   serum cholesterol (mg/dL)
choln           int     %10.0g                serum cholesterol (mg/dL)
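Checking one variable at a time with describe works for a small example, but a larger dataset may hide several string variables. A minimal sketch, assuming the same example dataset, uses the ds command with its has(type string) option to list every variable stored as a string. The ds command is not part of the original walkthrough and is shown here only as one possible way to scan a dataset.

```stata
* Load the example dataset used in this post
use https://www.stata.com/users/youtube/rawdata.dta, clear

* List the names of all variables stored as strings
ds, has(type string)

* ds leaves the matching names in r(varlist) for later use
display "`r(varlist)'"
```

Any variable named in this list is a candidate for destring if its contents are meant to be numbers.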
We can also type summarize chol choln to verify that choln works with Stata's statistical features.
. summarize chol choln

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        chol |          0
       choln |      1,268    216.5418    46.88068         89        426
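Beyond looking at the first few rows, it can be reassuring to check the conversion across every observation. The sketch below is one way to do that, assuming the same dataset and the choln variable created with destring. It uses the real() function and the assert command, neither of which appears in the original walkthrough. Because destring reported that all characters were numeric, the assertion should pass silently.

```stata
* Recreate the numeric copy from the example above
use https://www.stata.com/users/youtube/rawdata.dta, clear
destring chol, generate(choln)

* real() returns the numeric value of a string, or missing if the
* string cannot be read as a number; assert stops with an error
* if any observation fails the comparison
assert choln == real(chol)

* Count observations where a nonempty string became a missing number
count if missing(choln) & !missing(chol)
```

A nonzero count in the last step would point to values that were lost in the conversion.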
Sometimes, strings that represent numbers include symbols such as "%" or "$". You can tell destring to ignore these symbols using the ignore() option.
There is also a related command named tostring that converts numeric data to string data. Let's convert choln back to a string to see how it works.
. tostring choln, gen(chols)
chols generated as str3
Now let's list and describe the three variables to check our work.
. list chol choln chols in 1/5

. describe chol choln chols

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------
chol            str3    %9s                   serum cholesterol (mg/dL)
choln           int     %10.0g                serum cholesterol (mg/dL)
chols           str3    %9s                   serum cholesterol (mg/dL)
The raw data look the same for all three variables, but, as we have learned, the storage type is important. And now we know how to convert between types when necessary.
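The ignore() option mentioned earlier can be illustrated with a small made-up example. The variable names rate and rate_n and the three values below are hypothetical and are not part of the example dataset. This is only a sketch of how the option might be used.

```stata
* Hypothetical toy data: percentages typed in as text
clear
input str5 rate
"12.5%"
"40%"
"7%"
end

* ignore("%") strips the percent sign before converting; without it,
* destring would find a nonnumeric character and not create rate_n
destring rate, generate(rate_n) ignore("%")

* Compare the string and numeric versions, then summarize the numbers
list rate rate_n
summarize rate_n
```

Listing the characters to ignore explicitly keeps the conversion conservative, because any other unexpected character still prevents destring from creating the new variable.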
You can watch a demonstration of these commands by clicking on the link to the YouTube video below. You can read more about these commands by clicking on the links to the Stata manual entries below.
Watch Data management: How to convert a string variable to a numeric variable.
Read more in the Stata Data Management Reference Manual; see [D] describe, [D] destring, and [D] save. In the Stata Base Reference Manual, see [R] summarize.