
Title | How to add empty factor-variable levels to a table? | |
Authors |
Mia Lv, StataCorp Jeff Pitblado, StataCorp |
When describing a subgroup of your data with the dtable or table command in Stata, you may notice that some categories of your categorical variables are missing from the output if they do not appear in your sample—even though those categories exist in the full dataset. But you would like to include all the possible categories (including those with zero frequency) in the table so that your table dimension remains consistent across different samples.
In this FAQ, I will demonstrate how to manually add those categories using the collect suite of commands through four practical examples. It is worth mentioning that there is a separate FAQ discussing the general case when you want to add a custom item to an existing table. It is good to read that FAQ first because it will help you become familiar with the basics of table dimensions and using collect to modify table dimensions in Stata.
Below is a table generated with the dtable command:
Car origin |
Domestic Foreign Total |
N 22 (68.8%) 10 (31.2%) 32 (100.0%) |
Repair record 1978 |
2 4 (18.2%) 0 (0.0%) 4 (12.9%) |
3 12 (54.5%) 1 (11.1%) 13 (41.9%) |
4 6 (27.3%) 3 (33.3%) 9 (29.0%) |
5 0 (0.0%) 5 (55.6%) 5 (16.1%) |
And below is that same table after we add a new row:
Car origin |
Domestic Foreign Total |
N 22 (68.8%) 10 (31.2%) 32 (100.0%) |
Repair record 1978 |
1 0 (0.0%) 0 (0.0%) 0 (0.0%) |
2 4 (18.2%) 0 (0.0%) 4 (12.9%) |
3 12 (54.5%) 1 (11.1%) 13 (41.9%) |
4 6 (27.3%) 3 (33.3%) 9 (29.0%) |
5 0 (0.0%) 5 (55.6%) 5 (16.1%) |
Here is the code to create the table, including the code that adds the new row:
sysuse auto, clear dtable if price>5000 & trunk>10, factor(rep78) by(foreign) * Get the list of displayed rep78 categories collect levelsof rep78 local found `"`s(levels)'"' * Get the full list of all the rep78 categories levelsof rep78, local(all) * Obtain the list for the missing categories of rep78 by comparing these two lists local not_found: list all - found display "Empty levels for rep78 include: `not_found'" collect levelsof foreign local column_levels `"`s(levels)'"' foreach x of local not_found { foreach k of local column_levels{ * Add zero fvfrequency and zero fvpercent for each missing category of rep78 and each category of foreign collect get fvfrequency=0 fvpercent=0, tags(var[`x'.rep78] foreign[`k']) } } collect style autolevels result _dtable_stats, clear collect layout
This code creates a descriptive table of repair records (rep78) by car origin (foreign) for cars with price>5000 and trunk>10. In this code, we first store the rep78 categories found in the table and compare them with all possible rep78 values in the dataset to identify missing categories (1). For each missing repair record category, we add zero fvfrequency and zero fvfrequency for all the foreign categories (0,1) using collect get within a foreach loop.
Please note that fvfrequency represents the count (frequency) of observations for a given factor-variable level, while fvpercent indicates the percentage of observations in that level. For more details, see [R] dtable, which provides formal definitions and additional information about these statistics.
Below is a table generated with the dtable command:
Race |
White Black Total |
N 142 (57.7%) 104 (42.3%) 246 (100.0%) |
Married |
Single 34 (23.9%) 49 (47.1%) 83 (33.7%) |
Married 108 (76.1%) 55 (52.9%) 163 (66.3%) |
College graduate |
Not college grad 103 (72.5%) 85 (81.7%) 188 (76.4%) |
College grad 39 (27.5%) 19 (18.3%) 58 (23.6%) |
Age in current year 35.324 (0.669) 35.327 (0.630) 35.325 (0.651) |
Hourly wage 7.522 (5.793) 5.970 (4.843) 6.866 (5.456) |
Usual hours worked 38.408 (10.512) 38.126 (8.235) 38.290 (9.603) |
And below is that same table after we add a new column:
Race |
White Black Other Total |
N 142 (57.7%) 104 (42.3%) 246 (100.0%) |
Married |
Single 34 (23.9%) 49 (47.1%) 0 (0.0%) 83 (33.7%) |
Married 108 (76.1%) 55 (52.9%) 0 (0.0%) 163 (66.3%) |
College graduate |
Not college grad 103 (72.5%) 85 (81.7%) 0 (0.0%) 188 (76.4%) |
College grad 39 (27.5%) 19 (18.3%) 0 (0.0%) 58 (23.6%) |
Age in current year 35.324 (0.669) 35.327 (0.630) . (.) 35.325 (0.651) |
Hourly wage 7.522 (5.793) 5.970 (4.843) . (.) 6.866 (5.456) |
Usual hours worked 38.408 (10.512) 38.126 (8.235) . (.) 38.290 (9.603) |
Here is the code to create the table, including the code that adds the new column:
sysuse nlsw88, clear dtable if south==1 & age<37, by(race) factor(married collgrad) continuous(age wage hours) collect levelsof race local found `"`s(levels)'"' levelsof race, local(all) local not_found: list all - found display "Missing categories for race include: `not_found'" collect levelsof var local rows `"`s(levels)'"' * Separate continuous and categorical variables excluding _N and _hide local continuous_vars "" local categorical_vars "" foreach v of local rows { * Skip _N and _hide if "`v'" == "_N" | "`v'" == "_hide" { continue } * Check if this is a categorical variable (contains a dot) else if strpos("`v'", ".") > 0 { local categorical_vars `categorical_vars' `v' } else { local continuous_vars `continuous_vars' `v' } } display "Continuous variables: `continuous_vars'" display "Categorical variables: `categorical_vars'" foreach x of local not_found { * Add missing mean and sd for continuous variables foreach k of local continuous_vars { collect get mean=. sd=., tags(race[`x'] var[`k']) } * Add zero fvfrequency and fvpercent for categorical variables foreach k of local categorical_vars { collect get fvfrequency=0 fvpercent=0, tags(race[`x'] var[`k']) } } * Update the auto levels for race collect levelsof race collect style autolevels race `s(levels)', clear collect style autolevels result _dtable_stats, clear collect layout
This code creates a descriptive table for women in the nlsw88 dataset who live in the South and are under 37, broken down by race. It includes both categorical variables (married and collgrad) and continuous variables (age, wage, and hours).
In this code, we first identify any missing race categories in the filtered sample by comparing race values in the table against all possible race values in the dataset. We discover that race=3 is missing from the table. We then separate the levels in the collect dimension var (for rows) into two groups: continuous and categorical by checking for dots in their level names. Levels for categorical variables contain a dot (for example, "0.married"), while continuous variables do not. Special var levels like _N and _hide are explicitly excluded from this process. For each missing race category, we add missing values (.) for means and standard deviations of continuous variables, and zeros for frequencies and percentages of categorical variables.
Examples 3 and 4 follow a similar method.
Below is a table generated with the table command:
Frequency | ||
Repair record 1978 | ||
3 | 1 | |
4 | 6 | |
5 | 6 | |
Total | 13 | |
And below is that same table after we add two new rows:
Frequency | ||
Repair record 1978 | ||
1 | 0 | |
2 | 0 | |
3 | 1 | |
4 | 6 | |
5 | 6 | |
Total | 13 | |
Here is the code to create the table, including the code that adds two new rows:
sysuse auto, clear table (rep78) if price > 5000 & foreign == 1 collect levelsof rep78 local found `"`s(levels)'"' levelsof rep78, local(all) local not_found: list all - found display "Missing categories: `not_found'" foreach x of local not_found { collect get frequency=0, tags(rep78[`x']) } collect layout
Below is a table generated with the table command:
Race | ||
White Black Total | ||
College graduate | ||
Not college grad | ||
Mean | 6.92 4.95 6.03 | |
Standard deviation | 6.35 2.20 5.01 | |
College grad | ||
Mean | 9.11 10.53 9.57 | |
Standard deviation | 3.58 9.21 5.98 | |
Total | ||
Mean | 7.52 5.97 6.87 | |
Standard deviation | 5.79 4.84 5.46 | |
And below is that same table after we add a new column:
Race | ||
White Black Other Total | ||
College graduate | ||
Not college grad | ||
Mean | 6.92 4.95 . 6.03 | |
Standard deviation | 6.35 2.20 . 5.01 | |
College grad | ||
Mean | 9.11 10.53 . 9.57 | |
Standard deviation | 3.58 9.21 . 5.98 | |
Total | ||
Mean | 7.52 5.97 . 6.87 | |
Standard deviation | 5.79 4.84 . 5.46 | |
Here is the code to create the table, including the code that adds the new column:
sysuse nlsw88, clear table (collgrad) (race) if south==1 & age<37, statistic(mean wage) statistic(sd wage) nformat(%6.2f mean sd) collect levelsof race local found `"`s(levels)'"' levelsof race, local(all) local not_found: list all - found display "Missing categories for race include: `not_found'" collect levelsof collgrad local rows `"`s(levels)'"' foreach x of local not_found{ foreach k of local rows { collect get mean=. sd=., tags(race[`x'] collgrad[`k']) } } collect layout