
Linear regression is a popular tool used to quantify the relationship between a continuous outcome variable and one or more predictor variables. The outcome variable is often called the “dependent” variable, and the predictor variables are often called “independent” variables. The predictor variables can be binary, categorical, or continuous. Here we will learn how to use Stata's regress command to fit simple linear regression models, and we will explore more sophisticated features later.

Let's begin by opening the nhanes2l dataset.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

Simple linear regression is often used to explore the linear relationship between two continuous variables. It can be useful to create a scatterplot to examine the relationship between the variables before fitting a linear regression model. Let's create a scatterplot for body mass index (bmi) and age.

. twoway scatter bmi age

We will use linear regression to identify a straight line that best describes the cloud of points in the scatterplot. “Best describes” is defined to be the line that has the smallest sum of squared vertical differences between each observation and the line. The line is defined by the equation below, and we will use our data to estimate the intercept and slope.

bmi = intercept + slope*age

Let's estimate the intercept and slope by typing regress followed by the outcome variable bmi followed by the predictor variable age.

. regress bmi age

      Source |       SS           df       MS      Number of obs   =    10,351
-------------+----------------------------------   F(1, 10349)     =    312.45
       Model |  7327.24496         1  7327.24496   Prob > F        =    0.0000
    Residual |  242696.917    10,349  23.4512433   R-squared       =    0.0293
-------------+----------------------------------   Adj R-squared   =    0.0292
       Total |  250024.162    10,350  24.1569239   Root MSE        =    4.8426

------------------------------------------------------------------------------
         bmi | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0488762   .0027651    17.68   0.000     .0434561    .0542963
       _cons |   23.21209   .1399079   165.91   0.000     22.93784    23.48633
------------------------------------------------------------------------------

The intercept is the coefficient labeled “_cons”, and the slope is the coefficient labeled “age”.

Let's use generate to create a new variable named bmi_predicted, which defines the regression line using the intercept and slope from our output.

. generate double bmi_predicted = 23.21209 + 0.0488762 * age
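
As an aside, we could avoid retyping the rounded coefficients: after regress, the estimates are stored in _b[_cons] and _b[age], and the predict command computes the fitted values directly. The variable names bmi_predicted2 and bmi_fitted below are arbitrary.

. generate double bmi_predicted2 = _b[_cons] + _b[age]*age

. predict double bmi_fitted, xb

Both new variables match bmi_predicted except for the rounding of the coefficients we typed by hand.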

Next let's use twoway line to plot our regression line with our scatterplot.

. twoway (scatter bmi age) (line bmi_predicted age)

Our regression line has a slight upward slope, which also suggests that larger values of age tend to be associated with larger values of bmi. The slope tells us that each additional year of age is associated with 0.049 more bmi. In the regression table, the p-value for the estimated age coefficient is labeled “P>|t|” and tests the null hypothesis that the coefficient equals zero (no relationship). The p-value equals 0.000, which suggests that our result is inconsistent with the null hypothesis that there is no relationship between age and bmi.

That is the basic idea of linear regression. But there are many other numbers in the regression output table, and I've also used the term “null hypothesis” without defining it. Let's dig a little deeper to learn what it all means.

Digging deeper

First, let's define our null and alternative hypotheses. Our null hypothesis is that there is no relationship between age and bmi and our regression line is flat. In other words, our best prediction of bmi is simply the mean of bmi for any value of age. We can use egen to create a variable named bmi_mean containing the mean of bmi and then use twoway line to draw a flat line at the mean of bmi.

. egen double bmi_mean = mean(bmi)

. twoway (scatter bmi age) (line bmi_predicted age) (line bmi_mean age)

Our alternative hypothesis is that there is a relationship between age and bmi and our regression line is not flat, which is what the red line in our graph shows.

We might ask ourselves, “Is the red line really different from the green line?” or “Is the slope of our regression line large enough to describe a meaningful relationship between age and bmi, or is our regression line basically flat?” We used a p-value to answer that question above, but where did the p-value come from?

We can answer this question using the output from our regress command. The output contains a lot of numbers bundled into three groups. The first group at the top left is often called the “ANOVA table”, where ANOVA stands for “ANalysis Of VAriance”. The second group at the top right displays information about the entire model. And the third group at the bottom displays information about the relationship of each predictor variable with the outcome variable.

The ANOVA table

The ANOVA table includes columns labeled “SS”, “df”, and “MS”, which stand for “sum of squares”, “degrees of freedom”, and “mean squares”, respectively. The first column, labeled “Source”, includes rows labeled “Model”, “Residual”, and “Total”. Let's calculate the numbers in the table manually to develop our intuition.

The “SS” in the “Total” row equals 250,024.162, and it is called the “total sum of squares”. It is calculated by adding up the squared vertical distances between each blue dot and the green line in our graph. It quantifies how far the observed values of bmi deviate from the predicted bmi assuming the null hypothesis is true.

. generate double ss_total = (bmi - bmi_mean)^2

. table, statistic(total ss_total) nformat(%16.3f  total)

Total 250024.162

Our calculation is off slightly because of a rounding error. Stata's regress command uses much more accurate algorithms than our simple calculation. But the idea is the same.

Next let's calculate the value of “SS” in the “Residual” row, which equals 242,696.917. It is calculated by adding up the squared vertical distances between each blue dot and the red line in our graph. It quantifies how far the observed values of bmi deviate from the predicted values of bmi assuming the alternative hypothesis is true.

. generate double ss_residual = (bmi - bmi_predicted)^2

. table, statistic(total ss_residual) nformat(%16.3f  total)

Total 242696.917

Finally, let's calculate the value of “SS” in the “Model” row, which equals 7,327.24. It is calculated by adding up the squared vertical distances between the red line and the green line for each blue dot in our graph. It quantifies how far the predicted values of bmi under the alternative hypothesis (the red line) deviate from the predicted values under the null hypothesis (the green line).

. generate double ss_model = (bmi_predicted - bmi_mean)^2

. table, statistic(total ss_model) nformat(%16.3f  total)

Total 7327.244

Note that the sum of ss_model and ss_residual always equals ss_total.

. table, statistic(total ss_model ss_residual ss_total) nformat(%16.3f  total)

ss_model 7327.244
ss_residual 242696.917
ss_total 250024.162

The df column in the ANOVA table displays the degrees of freedom for each source. The Model df equals the number of estimated parameters, k, minus one. We estimated an intercept and a slope for age, so k - 1 = 2 - 1 = 1. The Residual df equals the sample size, n, minus the number of estimated parameters, k. Our sample size is 10,351, and we estimated two parameters, so n - k = 10,351 - 2 = 10,349. The Total df equals the sample size, n, minus one. In our example, n - 1 = 10,351 - 1 = 10,350.
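
As an aside, we do not have to compute the degrees of freedom by hand: regress saves the model and residual degrees of freedom in its stored results e(df_m) and e(df_r), which remain available until another estimation command is run.

. display "Model df = " e(df_m) "   Residual df = " e(df_r)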

The MS column displays the “mean squares” for each source. The MS equals the value of SS divided by df. We can verify this using display.

. display "Model MS = " 7327.24496 / 1
Model MS = 7327.245

. display "Residual MS = " 242696.917 / 10349
Residual MS = 23.451243

. display "Total MS = " 250024.162 / 10350
Total MS = 24.156924

There is some rounding error when we do the calculations by hand, but the numbers are basically the same as the output from regress.

The model statistics

The second group of numbers at the top right of the regression output displays information about the model. The row labeled “Number of obs” tells us that Stata used 10,351 observations to fit the model.

The next row, labeled “F(1, 10349)”, is the F statistic. The F statistic is the ratio of the Model MS to the Residual MS, which, in our example, is 7327.24 / 23.45 = 312.45. The numbers in parentheses are the numerator and denominator degrees of freedom, respectively, also shown in the ANOVA table. The value of the F statistic does not have a direct interpretation, but we can use it to calculate the p-value labeled “Prob > F”. Under the null hypothesis, the F statistic follows a central F distribution, and we can calculate the p-value manually using Stata's Ftail() function.

. display "Prob > F = " Ftail(1, 10349, 312.45)
Prob > F = 6.547e-69

The result, 6.547e-69, is scientific notation for a very small number that is essentially zero. The p-value is used to test the null hypothesis that all slope coefficients (but not the intercept) in the model are simultaneously equal to zero. Our small p-value suggests that our result is inconsistent with the null hypothesis that the age coefficient is zero. Another interpretation is that our model fits the data better than a model that includes only the intercept.
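
As an aside, Stata's test command will reproduce this test after regress; because our model has a single predictor, testing that the age coefficient is zero yields essentially the same F statistic and p-value shown in the output header.

. test age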

We often hear that the residuals in a regression model must be normally distributed. Statistical theory tells us that the square of a normally distributed variable with a mean of zero has a central \(\chi^2\) distribution. Statistical theory also tells us that the ratio of two independent \(\chi^2\) random variables, each divided by its degrees of freedom, has an F distribution. We can square normally distributed residuals to get sums of squares, divide the sums of squares by their degrees of freedom to get mean squares, and take the ratio of the mean squares to get an F statistic. The “normal residuals” assumption allows us to use an F distribution to calculate the p-value for our F statistic.
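
As an aside, we can look at the residuals ourselves. The predict command with the residuals option saves them in a new variable (the name resid below is arbitrary), and graphs such as a histogram with a normal overlay or a normal quantile plot let us judge whether they look roughly normal.

. predict double resid, residuals

. histogram resid, normal

. qnorm resid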

The next number in the table is labeled “R-squared”. The \(R^2\) is the ratio of the Model sum of squares and the Total sum of squares. We can use display to calculate the \(R^2\) manually.

. display "R-squared = " 7327.24496 / 250024.162
R-squared = .02930615

The \(R^2\) is often interpreted as the proportion of variability in the outcome variable that is explained by the model. Using this interpretation, our model explains 2.9% of the variability of bmi.

The next number, labeled “Adj R-squared”, is similar to the \(R^2\) but it is penalized, or reduced, to account for variables in the model that may be unnecessary.
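
One standard formula for the adjusted \(R^2\), which the output itself does not show, is \(1 - (1 - R^2)(n - 1)/(n - k)\). Plugging in our numbers reproduces the 0.0292 in the output, up to rounding.

. display "Adj R-squared = " 1 - (1 - 7327.24496/250024.162)*(10350/10349)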

The next number, labeled “Root MSE”, is the square root of the Residual MS or “mean squared error”. Let's verify this manually.

. display "Root MSE = " sqrt(23.4512433)
Root MSE = 4.8426484

The Root MSE tells us that the observed values of bmi deviate from our regression line by an average of 4.84 points.
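
As an aside, these model statistics are also saved by regress in its stored results, so you can display them directly instead of copying them from the output (again, only until another estimation command replaces them).

. display "F = " e(F) "   R-squared = " e(r2)

. display "Adj R-squared = " e(r2_a) "   Root MSE = " e(rmse)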

The regression statistics

The third group of numbers at the bottom of the regression output displays information about the relationship between each predictor variable and the outcome variable. The row labeled “_cons” tells us about the intercept, but we are usually focused on the slope coefficients.

The column labeled “Coefficient” displays the estimated slope for each variable in the model. The slope coefficient for age equals 0.0488762, and as noted above, we interpret this to mean that each additional year of age is associated with 0.049 more bmi.

Stata also estimates the standard error for each coefficient, and it is displayed in the column labeled “Std. err.”. Different methods can be used to estimate standard errors, and we will discuss them later. By default, regress reports conventional standard errors derived from the ordinary least-squares assumptions.
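
For example, one alternative you will often see is robust standard errors, requested with the vce(robust) option. This is only a syntax preview; running it would refit the model, so we continue with the default standard errors here.

. regress bmi age, vce(robust)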

The next column, labeled “t”, is the t statistic, which is used to test the null hypothesis that the coefficient equals zero. The t statistic is the difference between the estimated coefficient and the value of the coefficient under the null hypothesis (zero), divided by the standard error. In practice, we simply divide the estimated coefficient by the standard error because the hypothesized value is zero.

. display "t = " (0.0488762 - 0) / .0027651
t = 17.676106

. display "t = " 0.0488762  / .0027651
t = 17.676106

The next column, labeled “P>|t|”, is the p-value for the t statistic. We can use Stata's ttail() function to calculate the p-value. The first argument in the ttail() function is the degrees of freedom; strictly speaking, this is the residual degrees of freedom, n - k = 10,351 - 2 = 10,349, although with a sample this large, using n - 1 = 10,350 as we do below changes the result only negligibly. The second argument is the absolute value of the t statistic, which we can obtain with the abs() function. We must multiply the result by two because we are calculating a two-sided p-value.

. display "P>|t| = " 2*ttail(10350, abs(17.68))
P>|t| = 6.136e-69

The result is, again, scientific notation for a very small number that is effectively zero. This suggests that we should reject the null hypothesis that the coefficient equals zero.

The next two columns, labeled “[95% conf. interval]”, are the lower and upper bounds of the 95% confidence interval. Let's calculate them manually to develop our intuition. We will need the “critical value” of a t distribution that corresponds to a two-sided hypothesis test with an alpha level of 0.05. We can use Stata's invt() function to calculate this number. The first argument is the degrees of freedom (strictly the residual degrees of freedom, 10,349, although using 10,350 as below gives the same answer to the precision shown). The second argument is 1 - alpha/2 = 1 - 0.05/2 = 0.975. We multiply the resulting critical value by the standard error and subtract it from the coefficient to estimate the lower bound.

. display "Lower 95% CI bound = " .0488762 - invt(10350, 0.975) * .0027651
Lower 95% CI bound = .04345607

We multiply the critical value by the standard error and add it to the coefficient to estimate the upper bound.

. display "Upper 95% CI bound = " .0488762 + invt(10350, 0.975) * .0027651
Upper 95% CI bound = .05429633
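
As an aside, you do not need to compute intervals at other confidence levels by hand; the level() option requests a different confidence level when fitting (or refitting) the model.

. regress bmi age, level(99)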

It is tempting to interpret the 95% confidence interval as a 95% probability that the true parameter lies inside this particular interval, but that is not correct. The proper interpretation is that, if we repeated our sampling many times and computed an interval from each sample, about 95% of those intervals would include the true population parameter.

Let's interpret the results of our regression model now that we understand the numbers in our output and how to calculate them.

Interpreting the model

Our model examined the relationship between the outcome variable bmi and the predictor variable age. The model included 10,351 observations, explained roughly 3% of the variability of bmi, and fit the data better than a model that includes only the intercept (F(1, 10349) = 312.45, p < 0.001). The results suggest that each additional year of age is associated with 0.049 more bmi (t = 17.68, df = 10,350, p < 0.001).

There are many other options that you can use with regress, and you can read about them in the manual. You can also watch a demonstration of these commands by clicking on the link to the YouTube video below.

See it in action

Watch Simple linear regression in Stata.