
Linear regression is a popular tool used to quantify the relationship between a continuous outcome variable and one or more predictor variables. The outcome variable is often called the “dependent” variable, and the predictor variables are often called “independent” variables. The predictor variables can be binary, categorical, or continuous. Here we will learn how to use Stata's regress command to fit simple linear regression models, and we will explore more sophisticated features later.

Let's begin by opening the nhanes2l dataset.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

Simple linear regression is often used to explore the linear relationship between two continuous variables. It can be useful to create a scatterplot to examine the relationship between the variables before fitting a linear regression model. Let's create a scatterplot for body mass index (bmi) and age.

. twoway scatter bmi age

We will use linear regression to identify a straight line that best describes the cloud of points in the scatterplot. “Best describes” is defined to be the line that has the smallest sum of squared vertical differences between each observation and the line. The line is defined by the equation below, and we will use our data to estimate the intercept and slope.

bmi = intercept + slope*age

Let's estimate the intercept and slope by typing regress followed by the outcome variable bmi followed by the predictor variable age.

. regress bmi age

      Source |       SS           df       MS      Number of obs   =    10,351
-------------+----------------------------------   F(1, 10349)     =    312.45
       Model |  7327.24496         1  7327.24496   Prob > F        =    0.0000
    Residual |  242696.917    10,349  23.4512433   R-squared       =    0.0293
-------------+----------------------------------   Adj R-squared   =    0.0292
       Total |  250024.162    10,350  24.1569239   Root MSE        =    4.8426

------------------------------------------------------------------------------
         bmi | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0488762   .0027651    17.68   0.000     .0434561    .0542963
       _cons |   23.21209   .1399079   165.91   0.000     22.93784    23.48633
------------------------------------------------------------------------------

The intercept is the coefficient labeled “_cons”, and the slope is the coefficient labeled “age”.

Let's use generate to create a new variable named bmi_predicted, which defines the regression line using the intercept and slope from our output.

. generate double bmi_predicted = 23.21209 + 0.0488762 * age
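
As an aside, we could avoid retyping the rounded coefficients: after regress, the estimates are stored in _b[_cons] and _b[age], and the predict command computes the fitted values directly. The variable names bmi_predicted2 and bmi_fitted below are arbitrary.

. generate double bmi_predicted2 = _b[_cons] + _b[age]*age

. predict double bmi_fitted, xb

Both new variables match bmi_predicted except for the rounding of the coefficients we typed by hand.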

Next let's use twoway line to plot our regression line with our scatterplot.

. twoway (scatter bmi age) (line bmi_predicted age)

Our regression line has a slight upward slope, which also suggests that larger values of age tend to be associated with larger values of bmi. The slope tells us that each additional year of age is associated with 0.049 more bmi. In the regression table, the p-value for the estimated age coefficient is labeled “P>|t|” and tests the null hypothesis that the coefficient equals zero (no relationship). The p-value equals 0.000, which suggests that our result is inconsistent with the null hypothesis that there is no relationship between age and bmi.

That is the basic idea of linear regression. But there are many other numbers in the regression output table, and I've also used the term “null hypothesis” without defining it. Let's dig a little deeper to learn what it all means.

Digging deeper

First, let's define our null and alternative hypotheses. Our null hypothesis is that there is no relationship between age and bmi and our regression line is flat. In other words, our best prediction of bmi is simply the mean of bmi for any value of age. We can use egen to create a variable named bmi_mean containing the mean of bmi and then use twoway line to draw a flat line at the mean of bmi.

. egen double bmi_mean = mean(bmi)

. twoway (scatter bmi age) (line bmi_predicted age) (line bmi_mean age)

Our alternative hypothesis is that there is a relationship between age and bmi and our regression line is not flat, which is what the red line in our graph shows.

We might ask ourselves, “Is the red line really different from the green line?” or “Is the slope of our regression line large enough to describe a meaningful relationship between age and bmi, or is our regression line basically flat?” We used a p-value to answer that question above, but where did the p-value come from?

We can answer this question using the output from our regress command. The output contains a lot of numbers bundled into three groups. The first group at the top left is often called the “ANOVA table”, where ANOVA stands for “ANalysis Of VAriance”. The second group at the top right displays information about the entire model. And the third group at the bottom displays information about the relationship of each predictor variable with the outcome variable.

The ANOVA table

The ANOVA table includes columns labeled “SS”, “df”, and “MS”, which stand for “sum of squares”, “degrees of freedom”, and “mean squares”, respectively. The first column, labeled “Source”, includes rows labeled “Model”, “Residual”, and “Total”. Let's calculate the numbers in the table manually to develop our intuition.

The “SS” in the “Total” row equals 250,024.162, and it is called the “total sum of squares”. It is calculated by adding up the squared vertical distances between each blue dot and the green line in our graph. It quantifies how far the observed values of bmi deviate from the predicted bmi assuming the null hypothesis is true.

. generate double ss_total = (bmi - bmi_mean)^2

. table, statistic(total ss_total) nformat(%16.3f  total)

Total 250024.162

Our calculation is off slightly because of a rounding error. Stata's regress command uses much more accurate algorithms than our simple calculation. But the idea is the same.

Next let's calculate the value of “SS” in the “Residual” row, which equals 242,696.917. It is calculated by adding up the squared vertical distances between each blue dot and the red line in our graph. It quantifies how far the observed values of bmi deviate from the predicted values of bmi assuming the alternative hypothesis is true.

. generate double ss_residual = (bmi - bmi_predicted)^2

. table, statistic(total ss_residual) nformat(%16.3f  total)

Total 242696.917

Finally, let's calculate the value of “SS” in the “Model” row, which equals 7,327.24. It is calculated by adding up the squared vertical distances between the red line and the green line for each blue dot in our graph. It quantifies how far the predicted values of bmi under the alternative hypothesis (the red line) deviate from the predicted values under the null hypothesis (the green line).

. generate double ss_model = (bmi_predicted - bmi_mean)^2

. table, statistic(total ss_model) nformat(%16.3f  total)

Total 7327.244

Note that the sum of ss_model and ss_residual always equals ss_total.

. table, statistic(total ss_model ss_residual ss_total) nformat(%16.3f  total)

ss_model 7327.244
ss_residual 242696.917
ss_total 250024.162

The df column in the ANOVA table displays the degrees of freedom for each source. The Model df equals the number of estimated parameters, k, minus one. We estimated an intercept and a slope for age, so k - 1 = 2 - 1 = 1. The Residual df equals the sample size, n, minus the number of estimated parameters, k. Our sample size is 10,351, and we estimated two parameters, so n - k = 10,351 - 2 = 10,349. The Total df equals the sample size, n, minus one. In our example, n - 1 = 10,351 - 1 = 10,350.
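
As an aside, we do not have to compute the degrees of freedom by hand: regress saves the model and residual degrees of freedom in its stored results e(df_m) and e(df_r), which remain available until another estimation command is run.

. display "Model df = " e(df_m) "   Residual df = " e(df_r)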

The MS column displays the “mean squares” for each source. The MS equals the value of SS divided by df. We can verify this using display.

. display "Model MS = " 7327.24496 / 1
Model MS = 7327.245

. display "Residual MS = " 242696.917 / 10349
Residual MS = 23.451243

. display "Total MS = " 250024.162 / 10350
Total MS = 24.156924

There is some rounding error when we do the calculations by hand, but the numbers are basically the same as the output from regress.

The model statistics

The second group of numbers at the top right of the regression output displays information about the model. The row labeled “Number of obs” tells us that Stata used 10,351 observations to fit the model.

The next row, labeled “F(1, 10349)”, is the F statistic. The F statistic is the ratio of the Model MS to the Residual MS, which, in our example, is 7327.24 / 23.45 = 312.45. The numbers in parentheses are the numerator and denominator degrees of freedom, respectively, also shown in the ANOVA table. The value of the F statistic does not have a direct interpretation, but we can use it to calculate the p-value labeled “Prob > F”. Under the null hypothesis, the F statistic follows a central F distribution, and we can calculate the p-value manually using Stata's Ftail() function.

. display "Prob > F = " Ftail(1, 10349, 312.45)
Prob > F = 6.547e-69

The result, 6.547e-69, is scientific notation for a very small number that is essentially zero. The p-value is used to test the null hypothesis that all slope coefficients (but not the intercept) in the model are simultaneously equal to zero. Our small p-value suggests that our result is inconsistent with the null hypothesis that the age coefficient is zero. Another interpretation is that our model fits the data better than a model that includes only the intercept.
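
As an aside, Stata's test command will reproduce this test after regress; because our model has a single predictor, testing that the age coefficient is zero yields essentially the same F statistic and p-value shown in the output header.

. test age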

We often hear that the residuals in a regression model must be normally distributed. Statistical theory tells us that the square of a normally distributed variable with a mean of zero has a central \(\chi^2\) distribution. Statistical theory also tells us that the ratio of two independent \(\chi^2\) random variables, each divided by its degrees of freedom, has an F distribution. We can square normally distributed residuals to get sums of squares, divide the sums of squares by their degrees of freedom to get mean squares, and take the ratio of the mean squares to get an F statistic. The “normal residuals” assumption allows us to use an F distribution to calculate the p-value for our F statistic.
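
As an aside, we can look at the residuals ourselves. The predict command with the residuals option saves them in a new variable (the name resid below is arbitrary), and graphs such as a histogram with a normal overlay or a normal quantile plot let us judge whether they look roughly normal.

. predict double resid, residuals

. histogram resid, normal

. qnorm resid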

The next number in the table is labeled “R-squared”. The \(R^2\) is the ratio of the Model sum of squares and the Total sum of squares. We can use display to calculate the \(R^2\) manually.

. display "R-squared = " 7327.24496 / 250024.162
R-squared = .02930615

The \(R^2\) is often interpreted as the proportion of variability in the outcome variable that is explained by the model. Using this interpretation, our model explains 2.9% of the variability of bmi.

The next number, labeled “Adj R-squared”, is similar to the \(R^2\) but it is penalized, or reduced, to account for variables in the model that may be unnecessary.
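
One standard formula for the adjusted \(R^2\), which the output itself does not show, is \(1 - (1 - R^2)(n - 1)/(n - k)\). Plugging in our numbers reproduces the 0.0292 in the output, up to rounding.

. display "Adj R-squared = " 1 - (1 - 7327.24496/250024.162)*(10350/10349)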

The next number, labeled “Root MSE”, is the square root of the Residual MS or “mean squared error”. Let's verify this manually.

. display "Root MSE = " sqrt(23.4512433)
Root MSE = 4.8426484

The Root MSE tells us that the observed values of bmi deviate from our regression line by an average of 4.84 points.
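
As an aside, these model statistics are also saved by regress in its stored results, so you can display them directly instead of copying them from the output (again, only until another estimation command replaces them).

. display "F = " e(F) "   R-squared = " e(r2)

. display "Adj R-squared = " e(r2_a) "   Root MSE = " e(rmse)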

The regression statistics

The third group of numbers at the bottom of the regression output displays information about the relationship between each predictor variable and the outcome variable. The row labeled “_cons” tells us about the intercept, but we are usually focused on the slope coefficients.

The column labeled “Coefficient” displays the estimated slope for each variable in the model. The slope coefficient for age equals 0.0488762, and as noted above, we interpret this to mean that each additional year of age is associated with 0.049 more bmi.

Stata also estimates the standard error for each coefficient, and it is displayed in the column labeled “Std. err.”. Different methods can be used to estimate standard errors, and we will discuss them later. By default, regress reports conventional standard errors derived from the ordinary least-squares assumptions.
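
For example, one alternative you will often see is robust standard errors, requested with the vce(robust) option. This is only a syntax preview; running it would refit the model, so we continue with the default standard errors here.

. regress bmi age, vce(robust)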

The next column, labeled “t”, is the t statistic, which is used to test the null hypothesis that the coefficient equals zero. The t statistic is the difference between the estimated coefficient and the value of the coefficient under the null hypothesis (zero), divided by the standard error. In practice, we simply divide the estimated coefficient by the standard error because the hypothesized value is zero.

. display "t = " (0.0488762 - 0) / .0027651
t = 17.676106

. display "t = " 0.0488762  / .0027651
t = 17.676106

The next column, labeled “P>|t|”, is the p-value for the t statistic. We can use Stata's ttail() function to calculate the p-value. The first argument in the ttail() function is the degrees of freedom; strictly speaking, this is the residual degrees of freedom, n - k = 10,351 - 2 = 10,349, although with a sample this large, using n - 1 = 10,350 as we do below changes the result only negligibly. The second argument is the absolute value of the t statistic, which we can obtain with the abs() function. We must multiply the result by two because we are calculating a two-sided p-value.

. display "P>|t| = " 2*ttail(10350, abs(17.68))
P>|t| = 6.136e-69

The result is, again, scientific notation for a very small number that is effectively zero. This suggests that we should reject the null hypothesis that the coefficient equals zero.

The next two columns, labeled “[95% conf. interval]”, are the lower and upper bounds of the 95% confidence interval. Let's calculate them manually to develop our intuition. We will need the “critical value” of a t distribution that corresponds to a two-sided hypothesis test with an alpha level of 0.05. We can use Stata's invt() function to calculate this number. The first argument is the degrees of freedom (strictly the residual degrees of freedom, 10,349, although using 10,350 as below gives the same answer to the precision shown). The second argument is 1 - alpha/2 = 1 - 0.05/2 = 0.975. We multiply the resulting critical value by the standard error and subtract it from the coefficient to estimate the lower bound.

. display "Lower 95% CI bound = " .0488762 - invt(10350, 0.975) * .0027651
Lower 95% CI bound = .04345607

We multiply the critical value by the standard error and add it to the coefficient to estimate the upper bound.

. display "Upper 95% CI bound = " .0488762 + invt(10350, 0.975) * .0027651
Upper 95% CI bound = .05429633
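
As an aside, you do not need to compute intervals at other confidence levels by hand; the level() option requests a different confidence level when fitting (or refitting) the model.

. regress bmi age, level(99)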

It is tempting to interpret the 95% confidence interval as a 95% probability that the true parameter lies inside this particular interval, but that is not correct. The proper interpretation is that, if we repeated our sampling many times and computed an interval from each sample, about 95% of those intervals would include the true population parameter.

Let's interpret the results of our regression model now that we understand the numbers in our output and how to calculate them.

Interpreting the model

Our model examined the relationship between the outcome variable bmi and the predictor variable age. The model included 10,351 observations, explained roughly 3% of the variability of bmi, and fit the data better than a model that includes only the intercept (F(1, 10349) = 312.45, p < 0.001). The results suggest that each additional year of age is associated with 0.049 more bmi (t = 17.68, df = 10,350, p < 0.001).

There are many other options that you can use with regress, and you can read about them in the manual. You can also watch a demonstration of these commands by clicking on the link to the YouTube video below.

See it in action

Watch Simple linear regression in Stata.