
st: Regression diagnostics after -xtreg-

From   "Tobias Pfaff" <>
To   <>
Subject   st: Regression diagnostics after -xtreg-
Date   Wed, 7 Nov 2012 11:53:18 +0100

Dear all,

Some time ago I had posted a question on how to do regression diagnostics
after -xtreg-. I never received an answer and tried to come up with solutions
myself.

In the meantime I have received a few private messages asking how I
eventually proceeded. I post the code that I used below.
Beware that I cannot guarantee that the code is correct! I post it here for
discussion in case it helps anybody.



      Without verifying that your data have met the assumptions underlying
OLS regression, your results may be misleading.
      In particular, we will consider the following assumptions.

            > Linearity - the relationships between the predictors and the
outcome variable should be linear
            > Normality - the errors should be normally distributed -
technically normality is necessary only for hypothesis tests to be valid,
              estimation of the coefficients only requires that the errors
be identically and independently distributed
            > Homogeneity of variance (homoscedasticity) - the error
variance should be constant
            > Independence - the errors associated with one observation are
not correlated with the errors of any other observation
            > Errors in variables - predictor variables are measured without
error (we will cover this in Chapter 4)
            > Model specification - the model should be properly specified
(including all relevant variables, and excluding irrelevant variables)

      Additionally, there are issues that can arise during the analysis
that, while strictly speaking are not assumptions of regression,
      are nonetheless of great concern to data analysts.

            > Influence - individual observations that exert undue influence
on the coefficients
            > Collinearity - predictors that are highly collinear, i.e.,
linearly related, can cause problems in estimating the regression


** 1. Unusual and influential data

      A single observation that is substantially different from all other
observations can make a large difference in the results
      of your regression analysis.  If a single observation (or small group
of observations) substantially changes your results,
      you would want to know about this and investigate further.  There are
three ways that an observation can be unusual.
      Outliers: In linear regression, an outlier is an observation with
large residual. In other words, it is an observation whose
      dependent-variable value is unusual given its values on the predictor
variables. An outlier may indicate a sample peculiarity
      or may indicate a data entry error or other problem.

      Leverage: An observation with an extreme value on a predictor variable
is called a point with high leverage. Leverage is a
      measure of how far an observation deviates from the mean of that
variable. These leverage points can have an effect on the
      estimate of regression coefficients.

      Influence: An observation is said to be influential if removing the
observation substantially changes the estimate of coefficients.
      Influence can be thought of as the product of leverage and
outlierness.
// Added-variable plots
display _n in yellow "Unusual and influential data: Added-variable plots"

foreach var in $indepvars {

      // Don't create plot for categorical variables
      if (substr("`var'",1,2) == "i.") continue
      quietly {
            local RHS = subinword("$indepvars", "`var'", "", 1)
            di _n in yellow "var: `var'"
            di _n in yellow "Before: $indepvars"
            di _n in yellow "After: `RHS'"
            di _n in yellow "depvar: $depvar"
            // Step 1: Regression of y on all independent variables without `var'
            areg $depvar `RHS' [aw = phrf], absorb(pid) vce(cluster region)
            // Generate residuals (the local `part' is set elsewhere in the full script)
            predict res_1_`part' if e(sample), residuals
            // Step 2: Regression of `var' on all other independent variables
            areg `var' `RHS' [aw = phrf], absorb(pid) vce(cluster region)
            // Generate residuals
            predict res_2_`part' if e(sample), residuals
            // Regress both residuals on each other to obtain the statistics displayed in the added-variable plot
            regress res_1_`part' res_2_`part'
            local coef = round(_b[res_2_`part'], .0001)
            local se = round(_se[res_2_`part'], .0001)
            local t = round(_b[res_2_`part']/_se[res_2_`part'], .01)
      }
      twoway scatter res_1_`part' res_2_`part', name(g1, replace) msize(tiny) legend(off) ///
            ytitle("e(life_sat | X)") xtitle("e(`var' | X)") ///
            note("coef = `coef', se = `se', t = `t'") || ///
            lfit res_1_`part' res_2_`part' // mlabel(pid)
      // Conversion of global to local since -graph export- cannot read globals
      local depvar = "$depvar"
      local model = "$model"
      local path = "$path"
      graph export Graphs/`path'/avplot_m_`model'_`part'_`depvar'_`var'.png, replace
      drop res_*
      graph drop _all
}

      How added-variable plots should be interpreted...
      The horizontal variable in the added-variable plot is the residual
from the regression
      of the respective variable on all other independent variables (X).
Thus, data points
      going to the far left/right mean that the value of the respective
variable is unusually
      low/high given all other independent variables (X).
      Unusual data points can be marked by adding -mlabel(pid)- to the
-scatter- command.
** 2. Checking normality of residuals

      Many researchers believe that multiple regression requires normality.
This is not the case. Normality of residuals is
      only required for valid hypothesis testing, that is, the normality
assumption assures that the p-values for the t-tests
      and F-test will be valid. Normality is not required in order to obtain
unbiased estimates of the regression coefficients.
      OLS regression merely requires that the residuals (errors) be
identically and independently distributed. Furthermore, there
      is no assumption or requirement that the predictor variables be
normally distributed. If this were the case, then we would
      not be able to use dummy coded variables in our models.

// Graphical analysis
kdensity residuals, normal name(g1, replace)
graph box residuals, name(g2, replace)
symplot residuals, name(g3, replace)
qnorm residuals, name(g4, replace)
pnorm residuals, name(g5, replace)
      The pnorm command graphs a standardized normal probability (P-P) plot
while qnorm plots the quantiles of a variable
      against the quantiles of a normal distribution. pnorm is sensitive to
non-normality in the middle range of data and
      qnorm is sensitive to non-normality near the tails.

graph combine g1 g2 g3 g4 g5, name(gcomb2, replace)

// Conversion of global to local since -graph export- cannot read globals
local depvar = "$depvar"
local model = "$model"
local path = "$path"
graph export Graphs/`path'/normality_residuals_m_`model'_`part'_`depvar'.png, replace
graph drop _all

// Formal tests
iqr residuals
jb residuals
sktest residuals //swilk can only be used with less than 2000 obs. and
sfrancia with less than 5000 obs.

      iqr stands for inter-quartile range and assumes the symmetry of the
distribution. Severe outliers consist of
      those points that are either 3 inter-quartile-ranges below the first
quartile or 3 inter-quartile-ranges
      above the third quartile. The presence of any severe outliers should
be sufficient evidence to reject normality
      at a 5% significance level. Mild outliers are common in samples of any
size.
      The p-value of the sktest is based on the assumption that the
distribution is normal. For example, a p-value of .15
      indicates that we cannot reject that the residuals are normally
distributed at the 5 percent level.
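The 3-IQR severe-outlier rule described above can be sketched in a few lines of Python (illustrative made-up data; note that Stata's -iqr- may compute quartiles slightly differently):

```python
# Sketch of the severe-outlier rule: points more than 3 inter-quartile
# ranges below Q1 or above Q3. Illustrative only.
import statistics

def severe_outliers(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 3 * iqr, q3 + 3 * iqr
    return [x for x in xs if x < lo or x > hi]

sample = [4.1, 4.6, 5.0, 5.2, 5.4, 5.9, 6.3, 6.8, 40.0]
print(severe_outliers(sample))   # only the extreme point is flagged
```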

// Test for normality with pantest2
//pantest2 svyyear

** 3. Checking homoscedasticity

// residual vs. fitted plot
display _n in yellow "Homoscedasticity: Residual-vs-fitted-plot"
scatter residuals yhat, name(g1) msize(tiny) || mband residuals yhat, ///
      bands(50) clp(solid)
// Conversion of global to local since -graph export- cannot read globals
local depvar = "$depvar"
local model = "$model"
local path = "$path"
graph export Graphs/`path'/rvfplot_m_`model'_`part'_`depvar'.png, replace
graph drop _all

      Since we use robust standard errors with -vce(cluster)- we don't need
to worry about heteroscedasticity ;)

** 4. Checking for multicollinearity

display _n in yellow "Multicollinearity (m_$model, $depvar): VIF"

// Delete factor variables and time-series operators since they don't work with -collin-
global indepvars_new = "$indepvars"

foreach var in $indepvars {
      //display in yellow "Before: $indepvars_new"
      if (substr("`var'",1,2) == "i.") global indepvars_new = subinword("$indepvars_new", "`var'", "", 1)
      //display in yellow "After: $indepvars_new"
}

collin $indepvars_new


      "As I understand it, multicollinearity on the right-hand side
      is much the same irrespective of what is on the left-hand
      side or what link function is contemplated. So I don't
      see that you are obliged to do it retrospectively."

      The term collinearity implies that two variables are near perfect
linear combinations of one another.
      When more than two variables are involved it is often called
multicollinearity, although the two terms
      are often used interchangeably. The primary concern is that as the
degree of multicollinearity increases,
      the regression model estimates of the coefficients become unstable and
the standard errors for the
      coefficients can get wildly inflated.
      We can use the vif command after the regression to check for
multicollinearity. vif stands for variance
      inflation factor. As a rule of thumb, a variable whose VIF values are
greater than 10 may merit further
      investigation. Tolerance, defined as 1/VIF, is used by many
researchers to check on the degree of collinearity.
      A tolerance value lower than 0.1 is comparable to a VIF of 10. It
means that the variable could be considered
      as a linear combination of other independent variables.
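The VIF formula described here is simple enough to sketch directly; with only two predictors, the auxiliary R-squared reduces to the squared correlation. An illustrative Python toy (made-up data, not from the post):

```python
# Sketch of the VIF/tolerance rule of thumb: VIF_j = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on the other predictors.
# With two predictors, R^2_j = r^2. Illustrative data only.

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 2.9, 4.2, 4.8]   # nearly a linear copy of x1

r = corr(x1, x2)
vif = 1.0 / (1.0 - r * r)
tolerance = 1.0 / vif
print(vif, tolerance)             # VIF far above 10, tolerance below 0.1
```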

** 5. Checking linearity

      The most straightforward thing to do is to plot the standardized
residuals against each of the predictor
      variables in the regression model. If there is a clear nonlinear
pattern, there is a problem of nonlinearity.
      acprplot graphs an augmented component-plus-residual plot, a.k.a.
augmented partial residual plot.
      It can be used to identify nonlinearities in the data.
      Often poorly chosen functional forms show themselves up as patterns on
a plot of residuals versus predicted. Conversely, a satisfactory
      functional form shows up, after some informal or formal smooth, as a
      flat trace on such plots. Naturally, you should check other aspects
      too, e.g. variance functions.
      I see no reason why this should not apply in broad terms to panel data
      too. A disadvantage of this is that it does not produce a P-value or
      sharp test decision as you evidently prefer. An advantage is that it is
      flexible and allows you to apply scientific as well as statistical criteria.

      If you want more formal tests then it really helps if you have an
      alternative, for instance a quadratic relationship, or a spline. If the
      former is the case I would first use -orthpoly- to create orthogonal
      polynomials; that way the significance of the quadratic term can be
      interpreted as a test of whether the quadratic term adds anything. If
      your alternative is a spline, I would use the marginal option. That way
      the second term measures how much the effect changes before and after
      the knot, and the significance of that term tells you if that's
      significant. Finally, you could look at -estat ovtest-.
      However, I think graphs are the best way to find non-linearities and
      assess whether they are big enough for you to worry about. I would
      use tests if I had a specific hypothesis about that non-linearity.
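The "marginal" spline parameterisation mentioned above can be illustrated outside Stata: coding the second regressor as max(0, x - knot) makes its coefficient the change in slope at the knot. A Python toy with exactly piecewise-linear made-up data (not the poster's code):

```python
# With data that is exactly piecewise linear (slope 1 before the knot,
# slope 3 after it), the marginal spline term would estimate the slope
# change 3 - 1 = 2. Illustrative sketch, standard library only.

def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

knot = 5.0
xs = [float(v) for v in range(11)]                            # x = 0..10
ys = [x if x <= knot else knot + 3 * (x - knot) for x in xs]  # slopes 1, 3

before = [(x, y) for x, y in zip(xs, ys) if x <= knot]
after  = [(x, y) for x, y in zip(xs, ys) if x >= knot]
b1 = slope(*zip(*before))
b2 = slope(*zip(*after))
print(b1, b2, b2 - b1)   # the marginal spline term corresponds to b2 - b1
```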

      // component-plus-residual-plot (partial residual plot)
      display _n in yellow "Linearity: Component-plus-residual-plot"

      foreach var in $indepvars {
            // don't create plot for categorical variables
            if (substr("`var'",1,2) == "i.") continue
            gen residuals_plus = residuals + _b[`var']*`var' // see Kohler & Kreuter (2008, p. 214)
            scatter residuals_plus `var', name(g1) msize(tiny) || mband residuals_plus `var', bands(50)
            // Conversion of global to local since -graph export- cannot read globals
            local depvar = "$depvar"
            local model = "$model"
            local path = "$path"
            graph export Graphs/`path'/cprplot_m_`model'_`part'_`depvar'_`var'.png, replace
            //scatter residuals_plus `var', name(g2) msize(tiny) || lowess residuals_plus `var', sort yvarlabel("Lowess smooth") // computation of the lowess smooth takes ages
            drop residuals_plus
            graph drop _all
      }

// augmented-component-plus-residual-plot
display _n in yellow "Linearity: Augmented-component-plus-residual-plot"

foreach var in $indepvars {
      // don't create plot for categorical variables
      if (substr("`var'",1,2) == "i.") continue

      gen sq = `var'^2
      areg $depvar $indepvars sq, absorb(pid) vce(cluster region)
      // note: -residuals- were predicted from the original model, while the coefficients come from the augmented regression
      gen residuals_augm = residuals + _b[`var']*`var' + _b[sq]*sq
      scatter residuals_augm `var', name(g1) msize(tiny) || mband residuals_augm `var', bands(50) || lfit residuals_augm `var', lstyle(refline) legend(off)
      // Conversion of global to local since -graph export- cannot read globals
      local depvar = "$depvar"
      local model = "$model"
      local path = "$path"
      graph export Graphs/`path'/acprplot_m_`model'_`part'_`depvar'_`var'.png, replace
      drop sq residuals_augm
      graph drop _all
}

** 6. Checking model specification

      A model specification error can occur when one or more relevant
variables are omitted from the model
      or one or more irrelevant variables are included in the model. If
relevant variables are omitted from
      the model, the common variance they share with included variables may
be wrongly attributed to those
      variables, and the error term is inflated. On the other hand, if
irrelevant variables are included in
      the model, the common variance they share with included variables may
be wrongly attributed to them.
      Model specification errors can substantially affect the estimate of
regression coefficients.
      The linktest command performs a model specification link test for
single-equation models.
      linktest is based on the idea that if a regression is properly
specified, one should not be able to
      find any additional independent variables that are significant except
by chance. linktest creates two
      new variables, the variable of prediction, _hat, and the variable of
squared prediction, _hatsq. The
      model is then refit using these two variables as predictors. _hat
should be significant since it is
      the predicted value. On the other hand, _hatsq shouldn't, because if
our model is specified correctly,
      the squared predictions should not have much explanatory power. That
is, we wouldn't expect _hatsq to
      be a significant predictor if our model is specified correctly. So we
will be looking at the p-value
      for _hatsq.
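The idea behind -linktest- can be illustrated outside Stata as well. In this Python toy (made-up quadratic data, not the poster's code), a straight-line fit leaves residuals that still co-move with the squared predictions, which is the signal _hatsq picks up:

```python
# Sketch of the -linktest- logic: if the model is right, the squared
# prediction should explain nothing beyond the prediction itself. Here a
# quadratic outcome is fit with a straight line, and the residuals still
# correlate with the squared fitted values. Illustrative data only.

def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

xs = [float(v) for v in range(1, 11)]
ys = [x * x for x in xs]                 # truly quadratic outcome

a, b = fit(xs, ys)                       # misspecified linear fit
yhat = [a + b * x for x in xs]
res = [y - h for y, h in zip(ys, yhat)]
hatsq = [h * h for h in yhat]
print(corr(res, hatsq))                  # clearly nonzero -> misspecification
```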

      We find evidence that our model is misspecified.
      The coefficient for _hatsq after -linktest- is significant. And the
RESET(4) also rejects the H0.

display _n in yellow "Model specification: linktest"

gen yhat_sq = yhat^2

areg $depvar yhat yhat_sq [aw = phrf], absorb(pid) vce(cluster region)
local coeff_yhat_sq = _b[yhat_sq]
display _n in yellow "p-value for yhat_sq should be insignificant!"

drop yhat_sq

** 7. Checking issues of independence

      The statement of this assumption that the errors associated with one
observation are not correlated with
      the errors of any other observation covers several different
situations. Consider the case of collecting
      data from students in eight different elementary schools. It is likely
that the students within each
      school will tend to be more like one another than students from
different schools, that is, their errors
      are not independent. We will deal with this type of situation (...)
when we demonstrate the (...) cluster option.

      Another way in which the assumption of independence can be broken is
when data are collected on the same
      variables over time. Let's say that we collect truancy data every
semester for 12 years. In this situation
      it is likely that the errors for observation between adjacent
semesters will be more highly correlated than
      for observations more separated in time. This is known as
autocorrelation.
      Concerning autocorrelation, -xtserial- shows evidence for first-order
autocorrelation.
      Note that if the panel identifier (e.g. individuals, firms, or
countries) is the cluster() variable,
      then Rogers standard errors are heteroscedasticity- and
autocorrelation-consistent.
      => Do the estimations again with vce(cluster pid) and check if the
results change.
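What -xtserial- is looking for is first-order correlation in the errors. As a language-agnostic illustration (a single simulated AR(1) series, not panel data, and not the poster's code), the lag-1 correlation of serially dependent errors is clearly nonzero:

```python
# Illustrative sketch: errors generated with an AR(1) structure
# (rho = 0.7) show a lag-1 correlation close to rho.
import random

def lag1_corr(e):
    a, b = e[:-1], e[1:]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

random.seed(1)
e = [random.gauss(0, 1)]
for _ in range(499):
    e.append(0.7 * e[-1] + random.gauss(0, 1))   # AR(1) with rho = 0.7

print(lag1_corr(e))   # typically close to 0.7
```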

xi: xtserial $depvar $indepvars

// Estimate with Driscoll-Kraay standard errors (-xtscc-) and report differences
xi: xtscc $depvar $indepvars, fe

      Further commands for treating autocorrelation:
      newey2 -> not with fixed effects

** 8. Checking measurement error

// ??

** 9. Checking exogeneity of variables

// Hausman tests!?
