Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
"Tobias Pfaff" <tobias.pfaff@uni-muenster.de> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
st: Regression diagnostics after -xtreg- |

Date |
Wed, 7 Nov 2012 11:53:18 +0100 |

Dear all, Some time ago I had posted a question on how to do regression diagnostics after -xtreg- (http://www.stata.com/statalist/archive/2011-08/msg01063.html). I never received an answer and tried to come up with solutions myself. In the meantime I have received a few private messages on how I eventually proceeded. I post the code that I used below. Beware that I cannot guarantee that the code is correct! I just want to post it for discussion in case it helps anybody. Cheers, Tobias Code: /* Notes: Without verifying that your data have met the assumptions underlying OLS regression, your results may be misleading. In particular, we will consider the following assumptions. > Linearity - the relationships between the predictors and the outcome variable should be linear > Normality - the errors should be normally distributed - technically normality is necessary only for hypothesis tests to be valid, estimation of the coefficients only requires that the errors be identically and independently distributed > Homogeneity of variance (homoscedasticity) - the error variance should be constant > Independence - the errors associated with one observation are not correlated with the errors of any other observation (e.g., autocorrelation) > Errors in variables - predictor variables are measured without error (we will cover this in Chapter 4) > Model specification - the model should be properly specified (including all relevant variables, and excluding irrelevant variables) Additionally, there are issues that can arise during the analysis that, while strictly speaking are not assumptions of regression, are none the less, of great concern to data analysts. > Influence - individual observations that exert undue influence on the coefficients > Collinearity - predictors that are highly collinear, i.e., linearly related, can cause problems in estimating the regression coefficients. [http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm] */ *********************************************** ** 1. Unusual and influential data /* Notes: A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis. If a single observation (or small group of observations) substantially changes your results, you would want to know about this and investigate further. There are three ways that an observation can be unusual. Outliers: In linear regression, an outlier is an observation with large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem. Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points can have an effect on the estimate of regression coefficients. Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness. */ // Added-variable plots display _n in yellow "Unusual and influential data: Added-variable plots" foreach var in $indepvars { // Don't create plot for categorical variables if (substr("`var'",1,2) == "i.") continue quietly { local RHS = subinword("$indepvars", "`var'", "", 1) di _n in yellow "var: `var'" di _n in yellow "Before: $indepvars" di _n in yellow "After: `RHS'" di _n in yellow "depvar: $depvar" // Step 1: Regression of y on all independent variables without `var' areg $depvar `RHS' [aw = phrf], absorb(pid) vce(cluster region) // Generate residuals predict res_1_`part' if e(sample), residuals // Step 2: Regression of `var' on all other independent variables areg `var' `RHS' [aw = phrf], absorb(pid) vce(cluster region) // Generate residuals predict res_2_`part' if e(sample), residuals // Regress both residuals on each other to obtain statistics displayed in added-variable plot regress res_1_`part' res_2_`part' local coef = round(_b[res_2_`part'], .0001) local se = round(_se[res_2_`part'], .0001) local t = round(_b[res_2_`part']/_se[res_2_`part'],.01) } twoway scatter res_1_`part' res_2_`part', name(g1, replace) msize(tiny) legend(off) /// ytitle("e(life_sat | X)") xtitle("e(`var' | X)") note("coef = `coef', se = `se', t = `t'") || /// lfit res_1_`part' res_2_`part' //mlabel(pid) // Conversion of global to local since -graph export- cannot read globals local depvar = "$depvar" local model = "$model" local path = "$path" graph export Graphs/`path'/avplot_m_`model'_`part'_`depvar'_`var'.png, replace drop res_* graph drop _all } /* Notes: How added-variable plots should be interpreted... The horizontal variable in the added-variable plot is the residual from the regression of the respective variable on all other independent variables (X). Thus, data points going to the far left/right mean that the value of the respective variable is unusually low/high given all other independent variables (X). Unusual data points can be marked by adding -mlabel(pid)- to the graph. [http://socserv.socsci.mcmaster.ca/jfox/Courses/Brazil-2009/slides-handout .pdf] */ ************************************************** ** 2. Checking normality of residuals /* Notes: Many researchers believe that multiple regression requires normality. This is not the case. Normality of residuals is only required for valid hypothesis testing, that is, the normality assumption assures that the p-values for the t-tests and F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be identically and independently distributed. Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If this were the case than we would not be able to use dummy coded variables in our models. See also: http://www.bwl.uni-kiel.de/bwlinstitute/grad-kolleg/new/typo3conf/ext/naw_ securedl/secure.php?u=0&file=/fileadmin/publications/pdf/2010_Methodik_der _empirischen_Forschung_-_Normalverteilungsannahme__Arne_Schmidt_.pdf&t=127 2745396&hash=047d8763f4331cf740a478480fd48464 */ // Graphical analysis kdensity residuals, normal name(g1, replace) graph box residuals, name(g2, replace) symplot residuals, name(g3, replace) qnorm residuals, name(g4, replace) pnorm residuals, name(g5, replace) /* Notes: The pnorm command graphs a standardized normal probability (P-P) plot while qnorm plots the quantiles of a variable against the quantiles of a normal distribution. pnorm is sensitive to non-normality in the middle range of data and qnorm is sensitive to non-normality near the tails. */ graph combine g1 g2 g3 g4 g5, name(gcomb2, replace) // Conversion of global to local since -graph export- cannot read globals local depvar = "$depvar" local model = "$model" local path = "$path" graph export Graphs/`path'/normality_residuals_m_`model'_`part'_`depvar'.png, replace graph drop _all // Formal tests iqr residuals jb residuals sktest residuals //swilk can only be used with less than 2000 obs. and sfrancia with less than 5000 obs. /* Notes: iqr stands for inter-quartile range and assumes the symmetry of the distribution. Severe outliers consist of those points that are either 3 inter-quartile-ranges below the first quartile or 3 inter-quartile-ranges above the third quartile. The presence of any severe outliers should be sufficient evidence to reject normality at a 5% significance level. Mild outliers are common in samples of any size. The p-value of the sktest is based on the assumption that the distribution is normal. For example, a p-value of .15 indicates that we cannot reject that the residuals are normally distributed at the 5 percent level. */ // Test for normality with pantest2 //pantest2 svyyear ******************************************** ** 3. Checking homoscedasticity // residual vs. fitted plot display _n in yellow "Homoscedasticity: Residual-vs-fitted-plot" scatter residuals yhat, name(g1) msize(tiny) || mband residuals yhat, bands(50) clp(solid) local path = "$path" graph export Graphs/`path'/rvfplot_m_`model'_`part'_`depvar'.png, replace graph drop _all /* Notes: Since we use robust standard errors with - vce(cluster) - we don't need to worry about heteroscedasticity ;) */ ************************************************* ** 4. Checking for multicollinearity display _n in yellow "Multicollinearity (m_$model, $depvar): VIF" // Delete factor variables and time-series operators since they don't work with -collin- global indepvars_new = "$indepvars" foreach var in $indepvars { //display in yellow "Before: $indepvars_new" if (substr("`var'",1,2) == "i.") global indepvars_new = subinstr("$indepvars_new","`var'","",1) //display in yellow "After: $indepvars_new" } collin $indepvars_new /* Notes: "As I understand it, multicollinearity on the right-hand side is much the same irrespective of what is on the left-hand side or what link function is contemplated. So I don't see that you are obliged to do it retrospectively." [http://www.stata.com/statalist/archive/2003-12/msg00335.html] The term collinearity implies that two variables are near perfect linear combinations of one another. When more than two variables are involved it is often called multicollinearity, although the two terms are often used interchangeably.The primary concern is that as the degree of multicollinearity increases, the regression model estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated. We can use the vif command after the regression to check for multicollinearity. vif stands for variance inflation factor. As a rule of thumb, a variable whose VIF values are greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10. It means that the variable could be considered as a linear combination of other independent variables. */ ************************************* ** 5. Checking linearity /* Notes: The most straightforward thing to do is to plot the standardized residuals against each of the predictor variables in the regression model. If there is a clear nonlinear pattern, there is a problem of nonlinearity. acprplot graphs an augmented component-plus-residual plot, a.k.a. augmented partial residual plot. It can be used to identify nonlinearities in the data. Often poorly chosen functional forms show themselves up in patterns on a plot of residuals versus predicted. Conversely, a satisfactory functional form shows up, after some informal or formal smooth, as a flat trace on such plots. Naturally, you should check other diagnostics too, e.g. variance functions. I see no reason why this should not apply in broad terms to panel data too. A disadvantage of this is that it does not produce a P-value or sharp test decision as you evidently prefer. An advantage is that it is flexible and allows you to apply scientific as well as statistical judgement. [http://www.stata.com/statalist/archive/2009-10/msg00731.html] If you want more formal tests then it really helps if you have an alternative, for instance a quadratic relationship, or a spline. If the former is the case I would first use -orthpoly- to create orthogonal polynomials, that way the significance of the quadratic term can be interpreted as a test whether the quadratic terms adds anything. If your alternative is spline, I would use the marginal option. that way the second term measures how much the effect changes befor and after the knot, and the significance of that term tells you if that's significant. Finally, you could look at the -estat ovtest-. However, I think graphs are the best way to find non-linearities and assess whether they are big enough for you to worry about. I would only use tests if I had a specific hypothesis about that non-linearity. [http://www.stata.com/statalist/archive/2006-11/msg00448.html] */ /* // component-plus-residual-plot (partial residual plot) display _n in yellow "Linearity: Component-plus-residual-plot" foreach var in $indepvars { // don't create plot for categorical variables if (substr("`var'",1,2) == "i.") continue gen residuals_plus = residuals + _b[`var']*`var' //see Kohler & Kreuter (2008, p. 214) scatter residuals_plus `var', name(g1) msize(tiny) || mband residuals_plus `var', bands(50) // Conversion of global to local since -graph export- cannot read globals local depvar = "$depvar" local model = "$model" local path = "$path" graph export Graphs/`path'/cprplot_m_`model'_`part'_`depvar'_`var'.png, replace //scatter residuals_plus `var', name(g2) msize(tiny) || lowess residuals_plus `var', sort yvarlabel("Lowess smooth") // Computation of lowess smooth takes ages? drop residuals_plus graph drop _all } */ // augmented-component-plus-residual-plot display _n in yellow "Linearity: Augmented-component-plus-residual-plot" foreach var in $indepvars { // don't create plot for categorical variables if (substr("`var'",1,2) == "i.") continue gen sq = `var'^2 areg $depvar $indepvars sq, absorb(pid) vce(cluster region) gen residuals_augm = residuals + _b[`var']*`var' + _b[sq]*sq scatter residuals_augm `var', name(g1) msize(tiny) || mband residuals_augm `var', bands(50) || lfit residuals_augm `var', lstyle(refline) legend(off) // Conversion of global to local since -graph export- cannot read globals local depvar = "$depvar" local model = "$model" local path = "$path" graph export Graphs/`path'/acprplot_m_`model'_`part'_`depvar'_`var'.png, replace drop sq residuals_augm graph drop _all } *********************************************** ** 6. Checking model specification /* Notes: A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimate of regression coefficients. The linktest command performs a model specification link test for single-equation models. linktest is based on the idea that if a regression is properly specified, one should not be able to find any additional independent variables that are significant except by chance. linktest creates two new variables, the variable of prediction, _hat, and the variable of squared prediction, _hatsq. The model is then refit using these two variables as predictors. _hat should be significant since it is the predicted value. On the other hand, _hatsq shouldn't, because if our model is specified correctly, the squared predictions should not have much explanatory power. That is we wouldn't expect _hatsq to be a significant predictor if our model is specified correctly. So we will be looking at the p-value for _hatsq. Results: We find evidence that our model is misspecified. The coefficient for _hatsq after -linktest- is significant. And the RESET(4) also rejects the H0. */ display _n in yellow "Model specification: linktest" gen yhat_sq = yhat^2 areg $depvar yhat yhat_sq [aw = phrf], absorb(pid) vce(cluster region) local coeff_yhat_sq = _b[yhat_sq] display _n in yellow "p-value for yhat_sq should be insignificant!" drop yhat_sq ************************************************** ** 7. Checking issues of independence /* Notes: The statement of this assumption that the errors associated with one observation are not correlated with the errors of any other observation cover several different situations. Consider the case of collecting data from students in eight different elementary schools. It is likely that the students within each school will tend to be more like one another than students from different schools, that is, their errors are not independent. We will deal with this type of situation (...) when we demonstrate the (...) cluster option. Another way in which the assumption of independence can be broken is when data are collected on the same variables over time. Let's say that we collect truancy data every semester for 12 years. In this situation it is likely that the errors for observation between adjacent semesters will be more highly correlated than for observations more separated in time. This is known as autocorrelation. Results: Concerning autocorrelation, -xtserial- shows evidence for first-order autocorrelation. Note that if the panel identifier (e.g. individuals, firms, or countries) is the cluster() variable, then Rogers standard errors are heteroscedasticity and autocorrelation consistent. => Do the estimations again with vce(cluster pid) and check if the results change. */ xi: xtserial $depvar $indepvars // Estimate with -xtregar- and report differences xi: xtscc $depvar $indepvars, fe /* Further commands for treating autocorrelation: xtabond newey2 -> not with fixed effects xtgls */ ********************************************* ** 8. Checking measurement error // ?? *************************************************** ** 9. Checking exogeneity of variables // Hausman tests!? // http://www.stata.com/support/faqs/stat/endogeneity.html * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: 95%CrI** - Next by Date:
**st: Breusch Pagan test after xtregar** - Previous by thread:
**st: 95%CrI** - Next by thread:
**st: Breusch Pagan test after xtregar** - Index(es):