# Re: st: regression with heavily skewed dependant data (was "tobit?")

 From "Seed, Paul" To "statalist@hsphsun2.harvard.edu" Subject Re: st: regression with heavily skewed dependant data (was "tobit?") Date Wed, 13 Aug 2008 12:06:53 +0100

```
Most responses have focussed on the question of whether & how to use least squares regression with Mona's data.

Can I raise some other issues, based on experience with similar data:
1) Clean the data thoroughly.  Check in particular for heights < 5 ft or > 6 ft 3 in and BMI < 15 or > 50 kg/m2. Also look out for errors in units - weights in pounds rather than kg and heights in inches rather than cm.  Correct or remove all such weird outliers.
2) Clarify the research question. Both socio-economic status and body mass index have very complex causes and effects.  SES, in particular, is not well defined.  "Evaluating the effect of SES on BMI" is not possible in an observational study.

Correcting for possible confounders will not alter the fact that you don't know to what extent obesity causes low SES, or low SES causes obesity - obesity can cause low SES via poor health, low self esteem, & unemployment; while low SES can cause obesity via low self esteem, lack of activity, cheap fat-rich diet...

Exactly what you do depends partly on what your aims are, partly on your sample size, partly on the nature of the data.

3) When considering possible transformations, I would aim for ones that give approximate symmetry.  In my own data, 1/BMI = height^2/weight performed quite well: skewness -.0393561, kurtosis 3.708181 (a normal distribution has skewness 0 and kurtosis 3).

4) I would consider the -xriml- commands by Eileen Wright & Patrick Royston (STB 40).  These give a very wide range of transformations for the dependent variable, and also allow for fitting non-linear relationships via fractional polynomials.  They give graphs of the relationship on the original scales, with reference ranges & actual values plotted.  These graphs are a very good way of checking how accurately the model describes the data.  Note: that is not the same as how accurately SES predicts BMI. "No relationship between SES & BMI" could in principle be a very accurate description of reality.

Use -findit xriml- to find & install the latest version of the command.

5) The most interesting result would perhaps be to show a maximum or minimum; second would be to show that all the effect was confined to very low (or very high) levels of SES.  Make sure your models allow for such possibilities.  Again, not a problem in -xriml-.

Quoting Mona Mowafi <mmowafi@hsph.harvard.edu>:

> Dear statalisters,
>
> I have a dataset in which I am evaluating the effect of SES on BMI
> and BMI is heavily skewed toward obesity (i.e. over 50% of the
> sample >30 BMI).  I preferred to run a linear regression so as to
> use the full range of data, but the outcome distribution violates
> normality assumption and I've tried ln, log10, and sqrt
> transformations, none of which work.
>
> Is it appropriate to use tobit for modeling BMI in this instance?
> If not, any suggestions?
>
> Your insight is much appreciated.
>
> Many thanks,
> Mona

```
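
Points 1 and 3 of the message above (range checks, then a symmetry check on 1/BMI) might be sketched in Stata roughly as follows. This is only an illustration under stated assumptions: the variable names `height_cm`, `weight_kg`, `bmi`, `suspect` and `inv_bmi` are hypothetical, the cut-offs are simply the ones quoted in the message converted to metric (5 ft = 152.4 cm, 6 ft 3 in = 190.5 cm), and setting suspect values to missing is just one of the two options mentioned there. Correcting values from source records is preferable where that is possible.

```stata
* Hypothetical variable names: height_cm (in cm), weight_kg (in kg)
generate double bmi = weight_kg / (height_cm/100)^2

* Flag implausible values using the cut-offs quoted in the message
* (5 ft = 152.4 cm; 6 ft 3 in = 190.5 cm). Missing values are excluded
* explicitly because Stata treats missing as larger than any number.
generate byte suspect = (height_cm < 152.4 | height_cm > 190.5 | bmi < 15 | bmi > 50) ///
    if !missing(height_cm, bmi)

* Inspect the flagged records, e.g. for pounds/inches entered as kg/cm
list height_cm weight_kg bmi if suspect == 1

* After correcting what can be corrected, exclude the remainder
replace bmi = . if suspect == 1

* Reciprocal transformation: 1/BMI = height^2/weight
generate double inv_bmi = 1/bmi

* -summarize, detail- reports skewness and kurtosis
* (0 and 3 respectively for a normal distribution)
summarize inv_bmi, detail
histogram inv_bmi, normal
```

The flagged records are listed before anything is changed so that unit errors can be spotted and fixed rather than silently dropped.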