Statalist



Re: st: What multiple regression model for extreme distributions


From   Michael Norman Mitchell <Michael.Norman.Mitchell@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: What multiple regression model for extreme distributions
Date   Tue, 02 Feb 2010 12:59:27 -0800

Dear Muhammed

I think this may be more a question of theory than of statistics. The great answers posted so far reflect, as I see it, different theoretical assumptions about the nature of the outcome (saving) and about how the predictors relate to it. Each of these statistical suggestions could be valid under a different theoretical framework. Returning to the literature on the nature of "saving" for a theoretical basis might help inform which statistical model to select. It would also be an opportunity to see which statistical models have been accepted in past publications.

I know this is more work, but if the aim is publication, it may be worthwhile.

Best regards,

Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com
Visit me on Facebook at...
http://www.facebook.com/MichaelNormanMitchell


muhammed abdul khalid wrote:
Hi,
Thank you for the replies.

The data are cross-sectional, and saving is simply measured from
respondents' answers on how much saving they have (in dollars), with the
minimum being zero. There is no negative saving. Yes, saving is my
dependent variable.

I tried logit, zip, zinb, and nbreg, but their standard errors vary greatly.
I am still unsure which model should be used. My objective is to predict
the contribution of education, gender, location and ethnicity to
the saving of the household.

Thank you again for your kind responses.

Muhammed
SciencesPo Paris.






2010/2/2 Austin Nichols <austinnichols@gmail.com>:
You have had a number of good suggestions already, but as Nick Cox
points out, the distribution of the dependent variable is not all that
relevant to what model you choose; it is the distribution of the
dependent variable conditional on explanatory variables that is
important.  Before you estimate a two-part "hurdle" or zero-inflated
model, I urge you to consider that the right set of explanatory
variables might well capture the reason for a large number of zero
outcomes (e.g. using -poisson- instead of -zip- etc.).  When it comes
to household saving (I think that is your dependent variable, not
independent), you also want to consider debt.  It may be the case that
households you are coding as zeros actually have negative saving
during the period under study.  Do you have panel data, or
cross-sectional data?  How is saving measured?
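The point about conditional versus marginal distributions can be illustrated with a small simulation. This is only a sketch under made-up assumptions, not anything fitted to the survey in question: the 0/1 covariate x, the conditional means 0.05 and 5, and the function names are all hypothetical. The outcome is an ordinary Poisson count given x, yet pooled over x roughly two thirds of the values are zero, far more than a single Poisson with the pooled mean would imply. Within each level of x, the observed share of zeros is close to what a Poisson implies.

```python
import math
import random


def poisson_draw(rng, mu):
    """Draw one Poisson(mu) value using Knuth's multiplication method."""
    limit = math.exp(-mu)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(n=20000, seed=1):
    """Return (x, y) pairs: x is a hypothetical 0/1 covariate, y a Poisson count given x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0
        mu = 5.0 if x == 1 else 0.05  # made-up conditional means
        data.append((x, poisson_draw(rng, mu)))
    return data


def zero_check(ys):
    """Return (observed share of zeros, exp(-mean)), the latter being the Poisson-implied share."""
    mean = sum(ys) / len(ys)
    observed = sum(1 for y in ys if y == 0) / len(ys)
    return observed, math.exp(-mean)


if __name__ == "__main__":
    data = simulate()
    print("all households:", zero_check([y for _, y in data]))
    for group in (0, 1):
        print("x =", group, zero_check([y for x, y in data if x == group]))
```

In this toy setup no zero-inflation component is needed once x is in the model, which is the sense in which the right explanatory variables might account for the zeros. Whether that holds for real saving data is an empirical question.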

On Tue, Feb 2, 2010 at 10:09 AM, <muhammed.abdulkhalid@gmail.com> wrote:
I have household income survey data (38,000 observations), and my
problem is doing a multiple regression on saving (independent var) to
ethnicity/strata/employment
etc. (dependent var).

The problem is this: 70% of my observations have a saving value of
zero. I recoded those to 1 and logged them, but the distribution is still
extremely skewed (mean 0.78, std dev 2.4, min 0, max 14). The
histogram still looks like the letter L, extremely skewed to the
right with a long tail. Obviously, OLS is out, and I tried Poisson
(glm nbinomial), but the distribution is still not normal.
The data are in order, i.e. no missing values etc. They are clean. For
some reason, lobit would not run.
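A note on why the recode-and-log step leaves the histogram L-shaped: log(1) is 0, so every zero stays at zero and only the positive tail is compressed. The sketch below uses ten made-up saving amounts (not the survey data) and a hypothetical helper, share_at_zero, to show that the share of observations sitting at zero is unchanged by the transform.

```python
import math

# Made-up saving amounts in dollars: 7 of 10 households report zero.
saving = [0, 0, 0, 0, 0, 0, 0, 50, 400, 3000]

# The recode described above: replace 0 with 1, then take logs, so log(1) = 0.
logged = [math.log(s if s > 0 else 1) for s in saving]


def share_at_zero(values):
    """Fraction of values exactly equal to zero."""
    return sum(1 for v in values if v == 0) / len(values)


if __name__ == "__main__":
    # The spike at zero has the same size before and after the transform.
    print(share_at_zero(saving), share_at_zero(logged))
```

No monotone transformation can spread out a point mass, so the spike at zero has to be handled by the model rather than by rescaling the outcome.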
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/






