# Re: st: stata code for two-part model

 From Shehzad Ali To statalist@hsphsun2.harvard.edu Subject Re: st: stata code for two-part model Date 19 Aug 2008 06:33:49 +0100

Thank you all for your very useful thoughts on this issue.
I am running regressions on two separate sets of expenditure data: one for general health expenditure, which includes all costs (including those for self-medication etc.), and a second for expenditure on formal health care, which includes primary and hospital care but excludes self-medication.

I agree that the two-part model is not the best option, but is the -heckman- model a reasonable alternative if the selection step is for zero/non-zero expenditure and the outcome step is for positive expenditure? Looking at Austin's argument, I understand that -heckman- runs into a similar problem to the two-part model. Is that right?
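
For concreteness, the -heckman- setup I have in mind would look something like the sketch below. It uses simulated data, and every variable name (x1, x2, z, care, lncost) is hypothetical; z is an assumed exclusion restriction that enters only the selection step. It shows the mechanics only and does not settle whether a selection model is appropriate here.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 20080819
set obs 2000
generate double x1 = rnormal()
generate double x2 = rnormal()
generate double z  = rnormal()              // assumed to affect selection only
generate double u  = rnormal()              // selection error
generate double e  = 0.5*u + sqrt(0.75)*rnormal()   // outcome error, correlated with u

* Selection step: zero vs. non-zero expenditure
generate byte care = (0.3 + 0.8*x1 - 0.5*x2 + 0.7*z + u) > 0

* Outcome step: log expenditure, observed only when expenditure is positive
generate double lncost = 4 + 0.2*x1 + 0.1*x2 + e if care == 1

* Missing lncost marks the non-selected observations
heckman lncost x1 x2, select(x1 x2 z)

* Predicted probability of selection and E(lncost | selected, x)
predict double p_sel, psel
predict double lncost_cond, ycond
summarize p_sel lncost_cond
```

Note that the outcome here is log expenditure, so predictions on the raw cost scale would still need a retransformation.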

On Aug 18 2008, Austin Nichols wrote:

```In expectation?  People who have truly zero probability of incurring
hospital costs?

On Mon, Aug 18, 2008 at 1:08 PM, Lachenbruch, Peter
<Peter.Lachenbruch@oregonstate.edu> wrote:
```
```The problem was about hospitalization costs.  These can be true zeros.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Austin
Nichols
Sent: Monday, August 18, 2008 9:38 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: stata code for two-part model

Peter <Peter.Lachenbruch@oregonstate.edu>:
I think this claim is a bit of a red herring: "use of a continuous
model for data in which there is a clump of zeros seems incorrect."
Note that the -glm- approach assumes the mean of y given observables X
is nonzero, and E(y|X)=exp(Xb), not that observed y is nonzero!
Including the observations where y=0 is the whole point of the -glm-
approach--otherwise we would run ols regression of ln(y) on X.  And if
you are claiming that the "true" model for (expected) healthcare
expenditures does have true zeros that are identifiable, then I
disagree. Some of your obs may spend nothing on health care (though
annual spending, including myriad items such as aspirin, is unlikely
to truly be zero for anyone) but that does not mean their conditional
mean should be zero.  Maybe people who are dead have a conditional
mean of zero, but they should probably be excluded from the
analysis...

When spending is measured in discrete dollars, a big clump of people
who have predicted spending less than 50 cents may have a conditional
mean of zero measured in the same units as the data.  But that does
not mean their "true" conditional mean is zero.

That said, a demand/expenditure model will have more and more "true"
(or rounded off) zeros as the category of demand/expenditure gets
narrower and narrower and the time window over which it is measured
gets narrower... think aspirin expenditures by week or day... but it
is not clear to me that a two-part model is the right approach even in
those cases.

On Mon, Aug 18, 2008 at 11:33 AM, Lachenbruch, Peter
<Peter.Lachenbruch@oregonstate.edu> wrote:
```
```In some instances, the model for healthcare expenditures does have true
zeros that are identifiable.  In one study I consulted on the data came
from a health insurer, and zeros were people who had not gone to
hospital.

The use of a continuous model for data in which there is a clump of
zeros seems incorrect.  There is no transformation that can remove this
clump.  The severity of the problem depends a bit on the size of the
clump.  In the hospital insurance data (wanting to estimate
hospitalization costs in the policy holders) 95% of the population had
no costs.  Pretending that these were continuous would lead to some
nonsense results.  At the present time, I have a data set that has 32
out of 145 people with zeros.  However, these are not necessarily
identifiable since they could be slightly greater than zero.  I'm
gritting my teeth on this and pretending all is well.  However, a
histogram shows enormous skewness.  I'll probably try a square root.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Austin
Nichols
Sent: Saturday, August 16, 2008 8:50 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: stata code for two-part model

http://www.nber.org/papers/t0228
The two part models of health expenditures have always struck me as a
bad idea; think about how you would get predictions for each indiv in
your sample.  The "stage 1" probit classifies people as having
expenditures or not (some correctly, some not) and then the "stage 2"
ols model gives predicted expenditures only for those people who
actually have positive expenditures (not those who are classified by
the probit as likely to have positive expenditures) unless you predict
out of sample.  At least one preferred approach of calculating
marginal effects by comparing predictions over the whole sample turns
out to be practically and analytically difficult in that setting.
However, a -glm- with a log link (or equivalently a -poisson-
regression) has no trouble: those people with extremely low predicted
expenditures would round to zero predicted expenditures if you thought
about a survey with expenditures measured discretely in dollars, say.
Everyone has E(y)=exp(Xb) and there is no real issue with calculating
marginal effects.  Once you are in the -glm- framework it is also easy

On Sat, Aug 16, 2008 at 3:41 AM, Eva Poen <eva.poen@gmail.com> wrote:
```
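
The log-link -glm- approach described in the message above can be sketched roughly as follows. This is a minimal illustration on simulated data, not code from the thread: the variable names (x1, x2, cost) and the Poisson data-generating process are assumptions made only so the example runs on its own.

```stata
* Simulated expenditure data with a clump of observed zeros; names are hypothetical
clear
set seed 2008
set obs 2000
generate double x1 = rnormal()
generate double x2 = rnormal()
generate double mu = exp(0.5*x1 - 0.3*x2)
generate double cost = rpoisson(mu)         // low means produce many observed zeros

* Log-link GLM fitted on all observations, zeros included: E(cost|x) = exp(xb)
glm cost x1 x2, family(poisson) link(log) vce(robust)
predict double cost_hat, mu

* -poisson- with robust standard errors gives the same point estimates
poisson cost x1 x2, vce(robust)

* With a log link, dE(cost|x)/dx1 = b1*exp(xb), defined for every observation
generate double me_x1 = _b[x1]*cost_hat
summarize cost_hat me_x1
```

Every observation, including those with cost equal to zero, gets a strictly positive predicted mean and a marginal effect, which is the property the message emphasizes.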
```Shehzad,

this looks like a hurdle model. Have you searched the SSC archives to
see if someone else has programmed it for you? Have a look at
-hplogit-, for example.

If you end up doing it yourself, I think you need to do a bit of
programming. In order for -mfx- to work after your estimation, you
need a way of telling it what you want the marginal effects to be
calculated for. In your case, this would be the overall expected cost
of care from your model. The way to feed this to -mfx- is via the
predict(predict_option), but for this to work you need to write a
-predict- command and an estimation command for your model.

See for example this post:
http://www.stata.com/statalist/archive/2005-10/msg00091.html

Hope this helps,
Eva

```
```Hi,

I was wondering if someone can help with stata code for calculating marginal
effects after two-part models for say, cost of care. Here, first part is a
probit model for seeking care or not, and the second part is an OLS model of
cost of care, conditional on decision to seek care. Here is the simplified
code:

probit care $xvar

reg cost $zvar if care==1

mfx

I understand that mfx after the second part gives us the marginal effects
for the OLS part only, and not the conditional marginal effects.

Any help would be appreciated.

Thanks,

```
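
One way to combine the two parts from the question above is sketched here, under stated assumptions. The data are simulated, the variable names (x1, x2, care, cost) are hypothetical, and both parts use the same regressors for simplicity, whereas the original code allows different lists in $xvar and $zvar. The overall expected cost is Pr(care=1|x) times E(cost|care=1,x), and the marginal effect of x1 follows from the product rule. The sketch gives point estimates only; standard errors would need something further, such as a bootstrap over both steps.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 12345
set obs 2000
generate double x1 = rnormal()
generate double x2 = rnormal()
generate byte care = (0.3 + 0.8*x1 - 0.5*x2 + rnormal()) > 0
generate double cost = 100 + 20*x1 + 10*x2 + 15*rnormal() if care == 1
replace cost = 0 if care == 0

* Part 1: probit for seeking care
probit care x1 x2
predict double p_care, pr                   // Pr(care=1|x)
predict double xb1, xb                      // probit linear index
scalar a1 = _b[x1]

* Part 2: OLS for cost among those who sought care
regress cost x1 x2 if care == 1
predict double cost_hat, xb                 // fitted for all observations, not only care==1
scalar b1 = _b[x1]

* Overall expected cost: E(cost|x) = Pr(care=1|x) * E(cost|care=1,x)
generate double ecost = p_care*cost_hat

* Product rule: dE(cost|x)/dx1 = phi(xb1)*a1*cost_hat + Phi(xb1)*b1
generate double me_x1 = normalden(xb1)*a1*cost_hat + p_care*b1
summarize ecost me_x1
```

The mean of me_x1 is an average marginal effect over the whole sample, which is different from what -mfx- reports after the second -reg- alone.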
```*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


```
```--
Department of Social Policy & Social Work
University of York
YO10 5NG
+44 (0) 773-813-0094

```