
## Fractional outcome regression

### Highlights

• Model fractions, proportions, rates, etc.
• Fractional probit model
• Fractional logit model
• Fractional heteroskedastic probit model
• Odds ratios for fractional logit models
• Beta regression

### What's this about?

Fractional responses concern outcomes between zero and one.

The most natural way fractional responses arise is from averaged 0/1 outcomes. In such cases, if you know the denominator, you should fit the model using standard probit or logistic regression. For instance, the fractional response might be 0.25, but if the data also record that 9 out of 36 had a positive outcome, you can use the standard estimation commands.
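To see why the denominator matters, here is a small Python sketch (the function names are illustrative and are not Stata commands). It compares the binomial log likelihood for 9 positives out of 36 with the Bernoulli-type log likelihood of the fraction 0.25 alone. Apart from a constant that does not depend on the probability, the first is the second multiplied by the denominator, and that is exactly the information a bare fraction does not carry.

```python
import math

def binomial_loglik(successes, trials, p):
    """Log likelihood of `successes` out of `trials` at success probability p."""
    log_choose = math.log(math.comb(trials, successes))
    return log_choose + successes * math.log(p) + (trials - successes) * math.log(1 - p)

def fraction_loglik(y, p):
    """Bernoulli-type log likelihood of a fraction y in [0, 1] at mean p."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)

successes, trials = 9, 36
y = successes / trials   # the fraction that would be recorded: 0.25
p = 0.3                  # an arbitrary trial value for the mean

# Drop the combinatorial constant, which does not involve p.
kernel = binomial_loglik(successes, trials, p) - math.log(math.comb(trials, successes))

print("fraction:", y)
print("binomial kernel:          ", kernel)
print("trials * fraction loglik: ", trials * fraction_loglik(y, p))
```

The two printed log-likelihood values agree up to floating-point rounding, so a group of 36 counts 36 times as much as a single 0/1 observation. When the denominator is unknown, that weighting is unavailable.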

Fractional response models are for use when the denominator is unknown. That can include averaged 0/1 outcomes such as participation rates, but can also include variables that are naturally on a 0 to 1 scale such as pollution levels, patient oxygen saturation, and Gini coefficients (inequality measures).

Fractional response estimators fit models on continuous zero to one data using probit, logit, heteroskedastic probit, and beta regression. Beta regression can be used only when the endpoints zero and one are excluded.
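The endpoint restriction can be illustrated with a short Python sketch. It assumes the fractional probit and logit estimators maximize a Bernoulli quasi-log-likelihood, which stays finite at y = 0 and y = 1, whereas a beta density is only used on the open interval. All names and the value of the linear predictor are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, the probit mean function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_logit(z):
    """Logistic CDF, the logit mean function."""
    return 1.0 / (1.0 + math.exp(-z))

def quasi_loglik(y, mean):
    """Bernoulli quasi-log-likelihood contribution; usable for 0 <= y <= 1."""
    return y * math.log(mean) + (1.0 - y) * math.log(1.0 - mean)

def beta_logpdf(y, a, b):
    """Log density of a Beta(a, b) variable; requires 0 < y < 1."""
    if not 0.0 < y < 1.0:
        raise ValueError("beta regression requires 0 < y < 1")
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)

xb = -0.4  # hypothetical linear predictor
for name, mean in [("probit", std_normal_cdf(xb)), ("logit", inv_logit(xb))]:
    # An observed y of exactly 0 poses no problem for the quasi-likelihood.
    print(name, round(mean, 4), quasi_loglik(0.0, mean))

try:
    beta_logpdf(0.0, 2.0, 3.0)
except ValueError as err:
    print("beta:", err)
```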

### Let's see it work

We are going to analyze an air-pollution index that is scaled 0 to 1, inclusive, although 1 (complete pollution) is virtually impossible, and in our data, we observe values only up to 0.8. We do observe the opposite endpoint, however. Zero means no measurable pollution. Our data are for various cities.

In this 0 to 1 variable, values between 0 and 0.3 have no public health implications, and values greater than 0.7 imply people with breathing or health problems should remain indoors.

We model pollution as determined by the number of older, pollution-producing cars per capita (cars); a policy indicator, entered as the factor variable i.policy; and the percentage of output due to industry (industrial). We use probit. We type

. fracreg probit pollution cars i.policy industrial

Iteration 0:   log pseudolikelihood = -956.99975
Iteration 1:   log pseudolikelihood = -687.24995
Iteration 2:   log pseudolikelihood =   -686.306
Iteration 3:   log pseudolikelihood = -686.30593
Iteration 4:   log pseudolikelihood = -686.30593

Fractional probit regression                           Number of obs =   1,234
Wald chi2(3)  = 8360.33
Prob > chi2   =  0.0000
Log pseudolikelihood = -686.30593                      Pseudo R2     =  0.1189

| pollution | Coefficient | Robust std. err. | z | P>\|z\| | [95% conf. | interval] |
|---|---|---|---|---|---|---|
| cars | .5051425 | .0159369 | 31.70 | 0.000 | .4739069 | .5363782 |
| policy | | | | | | |
| policy | -.9886715 | .0114137 | -86.62 | 0.000 | -1.011042 | -.9663011 |
| industrial | .2726658 | .0298289 | 9.14 | 0.000 | .2142022 | .3311294 |
| _cons | -.6952776 | .0265953 | -26.14 | 0.000 | -.7474034 | -.6431518 |

We find more pollution where there are more older cars and more industry, and less pollution for the policy category shown in the output relative to its base category. How good are you at reading probit's N(0,1) standardized coefficients?
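One way to read the coefficients is to push them through the standard normal CDF yourself. The Python sketch below uses the coefficients reported in the fracreg output above. The covariate values describe a made-up city, and policy is treated as a 0/1 indicator, which is an assumption about how the variable is coded.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Coefficients copied from the fracreg output.
b_cars, b_policy, b_industrial, b_cons = 0.5051425, -0.9886715, 0.2726658, -0.6952776

def predicted_pollution(cars, policy, industrial):
    """Return the predicted mean of pollution and the linear predictor."""
    xb = b_cons + b_cars * cars + b_policy * policy + b_industrial * industrial
    return std_normal_cdf(xb), xb

# Hypothetical city: these covariate values are invented for illustration.
mean, xb = predicted_pollution(cars=1.0, policy=0, industrial=0.5)
mean_with_policy, _ = predicted_pollution(cars=1.0, policy=1, industrial=0.5)

# For a continuous covariate, the slope of the mean is pdf(xb) times the coefficient.
slope_cars = std_normal_pdf(xb) * b_cars

print(f"predicted pollution, policy = 0: {mean:.3f}")
print(f"predicted pollution, policy = 1: {mean_with_policy:.3f}")
print(f"slope with respect to cars at this point: {slope_cars:.3f}")
```

The slope depends on where the city sits on the normal curve, which is why a single coefficient does not translate into a single effect on the 0 to 1 scale.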

margins will make interpreting our results easier. We can ask margins to report the predicted mean of pollution at chosen values of a covariate, here cars equal to 1, 2, and 3:

. margins, at(cars=(1(1)3))

Predictive margins                                       Number of obs = 1,234
Model VCE: Robust

Expression: Conditional mean of pollution, predict()
1._at: cars = 1
2._at: cars = 2
3._at: cars = 3

| _at | Margin | Delta-method std. err. | z | P>\|z\| | [95% conf. | interval] |
|---|---|---|---|---|---|---|
| 1 | .2694474 | .0023145 | 116.42 | 0.000 | .264911 | .2739837 |
| 2 | .4334544 | .0041046 | 105.60 | 0.000 | .4254095 | .4414992 |
| 3 | .6112179 | .0092085 | 66.38 | 0.000 | .5931696 | .6292661 |
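The arithmetic behind this table is easy to reproduce. The Python sketch below takes the three margins and standard errors, keyed by the cars values 1, 2, and 3 listed in the legend, rebuilds the normal-based 95% confidence intervals, and computes the change in predicted pollution for each additional unit of cars. The variable names are illustrative.

```python
# Predictive margins and delta-method standard errors copied from the margins output.
margins = {
    1: (0.2694474, 0.0023145),
    2: (0.4334544, 0.0041046),
    3: (0.6112179, 0.0092085),
}

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

intervals = {}
for cars, (margin, se) in margins.items():
    intervals[cars] = (margin - Z_975 * se, margin + Z_975 * se)
    low, high = intervals[cars]
    print(f"cars = {cars}: margin = {margin:.4f}, 95% CI = ({low:.4f}, {high:.4f})")

# Change in predicted pollution for each additional unit of cars.
step_1_to_2 = margins[2][0] - margins[1][0]
step_2_to_3 = margins[3][0] - margins[2][0]
print(f"cars 1 -> 2: {step_1_to_2:.4f}")
print(f"cars 2 -> 3: {step_2_to_3:.4f}")
```

The two steps are not equal, which reflects the nonlinearity of the probit mean function.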

We find that predicted pollution rises from 0.27 when cars is 1 to 0.43 when cars is 2 and to 0.61 when cars is 3, averaging over the observed values of the other covariates.

These are predicted levels of pollution. If we wanted the effects of proportional changes in the covariates instead, we would ask margins for dyex(), not eyex(). The dependent variable is already a proportion and so is already on a percentage scale. We just need its change, not its percentage change.
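For reference, dyex() and eyex() differ only in whether the outcome is also put on a log scale. Writing y for the predicted mean of pollution and x for a covariate (notation introduced here for illustration), the standard definitions, shown alongside the plain derivative dydx(), are:

```latex
\begin{aligned}
\texttt{dydx}:&\quad \frac{\partial y}{\partial x} \\[4pt]
\texttt{dyex}:&\quad \frac{\partial y}{\partial \ln x} = x\,\frac{\partial y}{\partial x} \\[4pt]
\texttt{eyex}:&\quad \frac{\partial \ln y}{\partial \ln x} = \frac{x}{y}\,\frac{\partial y}{\partial x}
\end{aligned}
```

With dyex(), the outcome stays in its own units, which suits a dependent variable that is already a proportion.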

### Let's see it work with beta regression

Let's look at the effect of democratic institutions on income inequality. We have fictional data on a cross-section of countries in which inequality is measured using the Gini coefficient. The Gini coefficient is one if one person has all the income in a society and zero if income is equally divided among everyone. Values of zero and one simply do not happen, of course. In our data, the average Gini coefficient is 0.41. For your information, Sweden's coefficient was roughly 0.23 in 2005 (they are proud of their equality), and Haiti's is 0.59.

The beta distribution is often used to model the Gini coefficient and other zero to one variables that can have long tails and exclude the endpoints. We type

. betareg  gini i.rural i.democracy i.colony

Beta regression                                 Number of obs     =        160
LR chi2(6)        =     146.52
Prob > chi2       =     0.0000

Link function  :  g(u) = log(u/(1-u))           [Logit]
Slink function :  g(u) = log(u)                 [Log]

Log likelihood =  157.79178

| gini | Coefficient | Std. err. | z | P>\|z\| | [95% conf. | interval] |
|---|---|---|---|---|---|---|
| gini | | | | | | |
| rural | | | | | | |
| rural | .1567357 | .0680008 | 2.30 | 0.021 | .0234567 | .2900147 |
| democracy | | | | | | |
| low | -.4798286 | .0748253 | -6.41 | 0.000 | -.6264834 | -.3331737 |
| medium | -.7774981 | .0931349 | -8.35 | 0.000 | -.9600391 | -.594957 |
| med-high | -1.303923 | .1363737 | -9.56 | 0.000 | -1.571211 | -1.036636 |
| high | -1.521037 | .1775991 | -8.56 | 0.000 | -1.869125 | -1.17295 |
| colony | | | | | | |
| colony | .2368402 | .0805578 | 2.94 | 0.003 | .0789498 | .3947306 |
| _cons | -.0471008 | .0528853 | -0.89 | 0.373 | -.150754 | .0565524 |
| scale | | | | | | |
| _cons | 3.279796 | .1099443 | 29.83 | 0.000 | 3.064309 | 3.495283 |

We have modeled income inequality on the country's ruralness, level of democracy, and whether it is a former colony. In these fictional data, former colonies tend to have higher inequality, and the stronger the democracy, the less the inequality.
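Because the output reports a logit link for the mean and a log link for the scale, the coefficients can be turned into predicted Gini values by hand. The Python sketch below does this for a few hypothetical country profiles, with every indicator not mentioned held at its base level. The variance line assumes the common mean-and-scale parameterization of the beta distribution, in which the variance is mu(1 - mu)/(1 + scale). That is an assumption made here, not something shown in the output.

```python
import math

def inv_logit(z):
    """Inverse of the logit link log(u / (1 - u))."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients copied from the betareg output.
b_cons = -0.0471008
b_democracy_high = -1.521037
b_colony = 0.2368402
scale_cons = 3.279796

# Hypothetical profiles; all other indicators are at their base levels.
mean_base = inv_logit(b_cons)
mean_high_democracy = inv_logit(b_cons + b_democracy_high)
mean_former_colony = inv_logit(b_cons + b_colony)

# The scale equation uses a log link, so exponentiate its constant.
scale = math.exp(scale_cons)

# Assumption: variance = mu * (1 - mu) / (1 + scale).
sd_base = math.sqrt(mean_base * (1.0 - mean_base) / (1.0 + scale))

print(f"base profile:        {mean_base:.3f}")
print(f"high democracy:      {mean_high_democracy:.3f}")
print(f"former colony:       {mean_former_colony:.3f}")
print(f"scale: {scale:.2f}, implied sd at base: {sd_base:.3f}")
```

These are predictions for single profiles, so their differences need not match effects that are averaged over the whole sample.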

We will use margins to make the effect of democracy easier to interpret:

. margins, dydx(democracy)

Average marginal effects                                   Number of obs = 160
Model VCE: OIM

Expression: Conditional mean of gini, predict()
dy/dx wrt:  1.democracy 2.democracy 3.democracy 4.democracy

| | dy/dx | Delta-method std. err. | z | P>\|z\| | [95% conf. | interval] |
|---|---|---|---|---|---|---|
| democracy | | | | | | |
| low | -.1178869 | .0181028 | -6.51 | 0.000 | -.1533678 | -.082406 |
| medium | -.1860353 | .0210805 | -8.82 | 0.000 | -.2273524 | -.1447183 |
| med-high | -.2893892 | .0249533 | -11.60 | 0.000 | -.3382967 | -.2404817 |
| high | -.3245293 | .0284448 | -11.41 | 0.000 | -.38028 | -.2687785 |

Note: dy/dx for factor levels is the discrete change from the base level.

Reported is the change in the outcome variable (inequality) for a change in democracy. The base (omitted) category is total absence of democracy. Thus, being categorized as low rather than as having a total absence of democracy decreases inequality by 0.12. Being categorized as medium decreases inequality by 0.19, again relative to the base category, and so on.
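Since every reported effect is measured from the base category, the change from one democracy level to the next is the difference between adjacent rows. A short Python sketch of that arithmetic, using the values from the margins output (the variable names are illustrative):

```python
# Average marginal effects of democracy, copied from the margins output.
# Each value is the change in the expected Gini coefficient relative to the base category.
effect_vs_base = {
    "low": -0.1178869,
    "medium": -0.1860353,
    "med-high": -0.2893892,
    "high": -0.3245293,
}

steps = {}
previous = 0.0  # the base category has an effect of zero by construction
for level, effect in effect_vs_base.items():
    steps[level] = effect - previous  # change relative to the level just below
    previous = effect
    print(f"{level:>8}: vs. base = {effect:.4f}, vs. previous level = {steps[level]:.4f}")
```

Moving from low to medium, for example, is associated with a further decrease of roughly 0.07, not 0.19. The incremental steps add up to the effect of high relative to the base category.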

### Tell me more

Read more about fractional response and beta regression models in the Stata Base Reference Manual; see [R] fracreg and [R] betareg.