Note: This FAQ is for users of Stata 8 and earlier versions. It does not
apply to Stata 9, where nlogit runs on datasets with unbalanced panels.
Why do I get an "unbalanced data" error message when I run nlogit?
Title:   Nested logit models
Author:  Gustavo Sanchez, StataCorp
Date:    October 2004
The data for nlogit must be laid out such that, for each observation,
there is a record for each choice at the terminal nodes of the tree. If your
data is not laid out this way, you will get the "unbalanced data" error.
Let's say that my tree looks like this:
              .
             / \
            /   \
      _l__ /     \__r_
     / | \        /  \
    /  |  \      /    \
   /   |   \    /      \
  d    e    f  g        h
Suppose the group variable is named id and the response y. I could create a
new variable, calling it mid, that labels the mid-level nodes using the
nlogitgen command:
. nlogitgen mid = leaf(l:d|e|f, r:g|h)
Thus the first two observations of the data might look something like this:
    rec   id   mid   leaf   y   x1    x2     x3
      1    1    l     d     0    1   2.3   -1.0
      2    1    l     e     0    1   3.3   -1.1
      3    1    l     f     1    0   4.5     .
      4    1    r     g     0    0   1.3   -2.3
      5    1    r     h     0    0   5.5   -1.7
      6    2    l     d     0    1   1.2   -2.0
      7    2    l     e     0    1   4.0   -0.7
      8    2    l     f     0    1   2.0   -1.0
      9    2    r     g     1    .   5.1   -0.9
     10    2    r     h     0    0   6.1   -0.8
      .    .    .
Here each observation consumes 5 records since there are 5 leaf nodes.
If I have three covariates, x1-x3, the call to nlogit might look like this:
. nlogit y (leaf = x1 x2) (mid = x3), group(id)
The most frequent cause of the unbalanced-data error is missing values in
your covariates, as demonstrated in the data listing above. Variable x3 is
missing in record 3, and variable x1 is missing in record 9. nlogit drops
those records from the analysis, so neither id still has a record for every
leaf, and the data become unbalanced.
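A quick way to see which groups are affected is to flag the records with a missing covariate and count the usable records per id. The sketch below is only an illustration: it types in the ten records from the listing above, and the names anymiss and nuse are hypothetical helper variables, not part of nlogit.

```stata
clear
input rec id str1 mid str1 leaf y x1 x2 x3
 1 1 l d 0 1 2.3 -1.0
 2 1 l e 0 1 3.3 -1.1
 3 1 l f 1 0 4.5 .
 4 1 r g 0 0 1.3 -2.3
 5 1 r h 0 0 5.5 -1.7
 6 2 l d 0 1 1.2 -2.0
 7 2 l e 0 1 4.0 -0.7
 8 2 l f 0 1 2.0 -1.0
 9 2 r g 1 . 5.1 -0.9
10 2 r h 0 0 6.1 -0.8
end

* Flag records in which any covariate is missing (hypothetical helper)
generate byte anymiss = missing(x1, x2, x3)

* Running count of usable records within id, then keep the group total
bysort id (rec): generate nuse = sum(!anymiss)
by id: replace nuse = nuse[_N]

* Groups with fewer than 5 usable records no longer cover every leaf
list rec id leaf x1 x2 x3 anymiss if nuse < 5, sepby(id)
```

In this listing both id 1 and id 2 end up with 4 usable records instead of 5, so both groups are incomplete.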
The following examples use the dataset "restaurant" from the StataCorp website.
The examples refer to the tree structure below, which implies a first-level
choice among having dinner at a fast food restaurant, at a family restaurant,
or at a fancy restaurant. Once the type of restaurant is selected, the bottom
level corresponds to the final decision about the specific restaurant chosen.
                      Dining
                    /   |   \
                   /    |    \
                  /     |     \
                 /      |      \
        Fast Food     Family     Fancy
          /  \        / | \       /  \
         /    \      /  |  \     /    \
        M      F    W   L   C   C      M
        P      B    M   N   E          C
The code below reproduces the example in [R] nlogit. The
middle variable and a set of explanatory variables are generated, and then the
nested logit model is estimated:
clear
webuse restaurant
nlogitgen type=restaurant(Fast:Freebirds|MamasPizza, ///
Family:CafeEccell|LosNortenos|WingsNmore, ///
Fancy: Christophers|MadCows)
gen incFast =(type==1)*income
gen incFancy =(type==3)*income
gen kidFast =(type==1)*kids
gen kidFancy =(type==3)*kids
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
top --> bottom
type restaurant
--------------------------
Fast Freebirds
MamasPizza
Family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
Nested logit estimates
Levels = 2 Number of obs = 2100
Dependent variable = chosen LR chi2(10) = 199.6293
Log likelihood = -483.9584 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant |
cost | -.0944352 .03402 -2.78 0.006 -.1611131 -.0277572
rating | .1793759 .126895 1.41 0.157 -.0693338 .4280855
distance | -.1745797 .0433352 -4.03 0.000 -.2595152 -.0896443
-------------+----------------------------------------------------------------
type |
incFast | -.0287502 .0116242 -2.47 0.013 -.0515332 -.0059672
incFancy | .0458373 .0089109 5.14 0.000 .0283722 .0633024
kidFast | -.0704164 .1394359 -0.51 0.614 -.3437058 .2028729
kidFancy | -.3626381 .1171277 -3.10 0.002 -.5922041 -.1330721
-------------+----------------------------------------------------------------
(incl. value |
parameters) |
type |
/fast | 5.715758 2.332871 2.45 0.014 1.143415 10.2881
/family | 1.721222 1.152002 1.49 0.135 -.5366608 3.979105
/Fancy | 1.466588 .4169075 3.52 0.000 .6494642 2.283711
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(3)= 9.90 Prob > chi2 = 0.0194
------------------------------------------------------------------------------
Using this nlogit model as the base for comparison, let's
modify the data and check whether a problem arises in the
estimation.
First, let's erase the information on the explanatory variable rating for four
families:
replace rating=. if family_id==65 | family_id==146 | ///
family_id==220 | family_id==285
Then, we will estimate the same nested logit model that we estimated above:
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
tree structure specified for the nested logit model
top --> bottom
type restaurant
--------------------------
fast Freebirds
MamasPizza
family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
Nested logit estimates
Levels = 2 Number of obs = 2072
Dependent variable = chosen LR chi2(10) = 199.7439
Log likelihood = -476.11744 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant |
cost | -.089669 .031975 -2.80 0.005 -.1523389 -.0269991
rating | .1585362 .1275017 1.24 0.214 -.0913624 .4084349
distance | -.1712953 .0425201 -4.03 0.000 -.2546332 -.0879575
-------------+----------------------------------------------------------------
type |
incFast | -.0305267 .0120644 -2.53 0.011 -.0541724 -.006881
incFancy | .0452898 .0088836 5.10 0.000 .0278783 .0627014
kidFast | -.0750763 .1422322 -0.53 0.598 -.3538463 .2036938
kidFancy | -.3617426 .116892 -3.09 0.002 -.5908467 -.1326386
-------------+----------------------------------------------------------------
(incl. value |
parameters) |
type |
/fast | 6.228289 2.541234 2.45 0.014 1.247562 11.20902
/family | 1.759751 1.185935 1.48 0.138 -.5646376 4.08414
/Fancy | 1.479319 .4055198 3.65 0.000 .6845151 2.274124
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(3)= 10.44 Prob > chi2 = 0.0151
------------------------------------------------------------------------------
The sample size is now lower by 28 observations (2,100 versus 2,072): each of
the four families whose rating values were erased contributes seven records,
one per restaurant, and all of them are dropped. A similar situation occurs
if we eliminate the information on the dependent variable for a group of
families; nlogit drops those families from the estimation sample.
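The drop in sample size can be checked directly from the data. This is a minimal sketch that assumes the restaurant data are still in memory right after the replace command above.

```stata
* Assumes the restaurant data are in memory after the -replace- above.
* Count the records with rating missing; with four families of seven
* records each, this should report 28.
count if missing(rating)

* Number of whole families this corresponds to (7 records per family)
display r(N) / 7
```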
However, if we erase the rating information for some families only for part
of the bottom level (here, the restaurants under the Fancy branch), we get the
unbalanced-data error, because we are effectively changing the design by
stating that those families will not reach part of the bottom level. Look at
the code below:
replace rating=. if family_id==25 & type==3 | ///
family_id==50 & type==3 | ///
family_id==75 & type==3
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
tree structure specified for the nested logit model
top --> bottom
type restaurant
--------------------------
Fast Freebirds
MamasPizza
Family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
unbalanced data
r(459);
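When this error appears, it can help to locate the families that have some, but not all, of their records excluded. The sketch below assumes the restaurant data are in memory after the replace command above; bad, nrec, and nbad are hypothetical helper variables. The final drop is one possible workaround rather than a recommendation, since discarding families changes the estimation sample.

```stata
* Assumes the restaurant data are in memory after the -replace- above.
* Flag records with a missing value in any variable the model uses
* (income and kids stand in for the inc* and kid* interactions).
generate byte bad = missing(chosen, cost, rating, distance, income, kids)

* Per family: total number of records and number of flagged records
bysort family_id: generate nrec = _N
by family_id: generate nbad = sum(bad)
by family_id: replace nbad = nbad[_N]

* Families with some, but not all, records flagged are the incomplete ones
list family_id restaurant rating if nbad > 0 & nbad < nrec, sepby(family_id)

* Possible workaround (an assumption, not part of the original FAQ):
* remove those families entirely so every remaining group is complete
drop if nbad > 0 & nbad < nrec
```

Families whose records are all flagged are left alone here, because nlogit already drops them as whole groups.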
For the final example, we erase the rating information for every record under
one branch of the tree (Fancy). In this case nlogit performs the estimation
because the dataset now corresponds to a new design without that branch. See
the example below:
replace rating=. if type==3
This implies that the correct design is now
                 Dining
                /      \
               /        \
              /          \
             /            \
        Fast Food        Family
          /  \           / | \
         /    \         /  |  \
        M      F       W   L   C
        P      B       M   N   E
In this case, using nlogit is valid again.
nlogit chosen (restaurant=cost rating distance) ///
(type= incFast incFancy kidFast kidFancy), ///
group(family_id) nolog
tree structure specified for the nested logit model
top --> bottom
type restaurant
--------------------------
Fast Freebirds
MamasPizza
Family CafeEccell
LosNorte~s
WingsNmore
Fancy Christop~s
MadCows
note: 51 groups (255 obs) dropped due to no positive outcome
or multiple positive outcomes per group
note: incFancy omitted due to no within-group variance
note: kidFancy omitted due to no within-group variance
Nested logit estimates
Levels = 2 Number of obs = 1245
Dependent variable = chosen LR chi2(7) = 125.731
Log likelihood = -337.88453 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant |
cost | -.105763 .0469251 -2.25 0.024 -.1977345 -.0137915
rating | .1706296 .1425147 1.20 0.231 -.108694 .4499533
distance | -.1556858 .0606158 -2.57 0.010 -.2744905 -.0368811
-------------+----------------------------------------------------------------
type |
incFast | -.0289775 .012192 -2.38 0.017 -.0528734 -.0050816
kidFast | -.0774806 .1462401 -0.53 0.596 -.3641059 .2091447
-------------+----------------------------------------------------------------
(incl. value |
parameters) |
type |
/Fast | 5.702476 2.985886 1.91 0.056 -.1497534 11.55471
/Family | 1.958308 2.057901 0.95 0.341 -2.075104 5.99172
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(2)= 6.86 Prob > chi2 = 0.0324
------------------------------------------------------------------------------
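The estimation sample is now 1,245 observations. The 600 Fancy records (two
per family) are excluded because rating is missing, and, as the note in the
output reports, a further 255 observations (51 families with 5 remaining
records each) are dropped because those families chose a Fancy restaurant and
so have no positive outcome among the remaining alternatives. The estimation
is still performed because no information is missing for the other two
branches.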
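The counts behind the reported sample size can be reproduced from the data. This is a sketch that assumes the restaurant data and the type variable created by nlogitgen above are in memory, with type==3 identifying the Fancy branch as in the generate commands earlier.

```stata
* Assumes the restaurant data and -type- from nlogitgen are in memory.
* Records under the Fancy branch (these have rating set to missing)
count if type == 3

* Families whose chosen restaurant is a Fancy one; this should match the
* number of groups the note reports as dropped for no positive outcome
count if chosen == 1 & type == 3

* Remaining sample: all records, minus Fancy records, minus dropped groups
display 2100 - 600 - 255
```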
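For the first example, if leaf is stored as a numeric variable coded 1 to 5,
value labels matching the terminal nodes d through h can be attached with
label define leaf_lbl 1 "d" 2 "e" 3 "f" 4 "g" 5 "h"
label values leaf leaf_lbl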