Note: This FAQ is for users of Stata 8 and older versions of Stata. It is not relevant for Stata 9 because nlogit in Stata 9 runs on datasets with unbalanced panels.

### Why do I get an "unbalanced data" error message when I run nlogit?

Title: Nested logit models
Author: Gustavo Sanchez, StataCorp
Date: October 2004

The data for nlogit must be laid out such that, for each observation, there is a record for each choice at the terminal nodes of the tree. If your data is not laid out this way, you will get the "unbalanced data" error.

Let's say that my tree looks like this:

                  .
                 / \
                /   \
          __l__/     \__r__
          / | \        /  \
         /  |  \      /    \
        /   |   \    /      \
       d    e    f  g        h

Suppose the group variable is named id and the response y. I could create a new variable, calling it mid, that labels the mid-level nodes using the nlogitgen command:
  . nlogitgen mid = leaf(l:d|e|f, r:g|h)

Thus the first two observations of the data might look something like this:
  rec    id    mid    leaf(1)    y    x1     x2      x3
    1     1     l        d       0     1    2.3    -1.0
    2     1     l        e       0     1    3.3    -1.1
    3     1     l        f       1     0    4.5       .
    4     1     r        g       0     0    1.3    -2.3
    5     1     r        h       0     0    5.5    -1.7
    6     2     l        d       0     1    1.2    -2.0
    7     2     l        e       0     1    4.0    -0.7
    8     2     l        f       0     1    2.0    -1.0
    9     2     r        g       1     .    5.1    -0.9
   10     2     r        h       0     0    6.1    -0.8
    .     .     .

Here each observation consumes 5 records since there are 5 leaf nodes.

If I have three covariates, x1-x3, the call to nlogit might look like this:

  . nlogit y (leaf = x1 x2) (mid = x3), group(id)

The most frequent cause of the unbalanced-data error is missing values in your covariates, as demonstrated in the data listing above. Variables x3 and x1 have missing values in records 3 and 9, respectively. nlogit drops those records from the analysis, leaving each of the two observations with only four of its five records and thereby making the data incomplete.
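One way to locate the offending records before calling nlogit is to flag incomplete records and count the usable ones in each group. The following is a minimal sketch that re-enters the ten records listed above; the names leafname, incomplete, and nusable are illustrative and are not part of nlogit.

```stata
clear
input rec id str1 mid str1 leafname y x1 x2 x3
 1 1 l d 0 1 2.3 -1.0
 2 1 l e 0 1 3.3 -1.1
 3 1 l f 1 0 4.5    .
 4 1 r g 0 0 1.3 -2.3
 5 1 r h 0 0 5.5 -1.7
 6 2 l d 0 1 1.2 -2.0
 7 2 l e 0 1 4.0 -0.7
 8 2 l f 0 1 2.0 -1.0
 9 2 r g 1 . 5.1 -0.9
10 2 r h 0 0 6.1 -0.8
end

* flag records that have a missing value in any model variable
gen byte incomplete = missing(y, x1, x2, x3)

* count the usable records in each group (running sum, then keep the last value)
bysort id (rec): gen nusable = sum(!incomplete)
by id: replace nusable = nusable[_N]

* show the incomplete records and the groups with fewer than 5 usable records
list rec id leafname x1 x3 if incomplete
tabulate id if nusable < 5
```

Any group that appears in the final tabulation has fewer usable records than there are leaf nodes, which is the situation that triggers the error.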

The following examples use the dataset "restaurant" from the StataCorp website. The examples refer to the tree structure below, which implies a first-level choice among having dinner at a fast food restaurant, at a family restaurant, or at a fancy restaurant. Once the type of restaurant is selected, the bottom level corresponds to the final decision about the specific restaurant chosen.

                        Dining
                      /   |   \
                   /      |      \
                /         |         \
             /            |            \
        Fast Food       Family        Fancy
          /  \          / | \         /  \
         /    \        /  |  \       /    \
        /      \      /   |   \     /      \
       M        F    W    L    C   C        M
       P        B    M    N    E            C

The code below reproduces the example in [R] nlogit. The middle variable and a set of explanatory variables are generated, and then the nested logit model is estimated:
  clear
  webuse restaurant
  * define the middle-level variable: each branch groups a set of restaurants
  nlogitgen type=restaurant(Fast:Freebirds|MamasPizza,   ///
          Family:CafeEccell|LosNortenos|WingsNmore,      ///
          Fancy:Christophers|MadCows)
  * interact family characteristics with the restaurant type
  gen incFast  = (type==1)*income
  gen incFancy = (type==3)*income
  gen kidFast  = (type==1)*kids
  gen kidFancy = (type==3)*kids
  nlogit chosen (restaurant=cost rating distance)        ///
          (type= incFast incFancy kidFast kidFancy),     ///
          group(family_id) nolog

top --> bottom

     type    restaurant
--------------------------
     Fast     Freebirds
             MamasPizza
   Family    CafeEccell
             LosNorte~s
             WingsNmore
    Fancy    Christop~s
                MadCows

Nested logit estimates
Levels             =          2                 Number of obs      =      2100
Dependent variable =     chosen                 LR chi2(10)        =  199.6293
Log likelihood     =  -483.9584                 Prob > chi2        =    0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant   |
        cost |  -.0944352     .03402    -2.78   0.006    -.1611131   -.0277572
      rating |   .1793759    .126895     1.41   0.157    -.0693338    .4280855
    distance |  -.1745797   .0433352    -4.03   0.000    -.2595152   -.0896443
-------------+----------------------------------------------------------------
type         |
     incFast |  -.0287502   .0116242    -2.47   0.013    -.0515332   -.0059672
    incFancy |   .0458373   .0089109     5.14   0.000     .0283722    .0633024
     kidFast |  -.0704164   .1394359    -0.51   0.614    -.3437058    .2028729
    kidFancy |  -.3626381   .1171277    -3.10   0.002    -.5922041   -.1330721
-------------+----------------------------------------------------------------
(incl. value |
 parameters) |
type         |
       /fast |   5.715758   2.332871     2.45   0.014     1.143415     10.2881
     /family |   1.721222   1.152002     1.49   0.135    -.5366608    3.979105
      /Fancy |   1.466588   .4169075     3.52   0.000     .6494642    2.283711
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(3)=    9.90    Prob > chi2 = 0.0194
------------------------------------------------------------------------------

Using this nlogit model as the base for comparison, let's modify the data and check whether a problem arises in the estimation.

First, let's erase the information on the explanatory variable rating for four families:

  replace rating=.  if family_id==65  | family_id==146 |    ///
family_id==220 | family_id==285

Then, we will estimate the same nested logit model that we estimated above:
  nlogit chosen (restaurant=cost rating distance)           ///
(type= incFast incFancy kidFast kidFancy),  ///
group(family_id) nolog

tree structure specified for the nested logit model

top --> bottom

     type    restaurant
--------------------------
     fast     Freebirds
             MamasPizza
   family    CafeEccell
             LosNorte~s
             WingsNmore
    Fancy    Christop~s
                MadCows

Nested logit estimates
Levels             =          2                 Number of obs      =      2072
Dependent variable =     chosen                 LR chi2(10)        =  199.7439
Log likelihood     = -476.11744                 Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant   |
        cost |   -.089669    .031975    -2.80   0.005    -.1523389   -.0269991
      rating |   .1585362   .1275017     1.24   0.214    -.0913624    .4084349
    distance |  -.1712953   .0425201    -4.03   0.000    -.2546332   -.0879575
-------------+----------------------------------------------------------------
type         |
     incFast |  -.0305267   .0120644    -2.53   0.011    -.0541724    -.006881
    incFancy |   .0452898   .0088836     5.10   0.000     .0278783    .0627014
     kidFast |  -.0750763   .1422322    -0.53   0.598    -.3538463    .2036938
    kidFancy |  -.3617426    .116892    -3.09   0.002    -.5908467   -.1326386
-------------+----------------------------------------------------------------
(incl. value |
 parameters) |
type         |
       /fast |   6.228289   2.541234     2.45   0.014     1.247562    11.20902
     /family |   1.759751   1.185935     1.48   0.138    -.5646376     4.08414
      /Fancy |   1.479319   .4055198     3.65   0.000     .6845151    2.274124
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(3)=   10.44    Prob > chi2 = 0.0151
------------------------------------------------------------------------------

We see that the sample size is now lower by 28 observations: each of the four families whose rating values were erased has seven records with missing values (4 × 7 = 28). Because every record of those families is dropped, the remaining groups are still complete and the estimation proceeds. A similar situation occurs if we erase the dependent variable for a group of families; nlogit drops those families from the estimation sample.
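Whether a set of missing values will leave the data balanced can be checked before estimation. The sketch below assumes the restaurant data as loaded by webuse, with seven records per family as described above; nmiss is an illustrative variable name.

```stata
webuse restaurant, clear
replace rating = . if inlist(family_id, 65, 146, 220, 285)

* number of records with a missing rating in each family
bysort family_id: gen nmiss = sum(missing(rating))
by family_id: replace nmiss = nmiss[_N]

* total number of records with a missing rating
count if missing(rating)

* the data stay balanced only if each family has either no missing
* records or all of its records missing
by family_id: assert nmiss == 0 | nmiss == _N
```

If the final assert fails, at least one family is only partly missing, and the data will not be balanced once those records are dropped.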

However, if we erase rating for some families, but this time only for the records under one branch rather than for all seven records, we get the unbalanced-data error. nlogit drops the incomplete records, so those families no longer have a record for every terminal node; we are effectively changing the design by stating that some individuals cannot reach part of the bottom level. Look at the code below:

  replace rating=.   if (family_id==25 & type==3) |        ///
                        (family_id==50 & type==3) |        ///
                        (family_id==75 & type==3)

nlogit chosen (restaurant=cost rating distance)           ///
(type= incFast incFancy kidFast kidFancy),  ///
group(family_id) nolog

tree structure specified for the nested logit model

top --> bottom

     type    restaurant
--------------------------
     Fast     Freebirds
             MamasPizza
   Family    CafeEccell
             LosNorte~s
             WingsNmore
    Fancy    Christop~s
                MadCows
unbalanced data
r(459);
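
One possible remedy, if excluding the affected families is acceptable for the analysis, is to drop every record of a family that has any missing covariate so that the remaining groups are complete. The following is a minimal sketch on a small made-up dataset; the variables alt, bad, and anybad and all data values are hypothetical.

```stata
clear
input family_id alt chosen rating
1 1 1 3
1 2 0 2
1 3 0 4
2 1 0 1
2 2 1 .
2 3 0 5
3 1 0 2
3 2 0 3
3 3 1 4
end

* flag records with a missing covariate
gen byte bad = missing(rating)

* mark the whole family if any of its records is flagged
bysort family_id: egen byte anybad = max(bad)

* drop every record of an affected family; the remaining groups are complete
drop if anybad
```

Here family 2 is removed as a whole, which mirrors what nlogit does by itself when all records of a family are missing.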

For the final example, we erase rating for every record under one branch (type==3, the Fancy restaurants). In this case nlogit performs the estimation because the dataset now corresponds to a new design without that branch. See the example below:
  replace rating=. if type==3

This implies that the correct design is now
                    Dining
                    /   \
                   /     \
                  /       \
                 /         \
            Fast Food     Family
               /  \        / | \
              /    \      /  |  \
             /      \    /   |   \
            M        F  W    L    C
            P        B  M    N    E

In this case, using nlogit is valid again.
  nlogit chosen (restaurant=cost rating distance)           ///
(type= incFast incFancy kidFast kidFancy),  ///
group(family_id) nolog

  tree structure specified for the nested logit model

top --> bottom

     type    restaurant
--------------------------
     Fast     Freebirds
             MamasPizza
   Family    CafeEccell
             LosNorte~s
             WingsNmore
    Fancy    Christop~s
                MadCows
note: 51 groups (255 obs) dropped due to no positive outcome
or multiple positive outcomes per group
note: incFancy omitted due to no within-group variance
note: kidFancy omitted due to no within-group variance

Nested logit estimates
Levels             =          2                 Number of obs      =      1245
Dependent variable =     chosen                 LR chi2(7)         =   125.731
Log likelihood     = -337.88453                 Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant   |
        cost |   -.105763   .0469251    -2.25   0.024    -.1977345   -.0137915
      rating |   .1706296   .1425147     1.20   0.231     -.108694    .4499533
    distance |  -.1556858   .0606158    -2.57   0.010    -.2744905   -.0368811
-------------+----------------------------------------------------------------
type         |
     incFast |  -.0289775    .012192    -2.38   0.017    -.0528734   -.0050816
     kidFast |  -.0774806   .1462401    -0.53   0.596    -.3641059    .2091447
-------------+----------------------------------------------------------------
(incl. value |
 parameters) |
type         |
       /Fast |   5.702476   2.985886     1.91   0.056    -.1497534    11.55471
     /Family |   1.958308   2.057901     0.95   0.341    -2.075104     5.99172
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(2)=    6.86    Prob > chi2 = 0.0324
------------------------------------------------------------------------------

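The records for the Fancy restaurants are now excluded because rating is missing. In addition, as the notes in the output report, 255 observations (51 families) are dropped because those groups no longer contain a positive outcome, presumably because their chosen restaurant was in the Fancy branch. The estimation is still performed because no information is missing for the other two branches.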
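
The observation count in this output can be reconciled with simple arithmetic. This is a sketch that assumes the example starts from the original 2,100 records (300 families with 7 records each), of which 2 records per family belong to the Fancy branch:

```stata
* 300 families x 7 records, minus the 2 Fancy records per family,
* minus the 51 dropped families x 5 remaining records each
display 300*7 - 300*2 - 51*5
```

The result, 1245, matches the Number of obs reported above.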

(1) Note that the listing shows the value labels of the leaf variable, not its values. The underlying values of the leaf variable would be 1 2 3 4 5 1 2 3 4 5, so you need to define the value label and attach it to the leaf variable:

  label define leaf_lbl 1 "d" 2 "e" 3 "f" 4 "g" 5 "h"
label values leaf leaf_lbl