Note: This FAQ is for users of Stata 8 and older versions of Stata.

It is not relevant for Stata 9 since nlogit in Stata 9 runs on datasets with unbalanced panels.

Why do I get an "unbalanced data" error message when I run nlogit?

Title   Nested logit models
Author Gustavo Sanchez, StataCorp

The data for nlogit must be laid out such that, for each observation, there is a record for each choice at the terminal nodes of the tree. If your data is not laid out this way, you will get the "unbalanced data" error.

Let's say that my tree looks like this:

                                      .
                                     / \
                                    /   \
                              _l__ /     \__r_
                             / | \         /  \
                            /  |  \       /    \
                           /   |   \     /      \ 
                          d    e    f   g        h
Suppose the group variable is named id and the response y. I could create a new variable, calling it mid, that labels the mid-level nodes using the nlogitgen command:
  . nlogitgen mid = leaf(l:d|e|f, r:g|h)
Thus the first two observations of the data might look something like this:
  rec      id     mid    leaf[1] y      x1     x2      x3
   1       1      l      d      0      1      2.3     -1.0 
   2       1      l      e      0      1      3.3     -1.1
   3       1      l      f      1      0      4.5       .
   4       1      r      g      0      0      1.3     -2.3
   5       1      r      h      0      0      5.5     -1.7
   6       2      l      d      0      1      1.2     -2.0
   7       2      l      e      0      1      4.0     -0.7
   8       2      l      f      0      1      2.0     -1.0
   9       2      r      g      1      .      5.1     -0.9
  10       2      r      h      0      0      6.1     -0.8
                               .      .      .
Here each observation consumes 5 records since there are 5 leaf nodes.

If I have three covariates, x1-x3, the call to nlogit might look like this:

  . nlogit y (leaf = x1 x2) (mid = x3), group(id)
The most frequent cause of the unbalanced-data error is missing values in your covariates, as demonstrated in the data listing above. Variables x3 and x1 have missing values for records 3 and 9, respectively. nlogit drops those records from the analysis, leaving each of the two observations with only four of its five records and thereby making the data unbalanced.
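One way to spot the problem before calling nlogit is to count the complete records within each group. The following is a minimal sketch and not part of the original example: the variable names complete and n_complete are made up for illustration, the data are the ten records listed above, and leaf is entered as its numeric codes 1 to 5.

```stata
* Sketch only: "complete" and "n_complete" are hypothetical helper variables
clear
input id leaf y x1 x2 x3
1 1 0 1 2.3 -1.0
1 2 0 1 3.3 -1.1
1 3 1 0 4.5 .
1 4 0 0 1.3 -2.3
1 5 0 0 5.5 -1.7
2 1 0 1 1.2 -2.0
2 2 0 1 4.0 -0.7
2 3 0 1 2.0 -1.0
2 4 1 . 5.1 -0.9
2 5 0 0 6.1 -0.8
end

* a record is complete when none of the model variables is missing
gen byte complete = !missing(y, x1, x2, x3)

* running sum within id, then spread the group total to every record
bysort id: gen n_complete = sum(complete)
by id: replace n_complete = n_complete[_N]

* groups with some, but not all, of their 5 records complete are unbalanced
list id leaf x1 x2 x3 if n_complete > 0 & n_complete < 5
```

Groups in which every record is incomplete are left out of the listing on purpose, since nlogit drops such groups entirely without raising the error.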

The following examples use the dataset "restaurant" from the StataCorp website. The examples refer to the tree structure below, which implies a first-level choice among having dinner at a fast food restaurant, a family restaurant, or a fancy restaurant. Once the type of restaurant is selected, the bottom level corresponds to the final decision about the specific restaurant.

                                      Dining
                                      / | \
                                    /   |   \
                                  /     |     \
                                /       |       \
                              /         |         \
                            /           |           \
                        Fast Food     Family       Fancy
                          /  \        / | \        /  \
                         /    \      /  |  \      /    \
                        /      \    /   |   \    /      \ 
                       M        F  W    L    C  C        M
                       P        B  M    N    E           C
The code below reproduces the example in [CM] nlogit. The middle variable and a set of explanatory variables are generated, and then the nested logit model is estimated:
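  clear
  webuse restaurant
  * define the first-level variable type: 1 = Fast, 2 = Family, 3 = Fancy
  nlogitgen type=restaurant(Fast:Freebirds|MamasPizza,    ///
               Family:CafeEccell|LosNortenos|WingsNmore,  ///
               Fancy: Christophers|MadCows)
  * interact income and kids with the Fast and Fancy branches
  gen incFast =(type==1)*income
  gen incFancy =(type==3)*income
  gen kidFast =(type==1)*kids
  gen kidFancy =(type==3)*kids
  * bottom-level equation for restaurant, first-level equation for type
  nlogit chosen (restaurant=cost rating distance)         ///
         (type= incFast incFancy kidFast kidFancy),       ///
          group(family_id) nolog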
  
   
  top --> bottom
  
          type    restaurant  
  --------------------------
          Fast     Freebirds  
                  MamasPizza  
        Family    CafeEccell  
                  LosNorte~s  
                  WingsNmore  
         Fancy    Christop~s  
                     MadCows  
  
  Nested logit estimates
  Levels             =          2                 Number of obs      =      2100
  Dependent variable =     chosen                 LR chi2(10)        =  199.6293
  Log likelihood     =  -483.9584                 Prob > chi2        =    0.0000
  ------------------------------------------------------------------------------
               |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
  restaurant   |
          cost |  -.0944352     .03402    -2.78   0.006    -.1611131   -.0277572
        rating |   .1793759    .126895     1.41   0.157    -.0693338    .4280855
      distance |  -.1745797   .0433352    -4.03   0.000    -.2595152   -.0896443
  -------------+----------------------------------------------------------------
  type         |
       incFast |  -.0287502   .0116242    -2.47   0.013    -.0515332   -.0059672
      incFancy |   .0458373   .0089109     5.14   0.000     .0283722    .0633024
       kidFast |  -.0704164   .1394359    -0.51   0.614    -.3437058    .2028729
      kidFancy |  -.3626381   .1171277    -3.10   0.002    -.5922041   -.1330721
  -------------+----------------------------------------------------------------
  (incl. value |
   parameters) |
  type         |
         /fast |   5.715758   2.332871     2.45   0.014     1.143415     10.2881
       /family |   1.721222   1.152002     1.49   0.135    -.5366608    3.979105
        /Fancy |   1.466588   .4169075     3.52   0.000     .6494642    2.283711
  ------------------------------------------------------------------------------
  LR test of homoskedasticity (iv = 1): chi2(3)=    9.90    Prob > chi2 = 0.0194
  ------------------------------------------------------------------------------
Using this nlogit model as the base for comparison, let's modify the data and check whether a problem arises in the estimation.

First, let's erase the information on the explanatory variable rating for four families:

  replace rating=.  if family_id==65  | family_id==146 |    ///
                     family_id==220 | family_id==285
Then, we will estimate the same nested logit model that we estimated above:
  nlogit chosen (restaurant=cost rating distance)           ///
                (type= incFast incFancy kidFast kidFancy),  ///
                 group(family_id) nolog

  tree structure specified for the nested logit model

  top --> bottom

          type    restaurant  
  --------------------------
          fast     Freebirds  
                  MamasPizza  
        family    CafeEccell  
                  LosNorte~s  
                  WingsNmore  
         Fancy    Christop~s  
                     MadCows  
  
  Nested logit estimates
  Levels             =          2                 Number of obs      =      2072
  Dependent variable =     chosen                 LR chi2(10)        =  199.7439
  Log likelihood     = -476.11744                 Prob > chi2        =    0.0000
  
  ------------------------------------------------------------------------------
               |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
  restaurant   |
          cost |   -.089669    .031975    -2.80   0.005    -.1523389   -.0269991
        rating |   .1585362   .1275017     1.24   0.214    -.0913624    .4084349
      distance |  -.1712953   .0425201    -4.03   0.000    -.2546332   -.0879575
  -------------+----------------------------------------------------------------
  type         |
       incFast |  -.0305267   .0120644    -2.53   0.011    -.0541724    -.006881
      incFancy |   .0452898   .0088836     5.10   0.000     .0278783    .0627014
       kidFast |  -.0750763   .1422322    -0.53   0.598    -.3538463    .2036938
      kidFancy |  -.3617426    .116892    -3.09   0.002    -.5908467   -.1326386
  -------------+----------------------------------------------------------------
  (incl. value |
   parameters) |
  type         |
         /fast |   6.228289   2.541234     2.45   0.014     1.247562    11.20902
       /family |   1.759751   1.185935     1.48   0.138    -.5646376     4.08414
        /Fancy |   1.479319   .4055198     3.65   0.000     .6845151    2.274124
  ------------------------------------------------------------------------------
  LR test of homoskedasticity (iv = 1): chi2(3)=   10.44    Prob > chi2 = 0.0151
  ------------------------------------------------------------------------------
The sample size is now lower by 28 observations (2,100 - 2,072): each of the four families whose rating values were set to missing contributes seven records, and nlogit drops all of them. Because every record for those families is excluded, the remaining data are still balanced and no error occurs. A similar situation occurs if we eliminate the information on the dependent variable for a group of families; nlogit drops those families from the estimation sample.

However, if we eliminate the information on rating for some families, but this time only for part of their records (below, the records for the Fancy restaurants, type 3), we get the unbalanced-data error because we are effectively changing the design by stating that those families cannot reach some of the alternatives at the bottom level. Look at the code below:

  replace rating=.   if (family_id==25 & type==3) |     ///
                        (family_id==50 & type==3) |     ///
                        (family_id==75 & type==3)
        
  nlogit chosen (restaurant=cost rating distance)           ///
                (type= incFast incFancy kidFast kidFancy),  ///
                 group(family_id) nolog
        
  tree structure specified for the nested logit model
  
  top --> bottom
  
          type    restaurant  
  --------------------------
          Fast     Freebirds  
                  MamasPizza  
        Family    CafeEccell  
                  LosNorte~s  
                  WingsNmore  
         Fancy    Christop~s  
                     MadCows  
  unbalanced data
  r(459);
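To find out which families trigger the error, we can flag the groups in which some, but not all, records would be excluded. The code below is a hedged sketch rather than part of the original example: the variables incomplete, n_incomplete, and unbalanced are hypothetical names, and dropping the affected families at the end is only one possible remedy, assumed here for illustration.

```stata
* Sketch only: helper variable names are hypothetical
clear
webuse restaurant
nlogitgen type = restaurant(Fast: Freebirds | MamasPizza,   ///
             Family: CafeEccell | LosNortenos | WingsNmore, ///
             Fancy: Christophers | MadCows)

* reproduce the partial missingness from the example above
replace rating = . if (family_id==25 | family_id==50 | family_id==75) & type==3

* flag records that have a missing value in a model variable
gen byte incomplete = missing(chosen, cost, rating, distance)

* count the incomplete records within each family
bysort family_id: gen n_incomplete = sum(incomplete)
by family_id: replace n_incomplete = n_incomplete[_N]

* unbalanced: some, but not all, of the family's records are incomplete
by family_id: gen byte unbalanced = n_incomplete > 0 & n_incomplete < _N
tabulate family_id if unbalanced

* one possible remedy (an assumption, not a recommendation): drop those families
drop if unbalanced
```

Whether dropping whole families is appropriate depends on why the values are missing; the tabulate step alone is enough to locate the records that need attention.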
For the final example, we erase rating for every record of one first-level branch (the Fancy restaurants) in all families. In this case nlogit performs the estimation, since the dataset corresponds to a new design without that branch. See the example below:
  replace rating=. if type==3
This implies that the correct design is now
                                  Dining
                                  /   \
                                 /     \
                                /       \
                               /         \
                              /           \
                             /             \
                        Fast Food         Family       
                           /  \           / | \    	
                          /    \         /  |  \         
                         /      \       /   |   \       
                        M        F     W    L    C     
                        P        B     M    N    E    
In this case, using nlogit is valid again.
  nlogit chosen (restaurant=cost rating distance)           ///
                (type= incFast incFancy kidFast kidFancy),  ///
                 group(family_id) nolog
  tree structure specified for the nested logit model
  
          top --> bottom
  
          type    restaurant  
  --------------------------
          Fast     Freebirds  
                  MamasPizza  
        Family    CafeEccell  
                  LosNorte~s  
                  WingsNmore  
         Fancy    Christop~s  
                     MadCows  
  note: 51 groups (255 obs) dropped due to no positive outcome
        or multiple positive outcomes per group
  note: incFancy omitted due to no within-group variance
  note: kidFancy omitted due to no within-group variance
  
  Nested logit estimates
  Levels             =          2                 Number of obs      =      1245
  Dependent variable =     chosen                 LR chi2(7)         =   125.731
  Log likelihood     = -337.88453                 Prob > chi2        =    0.0000
  
  ------------------------------------------------------------------------------
               |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
  restaurant   |
          cost |   -.105763   .0469251    -2.25   0.024    -.1977345   -.0137915
        rating |   .1706296   .1425147     1.20   0.231     -.108694    .4499533
      distance |  -.1556858   .0606158    -2.57   0.010    -.2744905   -.0368811
  -------------+----------------------------------------------------------------
  type         |
       incFast |  -.0289775    .012192    -2.38   0.017    -.0528734   -.0050816
       kidFast |  -.0774806   .1462401    -0.53   0.596    -.3641059    .2091447
  -------------+----------------------------------------------------------------
  (incl. value |
   parameters) |
  type         |
         /Fast |   5.702476   2.985886     1.91   0.056    -.1497534    11.55471
       /Family |   1.958308   2.057901     0.95   0.341    -2.075104     5.99172
  ------------------------------------------------------------------------------
  LR test of homoskedasticity (iv = 1): chi2(2)=    6.86    Prob > chi2 = 0.0324
  ------------------------------------------------------------------------------
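The estimation sample falls from 2,100 to 1,245 observations. The records for the Fancy restaurants are excluded because rating is missing, and, as the note in the output reports, a further 255 observations (51 families) are dropped because those families no longer have exactly one positive outcome among their remaining records. The estimation is nevertheless performed, since no information is missing for the other two branches.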


[1] Note that the data listing shows the value labels of the leaf variable, not its values. The values of the leaf variable would be 1 2 3 4 5 1 2 3 4 5. Thus you need to define the label and assign it to the leaf variable:

  label define leaf_lbl 1 "d" 2 "e" 3 "f" 4 "g" 5 "h"
  label values leaf leaf_lbl