FAQ: Logistic regression with aggregated data


How can I do logistic regression or multinomial logistic regression with aggregated data?

Title		Logistic regression with aggregated data
Author		William Sribney, StataCorp

One way to do this is to first rearrange your data so you can use frequency weights (fweights) with the logistic, logit, or mlogit command.

For binary outcomes, one can also use glm with family(binomial varnameN) and link(logit), where varnameN is a variable that stores the total number of trials for each observation. However, rearranging the data for use with frequency weights also covers the more general case of multinomial outcomes.

It is easier to explain with an example. First, consider the following binary-outcome data:

 . list

        cases   total   x1   x2
   1.      23     123    0    0
   2.      12     234    0    1
   3.      56     248    1    0
   4.      81     390    1    1

In the above dataset, the variable cases contains the number of observations, out of total, that have a positive outcome. For example, the first row says that of the 123 observations with x1 = 0 and x2 = 0, 23 have a positive outcome and the remaining 100 (123 - 23) have a zero outcome.

To use logistic and logit with fweights, the data need to be rearranged so that there is one observation per outcome category within each combination of x1 and x2:

 . list, sep(0)

          w   y   x1   x2
   1.   100   0    0    0
   2.    23   1    0    0
   3.   222   0    0    1
   4.    12   1    0    1
   5.   192   0    1    0
   6.    56   1    1    0
   7.   309   0    1    1
   8.    81   1    1    1

In this dataset, y is the outcome and w is the frequency weight, that is, the number of original observations that each row represents.

You can then run commands such as

 . logit y x1 x2 [fw=w]

We could fit the same model using the glm command:

 . glm cases x1 x2, family(binomial total) link(logit)

This glm specification gives the same estimates as the logit command applied to the rearranged data. However, logit and logistic have the advantage that you can run postestimation commands such as estat gof afterward.
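The equivalence can be checked directly. The following is a minimal sketch, assuming it is run as a do-file; the matrix names b_glm and b_logit are illustrative and not part of the original example. It fits both models on the example data and reports the largest relative difference between the two coefficient vectors.

```stata
* Sketch: fit the model both ways and compare the coefficient vectors.
clear
input cases total x1 x2
23 123 0 0
12 234 0 1
56 248 1 0
81 390 1 1
end

* Aggregated form: binomial GLM with total as the number of trials
glm cases x1 x2, family(binomial total) link(logit)
matrix b_glm = e(b)

* Rearranged form: one row per covariate pattern and outcome
generate w0 = total - cases
rename cases w1
drop total
generate id = _n
reshape long w, i(id) j(y)
drop id
logit y x1 x2 [fw=w]
matrix b_logit = e(b)

* mreldif() returns the largest relative difference between two matrices;
* a value near zero indicates that the two fits agree
display mreldif(b_glm, b_logit)

* Postestimation command available after logit or logistic
estat gof
```

The comparison uses mreldif() rather than an exact equality test because the two commands may stop iterating at slightly different points.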

To rearrange the data from the first format to the second format, you can use the reshape command.

Here is how you do it for this example:

. input cases total x1 x2

         cases      total         x1         x2
  1. 23 123 0 0
  2. 12 234 0 1
  3. 56 248 1 0
  4. 81 390 1 1
  5. end

. 
. *rearrange
. generate w0 = total - cases

. drop total

. rename cases w1

. generate id=_n

. reshape long w, i(id) j(y)
(note: j = 0 1)

Data                               wide   ->   long
Number of obs.                        4   ->       8
Number of variables                   5   ->       5
j variable (2 values)                     ->   y
xij variables:
                                  w0 w1   ->   w

. drop id

The categories (0, 1, i.e., the suffixes of w) will appear in the variable y. The frequency weights will be given in the new variable w.
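For a binary outcome, the same layout can also be produced without reshape. The sketch below is an alternative to the method above, not a replacement for it: it uses expand to duplicate each row and then fills in the outcome and the weight. It assumes the original four-row dataset and is meant to be run as a do-file.

```stata
* Sketch: build the long layout with expand instead of reshape.
clear
input cases total x1 x2
23 123 0 0
12 234 0 1
56 248 1 0
81 390 1 1
end

* Tag each covariate pattern, then make two copies of every row
generate id = _n
expand 2

* Within each pattern, the first copy gets y = 0 and the second y = 1
bysort id: generate y = _n - 1

* Weight is the number of cases when y = 1, the number of noncases when y = 0
generate w = cond(y == 1, cases, total - cases)

drop id cases total
order w y x1 x2
list, sep(0)
```

This approach does not extend as directly to more than two outcome categories, which is one reason to prefer reshape in the general case.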

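You can then fit the model with, for example,

 . logit y x1 x2 [fw=w]

or, when the outcome has more than two categories and the data have been rearranged in the same way,

 . mlogit y <covariates> [fw=w]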
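For the multinomial case, the rearrangement might look like the following sketch. The counts here are made up purely for illustration and do not come from the example above; n1, n2, and n3 are hypothetical variable names holding the number of observations in outcome categories 1, 2, and 3 for each combination of x1 and x2.

```stata
* Sketch: multinomial outcome stored as one count variable per category.
* The data below are hypothetical and serve only to show the mechanics.
clear
input n1 n2 n3 x1 x2
30 45 25 0 0
20 50 40 0 1
35 30 20 1 0
15 40 60 1 1
end

* One row per covariate pattern and outcome category;
* the suffixes 1, 2, 3 of n become the values of y
generate id = _n
reshape long n, i(id) j(y)
drop id

* n now holds the frequency weight for each row
mlogit y x1 x2 [fw=n]
```

As in the binary case, the suffixes of the count variables become the values of the outcome variable, and the counts become the frequency weights.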

