Statalist The Stata Listserver



st: Estimating a three-way fixed effects model in a large data set


From   Thomas Cornelißen <cornelissen@ewifo.uni-hannover.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Estimating a three-way fixed effects model in a large data set
Date   Thu, 28 Sep 2006 11:22:49 +0200

Dear all,
I have written a Stata ado-file for estimating a fixed effects model with
person and firm effects in large linked employer-employee data sets.
(Or, more generally, a three-way error components model estimated by fixed
effects methods.)
If anyone would like to test the program, please write to my
email address and I'll be happy to share it.

The program works by now, but I call it a beta version because I hope to get
useful comments to improve it.
Best regards,
Thomas

For those who want to know more about it:
Last year I inquired on Statalist about ways to estimate a large number of
firm and person effects in a large linked employer-employee data set under
the thread "data set larger than RAM".

A common method is to include one of the effects (the firm effect) by
including dummy variables and to sweep out the other effect (the person
effect) by the within transformation (subtracting group means). The problem
in my case was that the number of groups (e.g. firms) was too high to create
a set of dummy variables, especially with many observations, because Stata
has to keep the whole data set in memory at the time of estimation.
For example, assuming that 4 bytes per element of the design matrix are
needed, 2 million observations and 1000 firms require about 8 gigabytes.
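
The arithmetic behind that figure can be checked with a few lines of Python.
This is only a back-of-the-envelope sketch: the function name `dense_bytes`
is made up for illustration, and the 4 bytes per element is the assumption
stated above, not a property of any particular Stata storage type.

```python
# Back-of-the-envelope memory estimate for a dense block of firm dummies.
BYTES_PER_ELEMENT = 4  # assumption taken from the text above


def dense_bytes(rows, cols, bytes_per_element=BYTES_PER_ELEMENT):
    """Bytes needed to hold a dense rows x cols matrix."""
    return rows * cols * bytes_per_element


n_obs = 2_000_000   # observations
n_firms = 1_000     # one dummy column per firm

x_bytes = dense_bytes(n_obs, n_firms)
print(f"firm dummies in X: {x_bytes / 1e9:.0f} GB")
```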

I got some very useful comments and developed the idea of writing a Mata
routine to handle the problem. The starting point of my idea was that the
design matrix X is of dimension (N x K) but that the cross product matrices
X'X and X'y are only of dimension (K x K) and (K x 1).
Therefore, if one can create X'X without needing to create X, then much
memory can be saved. In the present case, large parts of X are firm dummy
variables, and the information which observation belongs to which firm is
stored in the firm ID variable. Therefore it is possible to create X'X
without having to create all the firm dummies.
In the above example with 1000 firms, X'X requires only about 4 megabytes
(1000 x 1000 elements at 4 bytes each).
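
To illustrate the idea (not the actual Mata routine, which is not shown in
this post), here is a minimal Python sketch. It assumes X = [Z, D], where Z
holds the ordinary regressors and D is the matrix of firm dummies implied by
the firm ID variable. The function name `cross_products` and the toy data are
hypothetical. Each observation contributes to the Z'Z block, to one column of
Z'D, and to one diagonal element of D'D, so D itself is never built.

```python
def cross_products(z_rows, firm_ids, y):
    """Accumulate X'X and X'y for X = [Z, D] one observation at a time.

    D (the firm dummies) is never materialised; the firm ID of each
    observation tells us which row/column of X'X to update.
    """
    k = len(z_rows[0])                        # number of ordinary regressors
    firms = sorted(set(firm_ids))
    firm_col = {f: k + j for j, f in enumerate(firms)}
    p = k + len(firms)                        # total number of columns of X

    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p

    for z, f, yi in zip(z_rows, firm_ids, y):
        c = firm_col[f]                       # column of this firm's dummy
        for a in range(k):
            for b in range(k):
                xtx[a][b] += z[a] * z[b]      # Z'Z block
            xtx[a][c] += z[a]                 # Z'D block: sum of z by firm
            xtx[c][a] += z[a]                 # D'Z block (transpose)
            xty[a] += z[a] * yi               # Z'y
        xtx[c][c] += 1.0                      # D'D is diagonal: firm counts
        xty[c] += yi                          # D'y: sum of y by firm
    return xtx, xty


# Toy data: one regressor, four observations, three firms.
z_rows = [[2.0], [0.5], [3.0], [1.5]]
firm_ids = [10, 10, 20, 30]
y = [1.0, 2.0, 3.0, 4.0]

xtx, xty = cross_products(z_rows, firm_ids, y)
for row in xtx:
    print(row)
print(xty)
```

Only the (K x K) and (K x 1) results are held in memory, whatever the number
of observations.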

I wrote a routine in Mata which does exactly that and carries out the least
squares estimation on that basis. This is interesting in cases where X would
be too large to be kept in memory, but X'X is not too large to be kept in
memory and to be inverted.
Whereas Stata has a limit of 11,000 for the number of regressors, Mata
has virtually no limit for the size of the matrix, so that it is possible to
estimate more than 11,000 firm effects with my program.
The routine also takes care of how many firm effects are identified by
taking into account the pattern of movers between firms in the data.
A beta version is available to anyone interested in testing or using the
program.
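
For readers wondering how the pattern of movers determines identification,
here is one possible sketch in Python. It is not the method used in the Mata
routine, which is not described in detail here. It assumes the common
convention that firms linked by movers form connected groups and that one
firm effect per group is normalised (e.g. set to zero), so that the number of
identified firm effects equals the number of firms minus the number of
groups. The function name and the toy data are hypothetical.

```python
def count_identified_firm_effects(person_ids, firm_ids):
    """Return (number of firms, number of connected groups, identified effects).

    Persons and firms are nodes of a graph; each observation links a person
    to a firm. Groups are found with a simple union-find structure.
    Assumption: one firm effect per connected group is normalised.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for person, firm in zip(person_ids, firm_ids):
        union(("person", person), ("firm", firm))

    firms = set(firm_ids)
    groups = {find(("firm", f)) for f in firms}
    return len(firms), len(groups), len(firms) - len(groups)


# Toy data: person 1 moves from A to B, person 3 moves from C to D,
# firm E has no movers at all.
person_ids = [1, 1, 2, 3, 3, 4]
firm_ids = ["A", "B", "B", "C", "D", "E"]

result = count_identified_firm_effects(person_ids, firm_ids)
print(result)
```

In the toy data the firms fall into the groups {A, B}, {C, D} and {E}, so
under the stated assumption two of the five firm effects are identified.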

-------------------------------------------------------------------------
Thomas Cornelissen
Institute of Empirical Economic Research
University of Hannover, Germany
cornelissen@ewifo.uni-hannover.de