st: Estimating a three-way fixed effects model in a large data set
I have written a Stata ado file for estimating a fixed effects model with
person and firm effects in a large employer-employee data set.
(Or more generally, a three-way error components model estimated by fixed
effects.)
If anyone would like to test the program, please write to my
email address and I'll be happy to share it.
The program works, but I call it a beta version because I may still get useful
comments that improve it.
For those who want to know more about it:
Last year I inquired on Statalist about ways to estimate a large number of
firm and person effects in a large linked employer-employee data set under
the thread "data set larger than RAM".
A common method is to include one of the effects (firm effect) by
including dummies and to sweep out the other effect (person effect) by the
within transformation (subtracting group means). The problem in my case was
that the number of groups (e.g. firms) was too high to create a set of dummy
variables, especially if there are many observations, as Stata has to keep
the whole data set in the computer memory at the time of estimation.
For example, assuming that 4 bytes per element of the design matrix are needed,
2 million observations and 1000 firms require about 8 gigabytes.
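
The arithmetic behind that figure can be checked with a few lines of Python.
This is only an illustration: the helper name dense_bytes is made up for this
note, and it counts just the firm dummy columns, ignoring any other regressors.

```python
def dense_bytes(rows, cols, bytes_per_element=4):
    """Memory needed to hold a dense rows x cols matrix."""
    return rows * cols * bytes_per_element

n_obs = 2_000_000   # observations, as in the example above
n_firms = 1000      # one dummy column per firm

x_bytes = dense_bytes(n_obs, n_firms)
print(x_bytes / 1e9, "gigabytes for the firm dummy columns")
```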
I got some very useful comments and developed the idea of writing a Mata
routine to handle the problem. The starting point of my idea was that the
design matrix X is of dimension (N x K) but that the cross product matrices
X'X and X'y are only of dimension (K x K) and (K x 1).
Therefore, if one can create X'X without needing to create X, then much
memory can be saved. In the present case, large parts of X are firm dummy
variables, and the information about which observation belongs to which firm is
stored in the firm ID variable. Therefore it is possible to create X'X
without having to create all the firm dummies.
In the above example with 1000 firms, X'X requires only 4 megabytes.
I wrote a routine in Mata which does exactly that and carries out the least
squares estimation on that basis. This is interesting in cases where X would
be too large to be kept in memory, but X'X is not too large to be kept in
memory and to be inverted.
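
To illustrate the idea, here is a minimal sketch in plain Python. It is not the
Mata routine itself, and all function and variable names are hypothetical. It
assumes X = [Z, D], where Z holds the ordinary regressors and D the firm
dummies, and it leaves out the within transformation for the person effects to
keep the example short. The cross products are accumulated one observation at
a time from the firm ID, so D is never created. It also assumes that X'X is
nonsingular (no constant in Z, every firm observed).

```python
def cross_products(z_rows, firm_ids, y):
    """Accumulate X'X and X'y for X = [Z, D] without building the dummies D."""
    k = len(z_rows[0])
    firms = sorted(set(firm_ids))
    pos = {f: k + j for j, f in enumerate(firms)}  # column of each firm dummy
    p = k + len(firms)
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for z, f, yi in zip(z_rows, firm_ids, y):
        d = pos[f]
        for a in range(k):
            for b in range(k):
                xtx[a][b] += z[a] * z[b]   # Z'Z block
            xtx[a][d] += z[a]              # Z'D block: sums of z by firm
            xtx[d][a] += z[a]              # D'Z block (transpose of Z'D)
            xty[a] += z[a] * yi            # Z'y
        xtx[d][d] += 1.0                   # D'D is diagonal: firm counts
        xty[d] += yi                       # D'y: sums of y by firm
    return xtx, xty, firms


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes a is square and nonsingular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                factor = m[r][c] / m[c][c]
                for j in range(c, n + 1):
                    m[r][j] -= factor * m[c][j]
    return [m[i][n] / m[i][i] for i in range(n)]


# Tiny made-up example: one regressor, four observations, three firms.
z_rows = [[2.0], [3.0], [5.0], [7.0]]
firm_ids = [10, 10, 20, 30]
y = [1.0, 2.0, 3.0, 4.0]

xtx, xty, firms = cross_products(z_rows, firm_ids, y)
coef = solve(xtx, xty)   # slope first, then one effect per firm
print(coef)
```

The memory used is governed by the size of X'X, not by the number of
observations, because each observation is only added into the running sums.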
Whereas Stata has a limit of 11,000 for the number of regressors, Mata
has virtually no limit on the size of the matrix, so that it is possible to
estimate more than 11,000 firm effects with my program.
The routine also determines how many firm effects are identified by
taking into account the pattern of movers between firms in the data.
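
How the routine does this is not described here. One standard approach for
models with person and firm effects, assumed in the sketch below, is to group
firms that are connected by workers moving between them. Within each connected
group one firm effect serves as the reference, so with J firms and G groups,
J - G firm effects are identified. The Python code and its names are
hypothetical and only illustrate that grouping step.

```python
def mobility_groups(person_ids, firm_ids):
    """Group firms that are linked, directly or indirectly, by movers."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_firm = {}
    for p, f in zip(person_ids, firm_ids):
        find(f)                             # register the firm
        if p in first_firm:
            union(first_firm[p], f)         # a mover links two firms
        else:
            first_firm[p] = f

    groups = {}
    for f in parent:
        groups.setdefault(find(f), set()).add(f)
    return sorted(sorted(g) for g in groups.values())


# Made-up data: person 1 moves A -> B, person 2 moves B -> C,
# persons 3 and 4 stay at firm D.
person_ids = [1, 1, 2, 2, 3, 4]
firm_ids = ["A", "B", "B", "C", "D", "D"]

groups = mobility_groups(person_ids, firm_ids)
n_firms = len(set(firm_ids))
n_identified = n_firms - len(groups)        # J - G under the assumption above
print(groups, n_identified)
```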
A beta version is available to anyone interested in testing or using the program.
Institute of Empirical Economic Research
University of Hannover, Germany