Re: st: reshape a large wide longitudinal data set to long
Mon, 14 Jun 2010 21:47:55 -0500
On Jun 14, 2010, at 6:09 PM, Amanda Fu wrote:
I am working on a wide version of longitudinal set : about 10000
observations, 2000 variables totally for all the 10 years, data set
size : 113,440,212. It is wide because teh original data set is
wide. Now I would like to reshape it into a long version, since
most analysis can be done in long version. But it is not surprising
the --reshape---command can not be done because the data set is too
large. Stata suggests me to either increase memory, or to drop
variables or observations.
The thing is, I have not finally decided what variables are going to
be used in the following analysis yet . If I drop variables, it
will cause the inconvenience that I might have to reshape again and
again to add variables in the long version. Definitely I will not
try to use all the 2000 variables in the analysis, but it is still
painful to re-do the reshaping thing to add extra variables.
How you go about this really depends on information you haven't given
us. To take the simplest case, let's suppose that you have data on a
set of items (e.g., people) and for each of these, you have
observations for up to 10 years. You can then split your variables
into 3 groups:
1) X_i, describing item i (i.e., constant within i)
2) Y_j, describing year j (i.e., constant within j)
3) Z_ij, describing item i in year j
How you proceed will depend on how many variables you have in each of
these three categories.
For example, suppose all your variables are in category (3). If your
wide format file has 10,000 observations and 2000 variables, this
would imply that the length of Z_ij is roughly 200. Thus, in long
format, this would be a dataset with 100,000 observations and
approximately 200 variables. If we (rather conservatively) assume
that all of these variables are float, we are talking about roughly
77MB of data. Not very large by today's standards.
Now, suppose instead that half of your variables (i.e., 1,000) are in
category (1), with the remainder in category (3). In this case,
putting the entire dataset in wide format (again, assuming all
variables are float) would require 420MB. This is because you are
storing multiple copies of X_i -- one for each year in which i was
observed. In the language of relational databases, we say that the
data are not "normalized."
In this case, it would be more efficient to keep two files: one file
containing the variables in (3) in long format, and a second file
containing the variables in (1). When it comes to performing a
specific analysis, you can then grab the variables you need from each
and combine them via -merge-, which is pretty quick. Of course, a
similar argument would apply if you have variables in category (2)
(i.e., you'd have a third file with these variables).
Finally, if you're running out of memory manipulating your data, make
sure that you are using the most compact storage types possible (i.e.,
use -compress-), and make sure that you are not storing any variables
as strings that could be stored as labeled integers. Also, you can
increase Stata's memory using -set memory-, and should consider
investing in some additional physical memory, if necessary.
I was thinking a clumsy way: I break down the original wide data set
into several small wide data sets , reshape them separately, and
then append all the small long data sets together. Is this way OK?
If necessary, you could certainly do this. However, I would break up
your wide file so that you take one or more variables for all years
together. Combining the resulting long files will then involve
merging rather than appending.