Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: reshape a large wide longitudinal data set to long
From 
 
Phil Schumm <[email protected]> 
To 
 
[email protected] 
Subject 
 
Re: st: reshape a large wide longitudinal data set to long 
Date 
 
Mon, 14 Jun 2010 21:47:55 -0500 
On Jun 14, 2010, at 6:09 PM, Amanda Fu wrote:
I am working on a wide version of longitudinal set : about 10000  
observations, 2000 variables totally for all the 10 years, data set  
size :  113,440,212.  It is wide because teh original  data set is  
wide.  Now I would like to reshape it into a long version, since  
most analysis can be done in long version. But it is not surprising   
the --reshape---command can not be done because the data set is too  
large. Stata suggests me to either increase memory, or to drop  
variables or observations.
The thing is, I have not finally decided what variables are going to  
be used in the following analysis yet .  If I drop variables, it  
will cause the inconvenience that I might have to reshape again and  
again to add variables in the long version. Definitely I will not  
try to use all the 2000 variables in the analysis, but it is still  
painful to re-do the reshaping thing to add extra variables.
How you go about this really depends on information you haven't given  
us.  To take the simplest case, let's suppose that you have data on a  
set of items (e.g., people) and for each of these, you have  
observations for up to 10 years.  You can then split your variables  
into 3 groups:
    1) X_i, describing item i (i.e., constant within i)
    2) Y_j, describing year j (i.e., constant within j)
    3) Z_ij, describing item i in year j
How you proceed will depend on how many variables you have in each of  
these three categories.
For example, suppose all your variables are in category (3).  If your  
wide format file has 10,000 observations and 2000 variables, this  
would imply that the length of Z_ij is roughly 200.  Thus, in long  
format, this would be a dataset with 100,000 observations and  
approximately 200 variables.  If we (rather conservatively) assume  
that all of these variables are float, we are talking about roughly  
77MB of data.  Not very large by today's standards.
Now, suppose instead that half of your variables (i.e., 1,000) are in  
category (1), with the remainder in category (3).  In this case,  
putting the entire dataset in wide format (again, assuming all  
variables are float) would require 420MB.  This is because you are  
storing multiple copies of X_i -- one for each year in which i was  
observed.  In the language of relational databases, we say that the  
data are not "normalized."
In this case, it would be more efficient to keep two files: one file  
containing the variables in (3) in long format, and a second file  
containing the variables in (1).  When it comes to performing a  
specific analysis, you can then grab the variables you need from each  
and combine them via -merge-, which is pretty quick.  Of course, a  
similar argument would apply if you have variables in category (2)  
(i.e., you'd have a third file with these variables).
Finally, if you're running out of memory manipulating your data, make  
sure that you are using the most compact storage types possible (i.e.,  
use -compress-), and make sure that you are not storing any variables  
as strings that could be stored as labeled integers.  Also, you can  
increase Stata's memory using -set memory-, and should consider  
investing in some additional physical memory, if necessary.
I was thinking a clumsy way: I break down the original wide data set  
into several small wide data sets , reshape them separately, and  
then append all the small long data sets together. Is this way OK?
If necessary, you could certainly do this.  However, I would break up  
your wide file so that you take one or more variables for all years  
together.  Combining the resulting long files will then involve  
merging rather than appending.
-- Phil
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/