Re: st: reshape a large wide longitudinal data set to long


From   Phil Schumm <pschumm@uchicago.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: reshape a large wide longitudinal data set to long
Date   Mon, 14 Jun 2010 21:47:55 -0500

On Jun 14, 2010, at 6:09 PM, Amanda Fu wrote:
I am working on a wide longitudinal data set: about 10,000 observations and 2,000 variables in total across all 10 years (data set size: 113,440,212). It is wide because the original data set is wide. Now I would like to reshape it into long form, since most analyses can be done in long form. Not surprisingly, the -reshape- command fails because the data set is too large. Stata suggests that I either increase memory or drop variables or observations.

The thing is, I have not yet decided which variables will be used in the subsequent analysis. If I drop variables, I might have to reshape again and again to add variables to the long version. I will definitely not use all 2,000 variables in the analysis, but it would still be painful to redo the reshape every time I need extra variables.


How you go about this really depends on information you haven't given us. To take the simplest case, let's suppose that you have data on a set of items (e.g., people) and for each of these, you have observations for up to 10 years. You can then split your variables into 3 groups:

    1) X_i, describing item i (i.e., constant within i)
    2) Y_j, describing year j (i.e., constant within j)
    3) Z_ij, describing item i in year j

How you proceed will depend on how many variables you have in each of these three categories.

For example, suppose all your variables are in category (3). If your wide format file has 10,000 observations and 2000 variables, this would imply that the length of Z_ij is roughly 200. Thus, in long format, this would be a dataset with 100,000 observations and approximately 200 variables. If we (rather conservatively) assume that all of these variables are float, we are talking about roughly 77MB of data. Not very large by today's standards.
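The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 4 bytes per float and ignores any file overhead, and the two key variables (an item identifier and a year) are an assumption added here to show how the total lands near 77MB.

```python
# Rough size of the long-format file when every variable is in category (3).
# Assumptions: 10,000 items, 10 years, 2,000 wide columns, 4-byte floats,
# no per-file or per-observation overhead.
BYTES_PER_FLOAT = 4
MIB = 1024 ** 2

def dataset_bytes(n_obs, n_vars, bytes_per_value=BYTES_PER_FLOAT):
    """Raw data size in bytes for a rectangular dataset."""
    return n_obs * n_vars * bytes_per_value

long_obs = 10_000 * 10        # one observation per item-year
long_vars = 2_000 // 10       # each measure appears once instead of ten times

without_keys = dataset_bytes(long_obs, long_vars)
with_keys = dataset_bytes(long_obs, long_vars + 2)  # assumed id and year columns

print(f"{without_keys / MIB:.1f} MiB")  # 76.3 MiB
print(f"{with_keys / MIB:.1f} MiB")     # 77.1 MiB
```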

Now, suppose instead that half of your variables (i.e., 1,000) are in category (1), with the remainder in category (3). In this case, putting the entire dataset in long format (again, assuming all variables are float) would require 420MB (100,000 observations, each carrying the 1,000 item-level variables plus roughly 100 year-specific ones). This is because you are storing multiple copies of X_i -- one for each year in which i was observed. In the language of relational databases, we say that the data are not "normalized."
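A figure of about 420MB is what you get from a single long file in which the item-level variables are repeated for every year. The sketch below compares that with storing each X_i only once. It assumes 4-byte floats and leaves out key variables and file overhead, so the numbers are approximate.

```python
# Cost of repeating item-level variables in a single long file, compared
# with storing them once per item. Assumptions: 4-byte floats, no keys,
# no file overhead.
BYTES_PER_FLOAT = 4
MIB = 1024 ** 2

n_items, n_years = 10_000, 10
n_const = 1_000                  # category (1): constant within item
n_varying = 1_000 // n_years     # category (3): 1,000 wide columns, 100 per year

# Everything in one long file: each X_i value is copied into all 10 years.
one_long_file = n_items * n_years * (n_const + n_varying) * BYTES_PER_FLOAT

# Split storage: X_i once per item, Z_ij once per item-year.
item_file = n_items * n_const * BYTES_PER_FLOAT
panel_file = n_items * n_years * n_varying * BYTES_PER_FLOAT

print(f"one long file: {one_long_file / MIB:.1f} MiB")              # 419.6 MiB
print(f"two files:     {(item_file + panel_file) / MIB:.1f} MiB")   # 76.3 MiB
```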

In this case, it would be more efficient to keep two files: one file containing the variables in (3) in long format, and a second file containing the variables in (1). When it comes to performing a specific analysis, you can then grab the variables you need from each and combine them via -merge-, which is pretty quick. Of course, a similar argument would apply if you have variables in category (2) (i.e., you'd have a third file with these variables).
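As a language-neutral illustration of the two-file approach, here is a minimal Python sketch of the many-to-one join that -merge- would perform in Stata. The variable names (sex, birth_year, income, hours), the toy values, and the helper build_analysis_file are all hypothetical.

```python
# Hypothetical toy data; names and values are invented for illustration.
items = {  # category (1): one record per item, keyed by id
    1: {"sex": "F", "birth_year": 1970},
    2: {"sex": "M", "birth_year": 1982},
}
panel = [  # category (3): one record per item-year
    {"id": 1, "year": 2001, "income": 30.0, "hours": 40.0},
    {"id": 1, "year": 2002, "income": 32.0, "hours": 38.0},
    {"id": 2, "year": 2001, "income": 25.0, "hours": 35.0},
]

def build_analysis_file(panel, items, item_vars, panel_vars):
    """Join only the requested variables: many panel rows to one item row."""
    rows = []
    for rec in panel:
        row = {"id": rec["id"], "year": rec["year"]}
        row.update({v: rec[v] for v in panel_vars})
        row.update({v: items[rec["id"]][v] for v in item_vars})
        rows.append(row)
    return rows

# Pull just what one particular analysis needs.
analysis = build_analysis_file(panel, items, item_vars=["sex"], panel_vars=["income"])
for row in analysis:
    print(row)
```

Because only the requested variables are copied, the item-level values are duplicated across years only for the handful of variables an analysis actually uses.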

Finally, if you're running out of memory manipulating your data, make sure that you are using the most compact storage types possible (i.e., use -compress-), and make sure that you are not storing any variables as strings that could be stored as labeled integers. Also, you can increase Stata's memory using -set memory-, and should consider investing in some additional physical memory, if necessary.
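To see why replacing strings with labeled integers saves space, consider the following Python sketch. It is an illustration of the idea only, not Stata's implementation. The variable region, its values, and the storage widths (a 5-byte fixed-width string versus a 1-byte integer) are assumptions.

```python
def encode(values):
    """Replace strings with integer codes and return a code-to-label lookup."""
    labels = {code: s for code, s in enumerate(sorted(set(values)), start=1)}
    to_code = {s: code for code, s in labels.items()}
    return [to_code[v] for v in values], labels

# Hypothetical string variable.
region = ["north", "south", "north", "west", "south", "north"]
codes, labels = encode(region)
print(codes)   # [1, 2, 1, 3, 2, 1]
print(labels)  # {1: 'north', 2: 'south', 3: 'west'}

# Assumed widths: 5 bytes per string value versus 1 byte per integer code.
n_obs = 100_000
bytes_saved = n_obs * (5 - 1)
print(f"{bytes_saved:,} bytes saved over {n_obs:,} observations")
```

The label lookup is stored once, so the saving grows with the number of observations rather than with the number of distinct values.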


I was thinking of a clumsy way: break the original wide data set into several smaller wide data sets, reshape them separately, and then append all the small long data sets together. Is this OK?


If necessary, you could certainly do this. However, I would break up your wide file so that each piece contains one or more variables for all years together. Combining the resulting long files will then involve merging rather than appending.
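The reason the pieces end up merged rather than appended can be seen in a small Python sketch. The column naming scheme (a stub followed by the year, such as income2001), the toy values, and the helper reshape_long are hypothetical. In Stata, the corresponding steps would be -reshape long- on each piece followed by -merge- on the item and year identifiers.

```python
def reshape_long(wide_rows, stubs, years):
    """Turn columns like income2001 into one record per (id, year)."""
    long = {}
    for row in wide_rows:
        for y in years:
            long[(row["id"], y)] = {s: row[f"{s}{y}"] for s in stubs}
    return long

# Hypothetical wide data: one row per item, one column per variable-year.
wide = [
    {"id": 1, "income2001": 30.0, "income2002": 32.0, "hours2001": 40.0, "hours2002": 38.0},
    {"id": 2, "income2001": 25.0, "income2002": 27.0, "hours2001": 35.0, "hours2002": 36.0},
]
years = [2001, 2002]

# Each piece uses ONE variable for ALL years, so each reshape stays small.
piece_a = reshape_long(wide, ["income"], years)
piece_b = reshape_long(wide, ["hours"], years)

# Both pieces have the same (id, year) keys, so they combine column-wise.
merged = {key: {**piece_a[key], **piece_b[key]} for key in piece_a}

# Same result as reshaping everything in one pass.
print(merged == reshape_long(wide, ["income", "hours"], years))  # True
```

Each piece already covers every item and every year, so combining them adds columns, not rows.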


-- Phil


