Re: st: reshape a large wide longitudinal data set to long

From	Phil Schumm <[email protected]>
To	[email protected]
Subject	Re: st: reshape a large wide longitudinal data set to long
Date	Mon, 14 Jun 2010 21:47:55 -0500

On Jun 14, 2010, at 6:09 PM, Amanda Fu wrote:

I am working on a wide version of a longitudinal data set: about 10,000 observations and 2,000 variables in total for all 10 years, data set size: 113,440,212. It is wide because the original data set is wide. Now I would like to reshape it into a long version, since most analysis can be done in the long version. But, not surprisingly, the -reshape- command cannot be run because the data set is too large. Stata suggests that I either increase memory or drop variables or observations.
The thing is, I have not yet decided which variables are going to be used in the following analysis. If I drop variables, it will cause the inconvenience that I might have to reshape again and again to add variables to the long version. I will definitely not try to use all 2,000 variables in the analysis, but it is still painful to redo the reshaping to add extra variables.

How you go about this really depends on information you haven't given us. To take the simplest case, let's suppose that you have data on a set of items (e.g., people) and for each of these, you have observations for up to 10 years. You can then split your variables into 3 groups:


    1) X_i, describing item i (i.e., constant within i)
    2) Y_j, describing year j (i.e., constant within j)
    3) Z_ij, describing item i in year j

How you proceed will depend on how many variables you have in each of these three categories.

For example, suppose all your variables are in category (3). If your wide format file has 10,000 observations and 2000 variables, this would imply that the length of Z_ij is roughly 200. Thus, in long format, this would be a dataset with 100,000 observations and approximately 200 variables. If we (rather conservatively) assume that all of these variables are float, we are talking about roughly 77MB of data. Not very large by today's standards.
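The arithmetic behind that estimate can be checked with -display-. This is only a sketch: it assumes every variable is a 4-byte float and ignores any per-observation overhead Stata may add.

```stata
* Long format when every variable is in category (3)
* 10,000 items x 10 years = 100,000 observations
display 10000 * 10
* 2,000 wide variables / 10 years = 200 variables per observation
display 2000 / 10
* 100,000 obs x 200 vars x 4 bytes per float, converted to megabytes
display 100000 * 200 * 4 / 1024^2
```

The last line comes out at a little over 76, in line with the rough figure above.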

Now, suppose instead that half of your variables (i.e., 1,000) are in category (1), with the remainder in category (3). In this case, putting the entire dataset in long format (again, assuming all variables are float) would require 420MB. This is because you are storing multiple copies of X_i -- one for each year in which i was observed. In the language of relational databases, we say that the data are not "normalized."
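To see where a figure of that size comes from, here is the same kind of back-of-the-envelope calculation, again assuming 4-byte floats throughout and a single file with one row per item-year:

```stata
* 1,000 item-level variables repeated on each of the 100,000 item-year rows,
* plus 1,000 / 10 = 100 item-by-year variables
display 100000 * (1000 + 100) * 4 / 1024^2
* For comparison: one row per item, 10,000 rows x 2,000 variables
display 10000 * 2000 * 4 / 1024^2
```

The first line gives just under 420; the second gives about 76. The difference is entirely due to the repeated copies of the item-level variables.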

In this case, it would be more efficient to keep two files: one file containing the variables in (3) in long format, and a second file containing the variables in (1). When it comes to performing a specific analysis, you can then grab the variables you need from each and combine them via -merge-, which is pretty quick. Of course, a similar argument would apply if you have variables in category (2) (i.e., you'd have a third file with these variables).
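A minimal sketch of the two-file layout follows. All names are hypothetical: widefile.dta is the original wide file, id identifies the item, sex and birthyr stand in for category (1) variables, and inc2001-inc2010 and emp2001-emp2010 stand in for category (3) variables. The -merge- line uses the Stata 11 syntax; earlier releases use a different form.

```stata
* File 1: item-level variables, one observation per id
use id sex birthyr using widefile, clear
save items, replace

* File 2: item-by-year variables, reshaped to long
* (assumes nothing else in the file starts with "inc" or "emp")
use id inc* emp* using widefile, clear
reshape long inc emp, i(id) j(year)
save itemyears, replace

* For a particular analysis, bring in only the item-level variables needed
use itemyears, clear
merge m:1 id using items, keepusing(sex)
```

Loading a subset of variables with -use varlist using- means the full 2,000-variable file never has to be held in memory at once.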

Finally, if you're running out of memory manipulating your data, make sure that you are using the most compact storage types possible (i.e., use -compress-), and make sure that you are not storing any variables as strings that could be stored as labeled integers. Also, you can increase Stata's memory using -set memory-, and should consider investing in some additional physical memory, if necessary.
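Something along these lines would cover those steps. The file name widefile.dta, the string variable state, and the 500m allocation are all placeholders. Note that -set memory- has to be issued with no data in memory, and that it is only needed in Stata 11 and earlier, since later releases manage memory automatically.

```stata
* Raise the memory allocation before loading any data
clear
set memory 500m

use widefile, clear

* Demote each variable to the smallest storage type that holds its values
compress

* Replace a string variable with a labeled integer
encode state, generate(state_code)
drop state

save widefile_small, replace
```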

I was thinking of a clumsy way: I break down the original wide data set into several small wide data sets, reshape them separately, and then append all the small long data sets together. Is this way OK?

If necessary, you could certainly do this. However, I would break up your wide file so that you take one or more variables for all years together. Combining the resulting long files will then involve merging rather than appending.
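As a rough illustration, with the same hypothetical names as before (widefile.dta, id, inc2001-inc2010, emp2001-emp2010) and the Stata 11 -merge- syntax:

```stata
* Piece 1: all years of inc
use id inc* using widefile, clear
reshape long inc, i(id) j(year)
save long_inc, replace

* Piece 2: all years of emp
use id emp* using widefile, clear
reshape long emp, i(id) j(year)
save long_emp, replace

* Combine the pieces by matching on id and year
use long_inc, clear
merge 1:1 id year using long_emp
```

Because each piece holds every year for its variables, the pieces share the same id-year keys and line up one-to-one, which is why the final step is a merge rather than an append.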



-- Phil
