Statalist


Re: st: RE: RE: Use foreach or forvalues to create the long form data


From   "Friedrich Huebler" <fhuebler@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: RE: Use foreach or forvalues to create the long form data
Date   Thu, 16 Oct 2008 11:52:01 -0400

Anupit,

-reshape- is indeed slow. You can change the structure of your data by
saving the contents of the variables in individual files that are
subsequently combined with -append-. The difference between the two
approaches can be demonstrated with the auto data.

First, create a dataset with 148,000 observations in 10 variables,
plus an identifier.

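* Load the auto data and keep only the 10 numeric variables
sysuse auto, clear
drop make foreign
* Rename the variables to var1, var2, ..., var10
local i = 1
foreach var of varlist * {
  rename `var' var`i'
  local ++i
}
* 74 x 2000 = 148,000 observations, each with a unique identifier i
expand 2000
gen i = _n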

We can now -reshape- the data from wide to long.

reshape long var, i(i) j(j)

The alternative solution does not rely on -reshape-. Starting again
from the wide dataset created above (that is, before -reshape- is
run), we use -forvalues- in combination with -preserve-, -keep-,
-save-, -restore- and -append-.

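* r(k) is the number of variables, including the identifier i
describe, short
local j = r(k) - 1
* Save each variable, together with i and its index j, in a temporary file
forvalues i = 1/`j' {
  preserve
  keep i var`i'
  rename var`i' var
  gen j = `i'
  tempfile var`i'
  save `var`i''
  restore
}
* Stack the temporary files to obtain the long form
use `var1', clear
forvalues i = 2/`j' {
  append using `var`i''
}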
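
One way to check that the loop reproduces the -reshape- result is to
run both on a small copy of the data and compare them with -cf-. This
is only a sketch: the tempfile names (wide, viareshape, part1 to
part10), the loop index k, the counter n and the use of the unexpanded
auto data are assumptions made for this example, not part of the code
above.

```stata
* Small wide dataset: the 10 numeric auto variables plus an identifier
sysuse auto, clear
drop make foreign
local n = 0
foreach v of varlist * {
  local ++n
  rename `v' var`n'
}
gen long i = _n
tempfile wide viareshape
save `wide'

* Reference result from -reshape-
reshape long var, i(i) j(j)
sort i j
save `viareshape'

* Same restructuring with -preserve-, -keep-, -save-, -restore-, -append-
use `wide', clear
forvalues k = 1/`n' {
  preserve
  keep i var`k'
  rename var`k' var
  gen byte j = `k'
  tempfile part`k'
  save `part`k''
  restore
}
use `part1', clear
forvalues k = 2/`n' {
  append using `part`k''
}
* The appended data are ordered by j, then i; put them in -reshape- order
sort i j

* -cf- exits with an error if any value differs between the two results
cf _all using `viareshape'
```

The -sort- matters here because -cf- compares the two datasets
observation by observation.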

-reshape- is more convenient because it only takes one line of code.
This convenience comes at the cost of processing time and memory
requirements. On my PC the first solution with -reshape- takes about
11 seconds. The second solution takes less than 2 seconds and also
needs less memory.
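
To repeat the comparison on another machine, the two solutions can be
timed with -timer-. A minimal sketch, assuming the wide test dataset is
first saved to a tempfile (called wide here, a name chosen for this
example) so that both solutions start from the same data. The loop
index k and the tempfile names part1 to part10 are likewise
illustrative, and the timings will differ from those quoted above.

```stata
* Build the wide test dataset (148,000 observations) and keep a copy
sysuse auto, clear
drop make foreign
local n = 0
foreach v of varlist * {
  local ++n
  rename `v' var`n'
}
expand 2000
gen long i = _n
tempfile wide
save `wide'

timer clear

* Timer 1: -reshape-
timer on 1
reshape long var, i(i) j(j)
timer off 1

* Timer 2: save each variable to a tempfile, then -append-
use `wide', clear
timer on 2
forvalues k = 1/`n' {
  preserve
  keep i var`k'
  rename var`k' var
  gen byte j = `k'
  tempfile part`k'
  save `part`k''
  restore
}
use `part1', clear
forvalues k = 2/`n' {
  append using `part`k''
}
timer off 2

* Report the elapsed seconds recorded by timers 1 and 2
timer list
```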

Friedrich

On Thu, Oct 16, 2008 at 10:57 AM, Supnithadnaporn, Anupit
<gtg065t@mail.gatech.edu> wrote:
> Hello,
>
> Martin and Nick, thank you so much. Your suggestion works very well.
>
> I am sorry for being unclear about my question.
> I will clarify it now. I have a large dataset of around 1 million records. There are
> around 20 variables. I would like to run a clogit regression, which requires me
> to reshape the data into long form. Basically, it is a model of a person
> choosing a product from a choice set of 12. Thus, the total number of records would be
> 12 million after reshaping.
>
> In the beginning, I tried with a smaller sample and reshape worked very slowly.
> Then, I thought I should start with only 2 ID variables: person ID and choice ID (1-12).
> I created the wide form data of only 2 ID variables, reshaped it to the long form, and
> lastly merged other variables that are associated with person and choice respectively.
>
> However, even with only 2 ID variables, it took a long time for reshape to finish
> for a small sample of the total 1 million records. That is why I am trying to find a
> faster way to create the empty dataset with 2 ID variables first. Then my next step
> is to merge the information about a person and the choice.
>
> I hope this is clear enough. And if you and others have a better approach to
> preparing data like this, please let me know. There is a lot more for me to learn
> from all of you.
>
> Thank you,
> Anupit
>
>
> ----- "Nick Cox" <n.j.cox@durham.ac.uk> wrote:
>
>> Martin's advice looks good.
>>
>> But Anupit's question doesn't hang together for me. The specific
>> example, and even longer ones of the same form, don't strike me as
>> -reshape- questions at all as they involve creating new data in
>> structured form.
>>
>> By the way, for large datasets make sure to use -egen long- or -egen
>> double- if you need to.
>>
>> But if you had a -reshape- question, strict sense, I doubt you could
>> speed things up much by programming it yourself with -forvalues- or
>> -foreach-. That would, broadly speaking, mean that you were a better
>> Stata programmer than the Stata developers. There could well be
>> exceptions, but I'd guess that this statement would be true much more
>> often than its converse.
>>
>> Nick
>> n.j.cox@durham.ac.uk
>>
>> Martin Weiss
>>
>> -h egen, seq()-
>>
>> Supnithadnaporn, Anupit
>>
>> Would you please suggest how to create data in the long form
>> by *not* using reshape? I would like to avoid reshape because reshape
>> takes a very long time. In fact, the final total number of
>> records that I have to create would be around 12,000,000.
>>
>> I think foreach and forvalues can do this work.
>> But I am a novice in Stata programming and could not figure it out so
>> far.
>>
>> In the beginning, I have only Obsid which is created by
>>
>> gen Obsid = _n
>>
>> The desired data would look like this:
>>
>> Obsid   Vid     Imp
>> 1       1       1
>> 2       1       2
>> 3       1       3
>> 4       1       4
>> 5       2       1
>> 6       2       2
>> 7       2       3
>> 8       2       4
>>
>> ...
>>
>>
>> 100     25      4
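
The layout in the quoted example (100 observations, Vid running from 1
to 25 in blocks of 4, Imp cycling from 1 to 4) can be built directly
with the seq() function of -egen- that Martin pointed to, without
-reshape-. A minimal sketch, assuming the dataset starts empty; the
numbers 100, 25 and 4 are read off the example, and -long- follows
Nick's advice for large datasets.

```stata
clear
set obs 100
gen long Obsid = _n
* Vid: 1 1 1 1 2 2 2 2 ... 25 25 25 25
egen long Vid = seq(), from(1) to(25) block(4)
* Imp: 1 2 3 4 1 2 3 4 ...
egen long Imp = seq(), from(1) to(4)
list in 1/8, noobs sepby(Vid)
```

For a larger problem, the number of observations and the to() and
block() values would be scaled up accordingly.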


