Statalist



Re: st: RE: RE: Use foreach or forvalues to create the long form data


From   "Supnithadnaporn, Anupit" <gtg065t@mail.gatech.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: RE: Use foreach or forvalues to create the long form data
Date   Sat, 18 Oct 2008 20:38:26 -0400 (EDT)

Thank you for all the suggestions. I will try all of them and report the results later.

Anupit


----- "Friedrich Huebler" <fhuebler@gmail.com> wrote:

> Anupit,
> 
> -reshape- is indeed slow. You can change the structure of your data by
> saving the contents of the variables in individual files that are
> subsequently combined with -append-. The difference between the two
> approaches can be demonstrated with the auto data.
> 
> First, create a dataset with 148,000 observations in 10 variables,
> plus an identifier.
> 
> sysuse auto, clear
> drop make foreign
> local i = 1
> foreach var of varlist * {
>   ren `var' var`i'
>   local ++i
> }
> expand 2000
> gen i = _n
> 
> We can now -reshape- the data from wide to long.
> 
> reshape long var, i(i) j(j)
> 
> The alternative solution does not rely on -reshape-. Instead, starting
> again from the wide data created above, we use -forvalues- in
> combination with -preserve-, -keep-, -save-, -restore- and -append-.
> 
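> * Number of var* columns = total number of variables minus the identifier i
> describe, short
> local j = r(k) - 1
> * Save each var`i' together with the identifier in its own temporary file
> forvalues i = 1/`j' {
>   preserve
>   keep i var`i'
>   rename var`i' var
>   gen j = `i'
>   tempfile var`i'
>   save `var`i''
>   restore
> }
> * Stack the temporary files to obtain the long form
> use `var1', clear
> forvalues i = 2/`j' {
>   append using `var`i''
> }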
> 
> -reshape- is more convenient because it only takes one line of code.
> This convenience comes at the cost of processing time and memory
> requirements. On my PC the first solution with -reshape- takes about
> 11 seconds. The second solution takes less than 2 seconds and also
> needs less memory.
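
The timing comparison above can be reproduced with -timer-. What follows is only a minimal sketch under a few assumptions: the test data are rebuilt at a smaller scale (-expand 200- is an arbitrary choice, not the 2000 used above), the loop bound 10 is hard-coded because ten variables remain after dropping make and foreign, and the macro names k, n, wide and part are mine. Timings will differ by machine and Stata version.

```stata
* Rebuild a smaller wide test dataset (expand 200 is an arbitrary choice)
sysuse auto, clear
drop make foreign
local k = 1
foreach v of varlist * {
    rename `v' var`k'
    local ++k
}
expand 200
gen long i = _n
tempfile wide
save `wide'

timer clear

* Timer 1: wide to long with -reshape-
timer on 1
reshape long var, i(i) j(j)
timer off 1

* Timer 2: split-and-append, starting again from the wide data
use `wide', clear
timer on 2
forvalues n = 1/10 {
    preserve
    keep i var`n'
    rename var`n' var
    gen byte j = `n'
    tempfile part`n'
    save `part`n''
    restore
}
use `part1', clear
forvalues n = 2/10 {
    append using `part`n''
}
timer off 2

* Elapsed seconds for both approaches
timer list
```

Note that the appended result is ordered by j and then i, whereas -reshape- leaves the data sorted by i and j; add -sort i j- if the order matters.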
> 
> Friedrich
> 
> On Thu, Oct 16, 2008 at 10:57 AM, Supnithadnaporn, Anupit
> <gtg065t@mail.gatech.edu> wrote:
> > Hello,
> >
> > Martin and Nick, thank you so much. Your suggestion works very well.
> >
> > I am sorry for being unclear about my question. I will clarify it now.
> > I have a large dataset of around 1 million records and around 20
> > variables. I would like to run the clogit regression, which requires me
> > to reshape the data into long form. Basically, it is a model of a person
> > choosing a product from a choice set of 12. Thus, the total number of
> > records would be 12 million after reshaping.
> >
> > In the beginning, I tried a smaller sample and reshape worked very
> > slowly. Then I thought I should start with only 2 ID variables: person ID
> > and choice ID (1-12). I created the wide-form data with only these 2 ID
> > variables, reshaped it to long form, and lastly merged in the other
> > variables that are associated with the person and the choice respectively.
> >
> > However, even with only the 2 ID variables, reshape took a long time to
> > finish on the small sample of the 1 million records. That is why I am
> > trying to find a faster way to create the empty dataset with the 2 ID
> > variables first. My next step is then to merge in the information about
> > the person and the choice.
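
One way to build such a person-by-choice skeleton without -reshape- is -expand- followed by a within-group counter. This is only a sketch under assumptions: the variable names person and choice are hypothetical, and 1,000 persons stand in for the 1 million records described above.

```stata
clear
* Hypothetical number of persons; the real data would have about 1 million
set obs 1000
gen long person = _n

* Twelve copies of each person, one per alternative in the choice set
expand 12

* Number the copies 1-12 within each person
bysort person: gen byte choice = _n
```

The person-level and choice-level variables can then be merged onto this skeleton by person and by choice respectively.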
> >
> > I hope this is clear enough. If you or others have a better approach to
> > preparing data like this, please let me know. There is a lot more for me
> > to learn from all of you.
> >
> > Thank you,
> > Anupit
> >
> >
> > ----- "Nick Cox" <n.j.cox@durham.ac.uk> wrote:
> >
> >> Martin's advice looks good.
> >>
> >> But Anupit's question doesn't hang together for me. The specific
> >> example, and even longer ones of the same form, don't strike me as
> >> -reshape- questions at all as they involve creating new data in
> >> structured form.
> >>
> >> By the way, for large datasets make sure to use -egen long- or -egen
> >> double- if you need to.
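
The reason the storage type matters: Stata's default type, float, holds integers exactly only up to 16,777,216, so larger identifiers can collide. A small illustration, with hypothetical variable names id_f and id_l:

```stata
clear
set obs 3
* Identifiers just above the largest integer a float holds exactly
gen float id_f = 16777215 + _n
gen long  id_l = 16777215 + _n
format id_f id_l %12.0f
list
```

In the second observation id_f is stored as 16777216 while id_l keeps 16777217, so two distinct identifiers become indistinguishable in the float variable.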
> >>
> >> But if you had a -reshape- question, strict sense, I doubt you could
> >> speed things up much by programming it yourself with -forvalues- or
> >> -foreach-. That would, broadly speaking, mean that you were a better
> >> Stata programmer than the Stata developers. There could well be
> >> exceptions, but I'd guess that this statement would be true much more
> >> often than its converse.
> >>
> >> Nick
> >> n.j.cox@durham.ac.uk
> >>
> >> Martin Weiss
> >>
> >> - h egen,seq()-
> >>
> >> Supnithadnaporn, Anupit
> >>
> >> Could you please suggest how to create data in long form *without*
> >> using reshape? I would like to avoid reshape because it takes a very
> >> long time. In fact, the final total number of records that I have to
> >> create would be around 12,000,000.
> >>
> >> I think foreach and forvalues can do this work, but I am a novice in
> >> Stata programming and have not been able to figure it out so far.
> >>
> >> To begin with, I have only Obsid, which is created by
> >>
> >> gen Obsid = _n
> >>
> >> The desired data would look like this:
> >>
> >> Obsid   Vid     Imp
> >> 1       1       1
> >> 2       1       2
> >> 3       1       3
> >> 4       1       4
> >> 5       2       1
> >> 6       2       2
> >> 7       2       3
> >> 8       2       4
> >>
> >> ...
> >>
> >>
> >> 100     25      4
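
The layout shown above can be generated directly with -egen- and its seq() function, along the lines Martin pointed to. A minimal sketch, assuming 100 observations, 25 values of Vid and 4 values of Imp as in the example:

```stata
clear
set obs 100
gen long Obsid = _n

* Vid: 1,1,1,1,2,2,2,2,...,25 (each value repeated in blocks of 4)
egen long Vid = seq(), from(1) to(25) block(4)

* Imp: 1,2,3,4,1,2,3,4,... (cycles within each Vid)
egen long Imp = seq(), from(1) to(4)

list in 1/8
```

For the full problem only -set obs- and the to() and block() values would change, and no loop or -reshape- is involved.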
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/




