[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
yumin sheng <shengyumin@yahoo.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
st: thanks again, more Stata data manipulation tips, or alternative statistical software? |

Date |
Mon, 23 Aug 2004 12:15:34 -0700 (PDT) |

Dear all, Thanks so much again to Dev for recommending –reshape-, to Toyoto for venturing an illustrative example, and especially to Professor Nick Cox for kindly writing the immensely helpful code to “reshape” the WDI data. Sorry that I haven’t been able to follow up on this until now; it took my computer many days to figure out that Stata might not be able to easily “manipulate” dataset of this large size. Indeed, although with 2 or 3 countries’ worth of data from WDI each time, Stata 8 was able to perform Nick’s powerful “reshape” code quite smoothly and speedily, it seemed very slow when the entire dataset of around 200 countries (with 200-500 variables each, for 43 years) is included. Nick’s first-line –reshape- command took about 24 hours to execute, the “fillin” command took much longer and could never finish completely after about 60 hours before Stata finally crashed. The original WDI data is 62.5 MB in Stata, and the size of the transformed data after the first command “reshape long v , i(country var) j(year)” grows to 540 MB. My computer has a memory of about 512 MB DDR SDRAM at 333MHz. The hard drive has 80 GB. Maybe it is because I am just using the intercooled version of Stata 8, not the SE, but I also set the Stata memory to 1000 MB. Is it true that Stata 8 is no longer suitable for data manipulation of this kind? If not, how should I proceed next so that Nick’s code could be run to the finishing line? If yes, could anyone kindly suggest alternative statistical software that might do the job? Terribly sorry to trouble you all again with this, but your wise suggestion will be tremendously appreciated, as always! Best wishes, Yumin --- Nick Cox <n.j.cox@durham.ac.uk> wrote: > Yumin Sheng asked [slight editing here, combining > two postings] > > --------------------- > I have been trying to save data from World > Development > Indicators (WDI, online) into a cross-section > time-series format. The original data look like the > following > > Country Variable Name 1960 1961 ... 2002 > Afghanistan Aid X X ... X > Afghanistan Population X X ...X > ... ... ... ... ... ... > ... ... ... ... ... ... > Afghanistan Wage X X ... X > Albania Aid X X ... X > Albania Population X X ... X > ... ... ... ... ... ... > ... ... ... ... ... ... > Albania Wage X X ... X > > The ideal format I would like to have would be: > > Country Aid Population ... ... Wage > Afghanistan 1960 X X ... X > Afghanistan 1961 X X ... X > ... ... ... ... ... ... > ... ... ... ... ... ... > Afghanistan 2002 X X ... X > Albania 1960 X X ... X > Albania 1961 X X ... X > ... ... ... ... ... ... > ... ... ... ... ... ... > Albania 2002 X X ... X > ... ... ... ... ... ... > China 1960 X X ... X > China 1961 X X ... X > ... ... ... ... ... ... > ... ... ... ... ... ... > China 2002 X X ... X > > World Development Indicators have about 200 > variables > for each country. > > But (1) sometimes many words are contained in > one variable name such as "Export share in GDP, %" > and (2) some countries have data on more > variables than do other countries. > -------------------- > > Dev Vencappa recommended -reshape- and Toyoto Iwata > posted code for a example dataset. Toyoto's code > doesn't extend to coping with the first problem > above and I am not sure that it would cope with the > second problem above. > > Here is a sketch of slightly more general code: > the dataset is just an example. (I guess that > there is a shorter and still general solution > but I can't find it.) > > . l > > +------------------------------------------+ > | country variable v1960 v1961 | > |------------------------------------------| > 1. | Afghanistan Aid 1 6 | > 2. | Afghanistan Population 2 7 | > 3. | Albania Aid 3 8 | > 4. | Albania Fishing 4 9 | > 5. | Albania Population 5 10 | > +------------------------------------------+ > > -reshape- to long and then rectangularise. > -fillin- takes care of problem (2). > > . reshape long v , i(country var) j(year) > . fillin country var year > . drop _fillin > > We know from problem (1) that the values in > -variable- aren't all legal variable names, > so we can't use them directly. One trick is > > . egen group = group(variable) > > That gives ways of counting variables and > years, so our code is more general: > > . su group , meanonly > . local nvars = r(max) > . qui count if country == country[1] & group == > group[1] > . local nyears = r(N) > > Now we split the composite -v- into variables > and assign variable labels, picking them up > from -variable-: > > . gen order = _n > > . qui forval i = 1/`nvars' { > . gen v`i' = v if group == `i' > . su order if group == `i', meanonly > . label var v`i' `"`=variable[`r(min)']'"' > . } > > . drop v variable order > > . l > > +-------------------------------------------+ > | country year group v1 v2 v3 | > |-------------------------------------------| > 1. | Afghanistan 1960 1 1 . . | > 2. | Afghanistan 1961 1 6 . . | > 3. | Afghanistan 1960 2 . . . | > 4. | Afghanistan 1961 2 . . . | > 5. | Afghanistan 1960 3 . . 2 | > |-------------------------------------------| > 6. | Afghanistan 1961 3 . . 7 | > 7. | Albania 1960 1 3 . . | > 8. | Albania 1961 1 8 . . | > 9. | Albania 1960 2 . 4 . | > 10. | Albania 1961 2 . 9 . | > |-------------------------------------------| > 11. | Albania 1960 3 . . 5 | > 12. | Albania 1961 3 . . 10 | > +-------------------------------------------+ > > We need to pull values for most variables from > later in order: > > . qui forval j = 2/`nvars' { > . replace v`j' = v`j'[_n+`nyears' * (`j' - 1)] > . } > > . keep if group == 1 > . drop group > > . l > > +-----------------------------------+ > | country year v1 v2 v3 | > |-----------------------------------| > 1. | Afghanistan 1960 1 . 2 | > 2. | Afghanistan 1961 6 . 7 | > 3. | Albania 1960 3 4 5 | > 4. | Albania 1961 8 9 10 | > +-----------------------------------+ > > I also looked at a solution centred on -xpose-, > but the preparation and clean-up were not > easier than this. > > Nick > n.j.cox@durham.ac.uk > > * > * For searches and help try: > * > http://www.stata.com/support/faqs/res/findit.html > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > _______________________________ Do you Yahoo!? Win 1 of 4,000 free domain names from Yahoo! Enter now. http://promotions.yahoo.com/goldrush * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**Re: st: data manipulation help***From:*"Nick Cox" <n.j.cox@durham.ac.uk>

- Prev by Date:
**st: RE: Eq command** - Next by Date:
**st: STATA and NHIS data - Elementary Question** - Previous by thread:
**st: conditional logit with constraint?** - Next by thread:
**st: Error message: input too long; after lab def** - Index(es):

© Copyright 1996–2016 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |