[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
st: thanks again, more Stata data manipulation tips, or alternative statistical software?

From	yumin sheng <[email protected]>
To	[email protected]
Subject	st: thanks again, more Stata data manipulation tips, or alternative statistical software?
Date	Mon, 23 Aug 2004 12:15:34 -0700 (PDT)
Dear all,

  Thanks so much again to Dev for recommending
�reshape-, to Toyoto for venturing an illustrative
example, and especially to Professor Nick Cox for
kindly writing the immensely helpful code to �reshape�
the WDI data. Sorry that I haven�t been able to follow
up on this until now; it took my computer many days to
figure out that Stata might not be able to easily
�manipulate� dataset of this large size. 

  Indeed, although with 2 or 3 countries� worth of
data from WDI each time, Stata 8 was able to perform
Nick�s powerful �reshape� code quite smoothly and
speedily, it seemed very slow when the entire dataset
of around 200 countries (with 200-500 variables each,
for 43 years) is included. Nick�s first-line �reshape-
command took about 24 hours to execute, the �fillin�
command took much longer and could never finish
completely after about 60 hours before Stata finally
crashed.

  The original WDI data is 62.5 MB in Stata, and the
size of the transformed data after the first command
�reshape long v , i(country var) j(year)� grows to 540
MB. My computer has a memory of about 512 MB DDR SDRAM
at 333MHz. The hard drive has 80 GB. Maybe it is
because I am just using the intercooled version of
Stata 8, not the SE, but I also set the Stata memory
to 1000 MB.

  Is it true that Stata 8 is no longer suitable for
data manipulation of this kind? If not, how should I
proceed next so that Nick�s code could be run to the
finishing line? If yes, could anyone kindly suggest
alternative statistical software that might do the
job?

  Terribly sorry to trouble you all again with this,
but your wise suggestion will be tremendously
appreciated, as always!

Best wishes,
Yumin

--- Nick Cox <[email protected]> wrote:

> Yumin Sheng asked [slight editing here,  combining 
> two postings] 
> 
> ---------------------
> I have been trying to save data from World
> Development
> Indicators (WDI, online) into a cross-section
> time-series format. The original data look like the
> following
> 
> Country	Variable Name	1960	1961	...	2002
> Afghanistan	Aid	X	X	...	X
> Afghanistan	Population	X	X	...X 
> ...	...	...	...	...	...
> ...	...	...	...	...	...
> Afghanistan	Wage	X	X	...	X
> Albania	 Aid		X	X        ...	X
> Albania	Population	X	X	...	X
> ...	...	...	...	...	...
> ...	...	...	...	...	...
> Albania	Wage	X	X	...	X
> 
> The ideal format I would like to have would be:
> 
> Country	Aid	Population	...	...	Wage
> Afghanistan	1960	X	X	...	X
> Afghanistan	1961	X	X	...	X
> ...	...	...	...	...	...
> ...	...	...	...	...	...
> Afghanistan	2002	X	X	...	X
> Albania	1960	X	X	...	X
> Albania	1961	X	X	...	X
> ...	...	...	...	...	...
> ...	...	...	...	...	...
> Albania	2002	X	X	...	X
> ...	...	...	...	...	...
> China	1960	X	X	...	X
> China	1961	X	X	...	X
> ...	...	...	...	...	...
> ...	...	...	...	...	...
> China	2002	X	X	...	X
> 
> World Development Indicators have about 200
> variables
> for each country. 
> 
> But (1) sometimes many words are contained in
> one variable name such as "Export share in GDP, %"
> and (2) some countries have data on more
> variables than do other countries. 
> --------------------
> 
> Dev Vencappa recommended -reshape- and Toyoto Iwata 
> posted code for a example dataset. Toyoto's code 
> doesn't extend to coping with the first problem 
> above and I am not sure that it would cope with the 
> second problem above. 
> 
> Here is a sketch of slightly more general code:
> the dataset is just an example. (I guess that 
> there is a shorter and still general solution
> but I can't find it.) 
> 
> . l 
> 
>      +------------------------------------------+
>      |     country     variable   v1960   v1961 |
>      |------------------------------------------|
>   1. | Afghanistan          Aid       1       6 |
>   2. | Afghanistan   Population       2       7 |
>   3. |     Albania          Aid       3       8 |
>   4. |     Albania      Fishing       4       9 |
>   5. |     Albania   Population       5      10 |
>      +------------------------------------------+
> 
> -reshape- to long and then rectangularise. 
> -fillin- takes care of problem (2). 
> 
> . reshape long v , i(country var) j(year) 
> . fillin country var year 
> . drop _fillin 
> 
> We know from problem (1) that the values in 
> -variable- aren't all legal variable names, 
> so we can't use them directly. One trick is 
> 
> . egen group = group(variable) 
> 
> That gives ways of counting variables and 
> years, so our code is more general: 
> 
> . su group , meanonly 
> . local nvars = r(max) 
> . qui count if country == country[1] & group ==
> group[1]
> . local nyears = r(N) 
> 
> Now we split the composite -v- into variables
> and assign variable labels, picking them up 
> from -variable-: 
> 
> . gen order = _n 
> 
> . qui forval i = 1/`nvars' { 
> .	gen v`i' = v if group == `i' 
> .	su order if group == `i', meanonly 
> .	label var v`i' `"`=variable[`r(min)']'"' 
> . }
> 
> . drop v variable order 
> 
> . l 
> 
>      +-------------------------------------------+
>      |     country   year   group   v1   v2   v3 |
>      |-------------------------------------------|
>   1. | Afghanistan   1960       1    1    .    . |
>   2. | Afghanistan   1961       1    6    .    . |
>   3. | Afghanistan   1960       2    .    .    . |
>   4. | Afghanistan   1961       2    .    .    . |
>   5. | Afghanistan   1960       3    .    .    2 |
>      |-------------------------------------------|
>   6. | Afghanistan   1961       3    .    .    7 |
>   7. |     Albania   1960       1    3    .    . |
>   8. |     Albania   1961       1    8    .    . |
>   9. |     Albania   1960       2    .    4    . |
>  10. |     Albania   1961       2    .    9    . |
>      |-------------------------------------------|
>  11. |     Albania   1960       3    .    .    5 |
>  12. |     Albania   1961       3    .    .   10 |
>      +-------------------------------------------+
> 
> We need to pull values for most variables from 
> later in order: 
> 
> . qui forval j = 2/`nvars' { 
> . 	replace v`j' = v`j'[_n+`nyears' * (`j' - 1)] 
> . }
> 
> . keep if group == 1
> . drop group 
> 
> . l 
> 
>      +-----------------------------------+
>      |     country   year   v1   v2   v3 |
>      |-----------------------------------|
>   1. | Afghanistan   1960    1    .    2 |
>   2. | Afghanistan   1961    6    .    7 |
>   3. |     Albania   1960    3    4    5 |
>   4. |     Albania   1961    8    9   10 |
>      +-----------------------------------+
> 
> I also looked at a solution centred on -xpose-, 
> but the preparation and clean-up were not 
> easier than this. 
> 
> Nick 
> [email protected] 
> 
> *
> *   For searches and help try:
> *  
> http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 



		
_______________________________
Do you Yahoo!?
Win 1 of 4,000 free domain names from Yahoo! Enter now.
http://promotions.yahoo.com/goldrush
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
References:
- Re: st: data manipulation help
  - From: "Nick Cox" <[email protected]>
Prev by Date: st: RE: Eq command
Next by Date: st: STATA and NHIS data - Elementary Question
Previous by thread: st: conditional logit with constraint?
Next by thread: st: Error message: input too long; after lab def
Index(es):
- Date
- Thread