Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: data manipulation help


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   Re: st: data manipulation help
Date   Sat, 14 Aug 2004 15:31:45 +0100

Yumin Sheng asked [slight editing here,  combining 
two postings] 

---------------------
I have been trying to save data from World Development
Indicators (WDI, online) into a cross-section
time-series format. The original data look like the
following

Country	Variable Name	1960	1961	...	2002
Afghanistan	Aid	X	X	...	X
Afghanistan	Population	X	X	...X 
...	...	...	...	...	...
...	...	...	...	...	...
Afghanistan	Wage	X	X	...	X
Albania	 Aid		X	X        ...	X
Albania	Population	X	X	...	X
...	...	...	...	...	...
...	...	...	...	...	...
Albania	Wage	X	X	...	X

The ideal format I would like to have would be:

Country	Aid	Population	...	...	Wage
Afghanistan	1960	X	X	...	X
Afghanistan	1961	X	X	...	X
...	...	...	...	...	...
...	...	...	...	...	...
Afghanistan	2002	X	X	...	X
Albania	1960	X	X	...	X
Albania	1961	X	X	...	X
...	...	...	...	...	...
...	...	...	...	...	...
Albania	2002	X	X	...	X
...	...	...	...	...	...
China	1960	X	X	...	X
China	1961	X	X	...	X
...	...	...	...	...	...
...	...	...	...	...	...
China	2002	X	X	...	X

World Development Indicators have about 200 variables
for each country. 

But (1) sometimes many words are contained in
one variable name such as "Export share in GDP, %"
and (2) some countries have data on more
variables than do other countries. 
--------------------

Dev Vencappa recommended -reshape- and Toyoto Iwata 
posted code for a example dataset. Toyoto's code 
doesn't extend to coping with the first problem 
above and I am not sure that it would cope with the 
second problem above. 

Here is a sketch of slightly more general code:
the dataset is just an example. (I guess that 
there is a shorter and still general solution
but I can't find it.) 

. l 

     +------------------------------------------+
     |     country     variable   v1960   v1961 |
     |------------------------------------------|
  1. | Afghanistan          Aid       1       6 |
  2. | Afghanistan   Population       2       7 |
  3. |     Albania          Aid       3       8 |
  4. |     Albania      Fishing       4       9 |
  5. |     Albania   Population       5      10 |
     +------------------------------------------+

-reshape- to long and then rectangularise. 
-fillin- takes care of problem (2). 

. reshape long v , i(country var) j(year) 
. fillin country var year 
. drop _fillin 

We know from problem (1) that the values in 
-variable- aren't all legal variable names, 
so we can't use them directly. One trick is 

. egen group = group(variable) 

That gives ways of counting variables and 
years, so our code is more general: 

. su group , meanonly 
. local nvars = r(max) 
. qui count if country == country[1] & group == group[1]
. local nyears = r(N) 

Now we split the composite -v- into variables
and assign variable labels, picking them up 
from -variable-: 

. gen order = _n 

. qui forval i = 1/`nvars' { 
.	gen v`i' = v if group == `i' 
.	su order if group == `i', meanonly 
.	label var v`i' `"`=variable[`r(min)']'"' 
. }

. drop v variable order 

. l 

     +-------------------------------------------+
     |     country   year   group   v1   v2   v3 |
     |-------------------------------------------|
  1. | Afghanistan   1960       1    1    .    . |
  2. | Afghanistan   1961       1    6    .    . |
  3. | Afghanistan   1960       2    .    .    . |
  4. | Afghanistan   1961       2    .    .    . |
  5. | Afghanistan   1960       3    .    .    2 |
     |-------------------------------------------|
  6. | Afghanistan   1961       3    .    .    7 |
  7. |     Albania   1960       1    3    .    . |
  8. |     Albania   1961       1    8    .    . |
  9. |     Albania   1960       2    .    4    . |
 10. |     Albania   1961       2    .    9    . |
     |-------------------------------------------|
 11. |     Albania   1960       3    .    .    5 |
 12. |     Albania   1961       3    .    .   10 |
     +-------------------------------------------+

We need to pull values for most variables from 
later in order: 

. qui forval j = 2/`nvars' { 
. 	replace v`j' = v`j'[_n+`nyears' * (`j' - 1)] 
. }

. keep if group == 1
. drop group 

. l 

     +-----------------------------------+
     |     country   year   v1   v2   v3 |
     |-----------------------------------|
  1. | Afghanistan   1960    1    .    2 |
  2. | Afghanistan   1961    6    .    7 |
  3. |     Albania   1960    3    4    5 |
  4. |     Albania   1961    8    9   10 |
     +-----------------------------------+

I also looked at a solution centred on -xpose-, 
but the preparation and clean-up were not 
easier than this. 

Nick 
n.j.cox@durham.ac.uk 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index