Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: Import csv file


From   "Nuno Soares" <ns.mlists@gmail.com>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Import csv file
Date   Tue, 29 Aug 2006 15:40:05 +0100

Hi Nick,

Thanks for your insight. However, I've ran into a problem when running your code. The code runs fine for your example, but when we replace the company names by actual names, we have problems with spaces in the companies' names or names that have characters that aren't allowed as var names by Stata (ex. I have companies called "@UK",  "Marks & Spencer", "Azure Holdings", etc.). These names can't be used your first rename procedure... That's a pitty because your code is simpler...
Best,

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: 29 August 2006 14:53
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: Import csv file

This solution seems a bit more complicated than necessary. 

I started with this soares.csv

Name,Company1,Company2,Company3
Code,0000(MV),0001(MV),0002(MV)
Currency,GBP,GBP,USD
31/12/1986,1,2,3
30/01/1987,4,5,6
30/12/2005,7,8,9

and then ran this 

insheet using soares.csv, clear
foreach v of var * {
	rename `v' `=`v'[1]'
}
drop in 1
gen long id = _n
reshape long Company , i(id)
gen Code = Company if  Name == "Code"
gen Currency = Company if Name == "Currency"
bysort _j (Code) : replace Code = Code[_N] bysort _j (Currency) : replace Currency = Currency[_N] drop if inlist(Name, "Code", "Currency") drop id rename Company Value rename _j Company gen Date = date(Name, "dmy") format Date %dd/n/CY drop Name destring Value sort Company Date

The results are

. l

     +----------------------------------------------------+
     | Company   Value       Code   Currency         Date |
     |----------------------------------------------------|
  1. |       1       1   0000(MV)        GBP   31/12/1986 |
  2. |       1       4   0000(MV)        GBP    30/1/1987 |
  3. |       1       7   0000(MV)        GBP   30/12/2005 |
  4. |       2       2   0001(MV)        GBP   31/12/1986 |
  5. |       2       5   0001(MV)        GBP    30/1/1987 |
     |----------------------------------------------------|
  6. |       2       8   0001(MV)        GBP   30/12/2005 |
  7. |       3       3   0002(MV)        USD   31/12/1986 |
  8. |       3       6   0002(MV)        USD    30/1/1987 |
  9. |       3       9   0002(MV)        USD   30/12/2005 |
     +----------------------------------------------------+

What are the key differences? 

1. No need to resort to a user-written program -sxpose-, no matter how much you trust it. 

2. Date manipulations are much simpler with -date()-. 
You _don't_ have to -destring- the date. Also dropping the "/" is dangerous! 

Nick
n.j.cox@durham.ac.uk 

Nuno Soares
 
> Thanks a lot Radu! Just had to destring the date_ variables. 
> Here goes the code. Might be useful to other statalisters:
> 
> ***************** BEGIN CODE ***********************
> 
> clear
> *reading in the data
> insheet using "C:\mv1.csv", clear
> 
> *transposing
> sxpose, clear
> 
> *renaming the first three variables
> foreach v of varlist _var1 _var2 _var3 { local i = `v'[1] rename `v' 
> `i'
> }
> 
> *getting rid of "/" in the dates
> *note that after renaming, only the date variables start with "_var"
> destring _var*, replace ignore("/") force
> 
> *now renaming the remaining variables
> foreach v of varlist _var* {
> local i = `v'[1]
> rename `v' data`i'
> }
> 
> *dropping the first observation (because this is just the name of the 
> variable) drop in 1
> 
> *reshaping
> reshape long data, i(Name) j(date) string
> 
> 
> *now you're left with the date in a string format, which you can 
> convert to a proper date, in a couple of steps
> 
> gen str date_year = substr(date, -4,.) gen str date_month = 
> substr(date, 3, 2) gen str date_day = substr(date, 1, 2) destring  
> date_year date_month date_day, replace
> 
> gen truedate = mdy(date_month, date_day, date_year) format truedate %d
> 
> ***************** END CODE ***********************

Radu Ban
 
> Here's a try. For this you would need to install the -sxpose- command 
> (do a -findit sxpose- and then follow the link to install). This is 
> needed because the built-in -xpose- command doesn't work with strings.
> 
> ******************************
> 
> *reading in the data
> insheet using "C:\somedir\myfile.csv", clear
> 
> *transposing
> sxpose, clear
> 
> *renaming the first three variables
> foreach v of varlist _var1 _var2 _var3 { local i = `v'[1] rename `v' 
> `i'
> }
> 
> *getting rid of "/" in the dates
> *note that after renaming, only the date variables start with "_var"
> destring _var*, replace ignore("/") force
> 
> *now renaming the remaining variables
> foreach v of varlist _var* {
> local i = `v'[1]
> rename `v' data`i'
> }
> 
> *dropping the first observation (because this is just the name of the 
> variable) drop in 1
> 
> *reshaping
> reshape long data, i(Name) j(date) string
> 
> 
> *now you're left with the date in a string format, which you can 
> convert to a proper *date, in a couple of steps
> 
> gen str date_year = substr(date, -4,.) gen str date_month = 
> substr(date, 3, 2) gen str date_day = substr(date, 1, 2)
> 
> gen truedate = mdy(date_month, date_day, date_year)
> 
> *now you can format the date using the preferred date format, for 
> example format truedate %d
> ********************
> 
> I hope this helps. Also if you want to run this for several .csv 
> files, check out the archives for last week, there's a thread on how 
> to loop over all files in a directory.
 

Nuno Soares 

> > I need to import some csv files that have an awkward
> format. The format is always the same:
> >
> > Name Company1 Company2 Company3 ...
> > Code 0000(MV) 0001(MV) 0002(MV) ...
> > Currency £ £ € ...
> > 31/12/1986 data1 data2 data3 ...
> > 30/01/1987 dataX dataY dataZ ...
> > ...
> > 30/12/2005 dataW dataJ dataK ...
> >
> > The number of companies is variable (ie, columns), and its
> width is also variable. Data1, data2, etc. are numeric.
> >
> > Can you please tell me if it is possible to import to Stata
> the data described, in the following format:
> >
> > Company  Code   Currency  Date  Data
> > Company1 0000(MV) £   31/12/1986 data1
> > Company1 0000(MV) £   30/01/1987 dataX
> > ...
> > Company1 0000(MV) £   30/12/2005 dataW
> > Company2 0001(MV) £   31/12/1986 data2
> > Company2 0001(MV) £   30/01/1987 dataY
> > ...
> > Company2 0001(MV) £   30/12/2005 dataJ
> > Company3 0002(MV) £   31/12/1986 data3
> > Company3 0002(MV) £   30/01/1987 dataZ
> > ...
> > Company3 0002(MV) £   30/12/2005 dataK
> > and so on.
> >
> > Some treatment on the original data can be done in Excel,
> but I wanted to avoid it (I have almost one hundred files that I need 
> to import...). If any one has an idea of how to do it, I would be 
> greatly appreciated!

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index