Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Re: Concatenate files


From   Eric Booth <eric.a.booth@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: Re: Concatenate files
Date   Fri, 6 Apr 2012 11:05:01 -0500

<>

Hi Andrew:

Your examples made things clearer.  Here is a edited version of the script I sent you earlier that works to append these files snippets you sent.  I put comments below where I edited/added code:

**********************!

clear
sa "master.dta", replace emptyok  //here is your 'master' file

**build the filelist ==>
local files: dir `"`c(pwd)'"' files "*.csv", respectcase nofail
di `"`files'"' //this is your file list
local files: subinstr local files ".csv" "", all
di `"`files'"' //scrubs the extension so that we can use it below

****loop over file list & append==>
foreach f in `files' {
 *--this part will clean the csv file and make a stata version:
	preserve
		insheet using "`f'.csv", comma clear nonames //i changed this to 'no'names
		qui desc
		di `"`r(k)'"' //contains # of vars
		if `r(k)' == 1 {
			ds
			rename `r(varlist)' var2
			}
/* Added part below to rename vars in 2col - running desc & if again is redundant but might make it easier to map back to the original example I gave */
		qui desc
		if `r(k)' == 2 {
			ds
			loc j `"`r(varlist)'"' //to make the macro below more clear
			rename `:word 1 of `j'' var1
			rename `:word 2 of `j'' var2
			}
		tostring *, replace
		sa "`f'.dta", replace // you'll use this for the append
	restore
 *--now append the cleaned file to the master:
	append using "`f'.dta"
	save "master.dta", replace
	} //go back to the top and clean, save, append next file..




**here's your appended master file==>	
u "master.dta", clear
desc
destring * , replace ignore(",")  //edited
list //added
**********************!

- Eric


__
Eric A. Booth
Public Policy Research Institute 
Texas A&M University
ebooth@ppri.tamu.edu
+979.845.6754



On Apr 6, 2012, at 9:30 AM, Andrew Hua wrote:

> Hi Eric,
> 
> Thanks a lot for getting back to me! I enclose two files with this
> email, one single-column file and one two-column file, for your
> reference. The column in the single-column file should be appended to
> the second column of the two-column file, leaving the first column in
> the two-column file either blank.
> 
> Thanks again!
> 
> Best regards,
> 
> Andrew
> 
> On Fri, Apr 6, 2012 at 10:08 AM, Eric Booth <eric.a.booth@gmail.com> wrote:
>> 
>> Hi Andrew:  I cannot see where you get the error -- so I don't know what the issue is.
>>  If you can send me a snippet (e.g., the top 5 rows) of at least 1 single col. and 1 two col. file, then maybe I can sort the issue out.
>> 
>> (better: Also put an example of how that single and double column file would look appended so that I know we are thinking about combining the files in the same manner (( my fear is that some of your language implies a join/merge and that the append is approaching it the wrong way)).
>> 
>> 
>> On Apr 5, 2012, at 4:28 PM, Andrew Hua wrote:
>> 
>>> Hi Eric,
>>> 
>>> I really appreciate your time offering the solution to the questions I
>>> posted to statalist. Your code made sense to me, although I wanted to
>>> match the variable in the single column file with the second variable
>>> in the two column file. So, I changed "rename `r(varlist)' var1" to
>>> "rename `r(varlist)' var2". That was the only change I made in your
>>> code. I really thought it should work but still got a error: r(106).
>>> The resultant master file had 61 variables not 2 variables as I
>>> expected.  You further suggestions will be highly appreciated!
>>> 
>>> Best regards,
>>> 
>>> Andrew
>>> 
>>> The following is your original email and my email. Thanks!
>>> 
>>> 1.  -append- will stack/combine together variables from 2 datasets
>>> with the same name (variables with different names won't stack
>>> together).  You could write a loop to open all the .csv files and
>>> rename the variables depending on the # of variables (or columns) in
>>> the file.  Then when you append them, they will stack/append properly.
>>> (If the variable format is an issue, you can read the file into
>>> memory (-insheet-) and then -save- a Stata version with the variables
>>> all stored as string to ensure that variables with the same names
>>> append properly -- or use the 'force' option in your -append-
>>> command).
>>> 
>>> 2. A second comment about your example code is that you use -dir- to
>>> build a filelist and then read the .csv file names from it -- you can
>>> do this much more directly using the macro extended function (-help
>>> extended_fcn-) 'dir'.
>>> 
>>> Below is some (untested) code to append depending on the # of
>>> variables in the file (since you say that is the decision rule for how
>>> these variables are stacked) by using the returned value from
>>> -describe- and the macro extended functions to build your file list.
>>> This code assumes that in your files with two vars you've got the
>>> variables 'var1' and 'var2' and in the files with one var you want to
>>> rename the single variable 'var1' since you say that you want that
>>> variable to append to the first column in the 2 variable files.  (You
>>> can change this decision rule below if that's not what you meant)
>>> Note that I -tostring- all the vars for the -append-; you may not need
>>> this step.
>>> 
>>> 
>>> **********************!
>>> 
>>> clear
>>> sa "master.dta", replace emptyok  //here is your 'master' file
>>> 
>>> **build the filelist ==>
>>> local files: dir `"`c(pwd)'"' files "*.csv", respectcase nofail
>>> di `"`files'"' //this is your file list
>>> local files: subinstr local files ".csv" "", all
>>> di `"`files'"' //scrubs the extension so that we can use it below
>>> 
>>> ****loop over file list & append==>
>>> foreach f in `files' {
>>>  *--this part will clean the csv file and make a stata version:
>>>       preserve
>>>               insheet using "`f'.csv", comma clear names
>>>               qui desc
>>>               di `"`r(k)'"' //contains # of vars
>>>               if `r(k)' == 1 {
>>>                       ds
>>>                       rename `r(varlist)' var1
>>>                       }
>>>               tostring *, replace
>>>               sa "`f'.dta", replace // you'll use this for the append
>>>       restore
>>>  *--now append the cleaned file to the master:
>>>       append using "`f'.dta"
>>>       save "master.dta", replace
>>>       } //go back to the top and clean, save, append next file..
>>> 
>>> 
>>> **here's your appended master file==>
>>> u "master.dta", clear
>>> desc
>>> **********************!
>>> 
>>> 
>>> 
>>> - - Eric
>>> 
>>> __
>>> Eric A. Booth
>>> Public Policy Research Institute
>>> Texas A&M University
>>> ebooth@ppri.tamu.edu
>>> +979.845.6754
>>> 
>>> On Mar 31, 2012, at 8:59 AM, Huaandrew wrote:
>>> 
>>>> Hello everybody,
>>>> 
>>>> I am using Stata 12 to create a large master file which is created by appending 3,000 csv files. The problem is that the 3,000 files have two kinds of structure (mixed together and no way to separate them by file names): one has only one column and the other two columns. For the two-column files, the first column is character variable, while the second column is numeral(character numeric), such as "12,356". For the one-column files, the column is numeral as in the two-column file. What Stata commands I should use to distinguish the two kinds of files when appending them into the master file. What I'd like to do is that if it's one-column file, let Stata create a new column which is just like the first column in the two-column file.The original code without distinguishing the two kinds of files is as follows. When I ran it, Stata clashed when reading the first one-column file after reading several two-column files.
>>>> 
>>>> cd d:\Stata_Data\found!
>>>> dir *.csv /a-d /b>d:\Stata_Data\filelist.txtfile
>>>> open myfile using d:\Stata_Data\filelist.txt, read
>>>> file read myfile line
>>>> insheet using `line', comma
>>>> gen hpl_id = substr("`line'",1,22)
>>>> save `line'.dta, replace
>>>> save master_data.dta, replace
>>>> drop _all
>>>> file read myfile line
>>>> while r(eof)==0{
>>>> insheet using `line', comma
>>>> gen hpl_id = substr("`line'",1,22)
>>>> save `line'.dta, replace
>>>> append using master_data.dta
>>>> save master_data.dta, replace
>>>> drop _all
>>>> file read myfile line}
>>>> 
>>>> Thanks!
>>>> 
>>>> Andrew
>> 


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index