Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: repeat same commands over hundreds of files


From   Eric Booth <ebooth@ppri.tamu.edu>
To   "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: repeat same commands over hundreds of files
Date   Tue, 2 Nov 2010 21:11:57 +0000

<>

Based on your description, I think the solution is probably closer to my second example in my initial response.  See if this works:

*********!
**i like to use globals for filepaths - this isn't necessary**
global sf   "/Users/tbrunell/MPG//"

**grab all folders**
global folders:  dir  "$sf" dirs "*", respectcase
 di `"$folders"'   // these should be all your state subfolders


foreach f  of global folders  {
	di "Folder: `f'"

**grab all files in each folder**
global files: dir `"$sf/`f'"' files "*.csv", respectcase
di in green `"$files"'   //make sure this worked


**filter out the file extension so that we can save it as .dta**
global files: subinstr global files  ".csv" "", all
di in yellow `"$files"'   // make sure this worked


token `"$files"'

while `"`1'"' != "" {

cap confirm file  "$sf//`f'//`1'.csv"
   if !_rc {

clear
insheet using "$sf//`f'//`1'.csv"
drop in L /*this drops file notation at the bottom*/
compress
gen demper=dem/(dem+rep)
gen demwin=.
replace demwin=1 if demper>.5 & demper~=.
replace demwin=0 if demper<.5
sort rkey
gen overalldemper=overalldem/(overalldem+overallrep)
collapse (count) numberofseats=demper (sum) demwin (mean) year demper overalldemper (p50) median=demper,by(rkey)
gen percentdemdist=demwin/numberofseats


save "$sf//`f'//`1'.dta", replace

}

else {

/* 
note:  the -confirm- if/else block isn't really necessary when
using this approach (it's better suited to
the forvalues loop approach I described earlier), but it helped
me diagnose when I was missing a `f' in one of my paths, and 
it doesn't get in the way, so I left it in -- you can take it out 
if you don't want it
*/

di "file `1'.csv does not exist!"
              }

mac shift
       }

}

************!
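
A variation on the loop above, offered as a sketch rather than a tested solution: it keeps the same folder layout and the -dir- extended function, but holds the lists in local macros and walks them with nested -foreach- loops, so there is no -tokenize- / -mac shift- bookkeeping. The local macro names (root, folders, files, file, stub) are my own illustrative choices, and the data steps are left as a placeholder comment.

```stata
* Sketch only: the root path follows the thread; adjust it to your machine.
local root "/Users/tbrunell/MPG"

* All state subfolders under the root
local folders : dir "`root'" dirs "*", respectcase

foreach f of local folders {
    di "Folder: `f'"

    * All .csv files in this state folder
    local files : dir "`root'/`f'" files "*.csv", respectcase

    foreach file of local files {
        * Strip the extension so the output can be saved as .dta
        local stub : subinstr local file ".csv" "", all

        insheet using "`root'/`f'/`file'", clear
        * ... same gen/replace/collapse steps as in the code above ...
        save "`root'/`f'/`stub'.dta", replace
    }
}
```

Because -foreach- moves through the list on its own, there is nothing to shift at the end of each pass, and the file names come straight from -dir-, so no -confirm file- check is needed.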

- Eric 

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu

On Nov 2, 2010, at 3:42 PM, tbrunell wrote:

> These are very helpful. To be more specific, my file structure will have
> 
> /Users/tbrunell/MPG/  as the root
> then there are currently 50 subfolders one for each state AL, AR,...WY.
> The file names look like
> mpg_09_CTC1972_1972_EDCD11_10_JH22.csv
> mpg_09_CTC1972_1974_EDCD11_10_JH22.csv
> mpg_09_CTC1972_1976_EDCD11_10_JH22.csv
> mpg_09_CTC1972_1978_EDCD11_10_JH22.csv
> mpg_09_CTC1972_1980_EDCD11_10_JH22.csv
> mpg_09_CTC1982_1982_EDCD11_10_JH22.csv
> mpg_09_CTC1982_1984_EDCD11_10_JH22.csv
> mpg_09_CTC1982_1986_EDCD11_10_JH22.csv
> mpg_09_CTC1982_1988_EDCD11_10_JH22.csv
> mpg_09_CTC1982_1990_EDCD11_10_JH22.csv
> 
> the things that change across states are
> the number after MPG which is a state number
> The 3 letters before the first year: CTC is CT Congress, TXS would be Texas Senate, etc.
> the first year is the redistricting regime, usually a year ending in 2
> then the second year is the election year
> 
> I could do several things like
> 1) not have 50 separate folders, just keep everything in one folder
> 2) rename all the input files.
> 
> Though I must admit I would prefer not to do either of those things.
> 
> Thanks for your help
> 
> 
> 
> 
> On Nov 2, 2010, at 3:31 PM, Eric Booth wrote:
> 
>> <>
>> 
>> One other note:  if your files are sequentially numbered but there are gaps (as there are in my example of filenames), you might want to put in a -confirm- statement to capture whether the file exists and skip it if it doesn't exist.  So, modifying my prev. example, you'd want something like this:
>> 
>> *********!
>> forval n = 1972/1981 {
>> 
>> cap confirm file  "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
>>    if !_rc {
>> 
>> clear
>> insheet using "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
>> drop in L /*this drops file notation at the bottom*/
>> compress
>> gen demper=dem/(dem+rep)
>> gen demwin=.
>> replace demwin=1 if demper>.5 & demper~=.
>> replace demwin=0 if demper<.5
>> sort rkey
>> gen overalldemper=overalldem/(overalldem+overallrep)
>> collapse (count) numberofseats=demper (sum) demwin (mean) year demper overalldemper (p50) median=demper,by(rkey)
>> gen percentdemdist=demwin/numberofseats
>> 
>> 
>> **create a macro for the decade**
>> local save
>> if inrange(`n', 1970, 1979) local save 1970
>> if inrange(`n', 1980, 1989) local save 1980 
>> 
>> 
>> save "/Users/tbrunell//MPG/CT/CTC`save's", replace
>> 
>> }
>> 
>> else {
>> di "file for `n' doesnt exist!"
>>               }
>> }
>> ************!
>> 
>> - Eric
>> __
>> Eric A. Booth
>> Public Policy Research Institute
>> Texas A&M University
>> ebooth@ppri.tamu.edu
>> 
>> On Nov 2, 2010, at 3:22 PM, Eric Booth wrote:
>> 
>>> <>
>>> 
>>> Hi Tom:
>>> 
>>> The best approach probably depends on how your file names are sequenced and how your folders/files are organized, but programs like -fs- (from SSC) and others are useful for this type of work.  Here are two approaches:
>>> 
>>> 
>>> assuming you've got files named sequentially like this:
>>> 
>>> mpg_09_CTC1972_1972_EDCD11_10_JH22
>>> mpg_09_CTC1973_1973_EDCD11_10_JH22
>>> mpg_09_CTC1974_1974_EDCD11_10_JH22
>>> mpg_09_CTC1975_1975_EDCD11_10_JH22
>>> mpg_09_CTC1981_1981_EDCD11_10_JH22
>>> mpg_09_CTC1982_1982_EDCD11_10_JH22
>>> 
>>> 
>>> 
>>> You could use a -forvalues- loop like:
>>> 
>>> *********!
>>> forval n = 1972/1981 {
>> 
>> cap confirm file  "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
>>    if !_rc {
>>> clear
>>> insheet using "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
>>> drop in L /*this drops file notation at the bottom*/
>>> compress
>>> gen demper=dem/(dem+rep)
>>> gen demwin=.
>>> replace demwin=1 if demper>.5 & demper~=.
>>> replace demwin=0 if demper<.5
>>> sort rkey
>>> gen overalldemper=overalldem/(overalldem+overallrep)
>>> collapse (count) numberofseats=demper (sum) demwin (mean) year demper overalldemper (p50) median=demper,by(rkey)
>>> gen percentdemdist=demwin/numberofseats
>>> 
>>> 
>>> **create a macro for the decade**
>>> local save
>>> if inrange(`n', 1970, 1979) local save 1970
>>> if inrange(`n', 1980, 1989) local save 1980 
>>> 
>>> 
>>> save "/Users/tbrunell//MPG/CT/CTC`save's", replace
>>> 
>> }
>> 
>> else {
>> di "file for `n' doesnt exist!"
>>               }
>> }
>>> ************!
>>> 
>>> Note the use of the local macros to create the decade for the -save- filename.
>>> 
>>> 
>>> 
>>> Another approach is to just find all the .csv files in your folder (or alternatively this could be done to find all the folders of interest and all the .csv files in all the folders of interest) using the macro extended functions (see -help extended_fcn-)  and run the code on all of them , e.g., 
>>> 
>>> *************!
>>> global files:dir "<folder path>" files "*.csv", respectcase
>>> global files: subinstr global files ".csv" "", all
>>> token `"$files"'
>>> di in yellow `"$files"'
>>> 
>>> while "`1'" != "" {
>>> 	clear
>>> 	insheet using "/Users/tbrunell/MPG/CT/`1'.csv"
>>> 	<snip>
>>> 	save "/Users/tbrunell//MPG/CT/`1'.dta", replace
>>> 
>>> macro shift
>>> }
>>> ***************!
>>> 
>>> 
>>> 
>>> - Eric
>>> __
>>> Eric A. Booth
>>> Public Policy Research Institute
>>> Texas A&M University
>>> ebooth@ppri.tamu.edu
>>> 
>>> 
>>> P.S.  Say "Hi" to Dave Smith for me if he's still around there.
>>> 
>>> 
>>> 
>>> 
>>> On Nov 2, 2010, at 2:57 PM, tbrunell wrote:
>>> 
>>>> I am doing some simple analysis on election data that spans all the states and several decades.
>>>> So I have hundreds of files that I want to do the same relatively simple analysis on (I have an example below).
>>>> At first I started writing .do files for each state/year and the only things I changed were the 
>>>> 1) file name for the insheet command
>>>> 2) the name and location of the collapsed file at the end.
>>>> 
>>>> However, when I wanted to add an additional command this meant opening hundreds of separate .do files, making a change, resaving the file.  It is not the end of the world, but I would prefer to set up the commands and then, somehow, tell stata to run the commands separately for each specified file and then save the resulting file with some new name.
>>>> 
>>>> The techs at Stata recommended using macros for file names and the foreach command.  But that doesn't solve my filename and output file problem.
>>>> 
>>>> Any recommendations would be much appreciated.
>>>> 
>>>> Tom Brunell
>>>> Professor of Political Science
>>>> University of Texas at Dallas
>>>> 
>>>> _____________________________
>>>> clear
>>>> insheet using "/Users/tbrunell/MPG/CT/mpg_09_CTC1972_1972_EDCD11_10_JH22.csv"
>>>> drop in L /*this drops file notation at the bottom*/
>>>> compress
>>>> 
>>>> gen demper=dem/(dem+rep)
>>>> gen demwin=.
>>>> replace demwin=1 if demper>.5 & demper~=.
>>>> replace demwin=0 if demper<.5
>>>> sort rkey
>>>> gen overalldemper=overalldem/(overalldem+overallrep)
>>>> 
>>>> *here overalldemper will be total votes percentage, demper is "normalized" vote - averaged across districts
>>>> collapse (count) numberofseats=demper (sum) demwin (mean) year demper overalldemper (p50) median=demper,by(rkey)
>>>> gen percentdemdist=demwin/numberofseats
>>>> 
>>>> save "/Users/tbrunell//MPG/CT/CTC1970s", replace
>> 
>> 




