Notice: On April 23, 2014, Statalist moved from an email list to a forum.

RE: st: Converting a SAS datastep to Stata

From   Nick Cox
Subject   RE: st: Converting a SAS datastep to Stata
Date   Tue, 14 Dec 2010 10:50:32 +0000

"almost certainly a performance disaster": that's more drastic a judgement than my experience supports. Finding the right code for an unfamiliar approach can often be more time-consuming than the extra computing time. Naturally, I agree that doing this repeatedly and/or doing it for very large files might justify search for a really efficient solution. 


Kevin Geraghty

So you want some lookup function, basically. My take, for what it's worth, is that the merge is your best option. You could write an ado program pseudo lookup-function, and loop through the file invoking it for every observation of your data set, but that is almost certainly a performance disaster.

The merge ain't so bad. It's just one statement. You don't have to sort your data beforehand, and there is no i/o. That is the whole deal with Stata: data sets are held entirely in memory.

tempfile mylookuptable

<create and save your temporary data here>
save `mylookuptable', replace

*now read your "real" data set, mybigfile.dta, into memory
use mybigfile, clear

*now merge it with your temporary lookup table dset
merge m:1 flpdyr using `mylookuptable', assert(master match) nogenerate
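
To make the placeholder step concrete, here is one way the lookup table might be built. This is only a sketch under stated assumptions: the key variable is taken to be flpdyr, the exemption amounts are the 16 values listed in Daniel Feenberg's message quoted below (so the table covers 1993 through 2008 only), and mybigfile is the illustrative data set name used above. It uses keep(master match) rather than assert() so that lookup years absent from the main data are dropped quietly instead of raising an error.

```stata
* Build the lookup table: one row per filing year.
clear
input int flpdyr int exemption
1993 2350
1994 2450
1995 2500
1996 2550
1997 2650
1998 2700
1999 2750
2000 2800
2001 2900
2002 3000
2003 3050
2004 3100
2005 3200
2006 3300
2007 3400
2008 3500
end

tempfile mylookuptable
save `mylookuptable'

* Load the main data and attach the exemption by filing year.
* Returns whose year is not in the table are kept with exemption missing.
use mybigfile, clear
merge m:1 flpdyr using `mylookuptable', keep(master match) nogenerate
```

A parameter that also varies by filing status would need a second key variable in both the table and the merge.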

I hardly know Mata at all, either, but there is one possibility worth investigating. Define your 1 x 18 array "exmp" of exemptions, but instead of writing

gen exemption = .
forvalues i = 1/`c(N)' {
    replace exemption = exmp[1, flpdyr[`i'] - 1992] in `i'
}

you just write

gen exemption = exmp[1, flpdyr - 1992]

That is, you do not need to loop through your records explicitly. I'd at least try the experiment, although I'm not sure it will work. Note also that there are no one-dimensional arrays in Mata; they are just one-row or one-column matrices, so you have to specify two indices when you refer to an element, even if the row or column index is always 1.
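
The one-line form can be tried with an ordinary Stata matrix (rather than a Mata one), since Stata accepts matrix elements inside expressions and, as far as I know, evaluates the subscripts separately for each observation. A minimal sketch, assuming flpdyr holds the filing year and using only the 16 amounts listed in the quoted message, so that column 1 corresponds to 1993 and column 16 to 2008:

```stata
* Row vector of personal exemptions; column 1 is 1993, column 16 is 2008.
* (The /// line continuation works in a do-file, not at the command line.)
matrix exmp = (2350, 2450, 2500, 2550, 2650, 2700, 2750, 2800, ///
               2900, 3000, 3050, 3100, 3200, 3300, 3400, 3500)

* The column subscript is an expression in flpdyr, so no explicit loop
* over observations is needed.
generate exemption = exmp[1, flpdyr - 1992]
```

A Stata matrix always takes two subscripts, which is why the row index 1 appears explicitly. Checking a few years by hand (for instance with tabulate flpdyr, summarize(exemption)) would confirm that the offset of 1992 lines up with the data.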

----- "Daniel Feenberg" wrote:

> I have done programs to calculate income tax liability in SAS and
> Fortran. Both those languages allow tax parameters that vary across
> years and filing status to be held in initialized arrays. For example,
> in SAS one could declare:
>     array exmp(1993:2010) _temporary_;
>     retain exmp 2350 2450 2500 2550 2650 2700 2750 2800 2900 3000 3050 3100
>                 3200 3300 3400 3500;
> and then assigning the correct value of the personal exemption to
> every individual record is just:
>     exemption = exmp(flpdyr);
> where flpdyr is a variable in the data with the filing year. I am at a
> bit of a loss as to how to do this in Stata. I don't like:
>     gen exemption = (flpdyr==1993)*2350 + (flpdyr==1994)*2450...(for 18 subexpressions in all)
> or
>     gen     exemption = 2350 if flpdyr==1993
>     replace exemption = 2450 if flpdyr==1994
>     ...(for 18 lines in all)...
> because these require (and execute) so much repetitive code.
> Another possibility is to create a dataset of parameters by year and
> filing status, then sort the tax return data by year and filing
> status, and finally merge the parameters onto the tax return data. But
> that requires a sort and a lot of I/O, which could be slow with
> potentially millions of returns. The additional memory required is
> probably not a big issue.
> I don't actually know Mata, but I think I could define a rowvector:
>      exmp = (2350, 2450, 2500, 2550, 2650, 2700, 2750, 2800, 2900, 3000, 3050, 3100,
>              3200, 3300, 3400, 3500);
> and then loop over all the tax returns executing:
> for each return (where i indexes returns). That seems to mean that
> every variable is going to have to carry around a [i] subscript and
> there will be 1,000 lines of Mata code executed for each return
> (rather than the preferred 1,000 lines of code for all the returns
> together). That is much less attractive than leaving the observation
> number implicit, as the regular Stata language does. Brief study of
> [M-2] subscripts doesn't suggest any "matrixy" way of coding this.
> I expect I am missing something obvious. Can someone point me in the
> right direction?
