
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at


Re: st: Re: Converting a SAS datastep to Stata

From   Daniel Feenberg <>
Subject   Re: st: Re: Converting a SAS datastep to Stata
Date   Tue, 14 Dec 2010 07:14:30 -0500 (EST)

On Mon, 13 Dec 2010, Kevin Geraghty wrote:

FYI, I tried this to satisfy my own curiosity; it works. It is probably the most parsimonious approach.
This assumes your dataset has a variable "year" taking values from 1993 through 2008, and that the values specified for "exmp" are in ascending year order.

matrix input exmp=(2350, 2450, 2500, 2550, 2650, 2700, 2750, 2800, 2900, 3000, 3050, 3100, 3200, 3300, 3400, 3500)
gen int exemption = exmp[1,year-1992]

Thank you! I just repeated your test. This will be much more efficient than the ways I had thought of or have been suggested so far. I am very pleased that it doesn't require a wholesale conversion to Mata - I get to keep the implicit index over returns and I can generate the code above directly from the existing SAS code.

Sorry I got the number of elements to the vector wrong in my SAS example - the number of initializers should equal the number of array elements or I will get the wrong answer.
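
A minimal sketch of the matrix lookup with a guard against the length mismatch described above. The toy dataset and the 1993 through 2008 bounds are assumptions for illustration; `colsof()` and `inrange()` are standard Stata functions.

```stata
clear
set obs 16
generate int year = 1992 + _n          // toy data: one return per year

matrix input exmp = (2350, 2450, 2500, 2550, 2650, 2700, 2750, 2800, ///
    2900, 3000, 3050, 3100, 3200, 3300, 3400, 3500)

* Stop early if the number of initializers does not match the year range
assert colsof(exmp) == 2008 - 1993 + 1

* Look up each observation's exemption; years outside the range stay missing
generate int exemption = exmp[1, year - 1992] if inrange(year, 1993, 2008)
list, noobs
```

The `assert` turns a miscounted initializer list into an immediate error rather than a silently wrong exemption.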

BTW, anyone looking for the existing Stata-callable Fortran version should run:

   net from "";
   net describe taxsim9

That version covers federal tax from 1960-2013 and state income taxes from 1979-2009.

Thanks again,

Daniel Feenberg

----- "Joseph Coveney" <> wrote:

Daniel Feenberg wrote:

I have done programs to calculate income tax liability in SAS and Fortran.
Both those languages allow tax parameters that vary across years and
filing status to be held in initialized arrays. For example, in SAS I
could declare:

    array exmp(1993:2010) _temporary_;
    retain exmp 2350 2450 2500 2550 2650 2700 2750 2800 2900 3000 3050
                3100 3200 3300 3400 3500;

and then assigning the correct value of the personal exemption to each
individual record is just:

    exemption = exmp(flpdyr);

where flpdyr is a variable in the data with the filing year. I am at a bit
of a loss as to how to do this in Stata. I don't like:

    gen exemption = (flpdyr==1993)*2350 + (flpdyr==1994)*2450...(for 18
subexpressions in all)

or

    gen     exemption = 2350 if flpdyr==1993
    replace exemption = 2450 if flpdyr==1994
    ...(for 18 lines in all)...

because these require (and execute) so much repetitive code.

Another possibility is to create a dataset of parameters by year and
filing status, then sort the tax return data by year and filing status,
and finally merge the parameters onto the tax return data. But that
requires a sort and a lot of I/O, which could be slow with potentially
millions of returns. The additional memory required is probably not a
problem.
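
For comparison, a hedged sketch of the merge approach just described. It assumes Stata 11 or later (for the `merge m:1` syntax), uses only the first four exemption values from the post, and fabricates a small toy returns dataset in place of the real file.

```stata
* Parameter table: one row per filing year (first four years only)
clear
input flpdyr exemption
1993 2350
1994 2450
1995 2500
1996 2550
end
tempfile params
save `params'

* Toy tax-return data standing in for the real file (an assumption)
clear
set obs 8
generate int flpdyr = 1993 + mod(_n, 4)

* m:1 merge: many returns per year, one parameter row per year
merge m:1 flpdyr using `params', keep(master match) nogenerate
list, noobs
```

This keeps the parameters in data rather than code, at the cost of the sort and I/O noted above.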

I don't actually know Mata, but I think I could define a rowvector:

     exmp =  (2350, 2450, 2500, 2550, 2650, 2700, 2750, 2800, 2900, 3000, 3050,
                3100, 3200, 3300, 3400, 3500);

and then loop over all the tax returns executing:

     exemption[i] = exmp[flpdyr[i]-1992];

for each return (where i indexes returns). That seems to mean that every
variable is going to have to carry around a [i] subscript and there will
be 1,000 lines of Mata code executed for each return (rather than the
preferred 1,000 lines of code for all the returns together). That is
less attractive than leaving the observation number implicit, as the
regular Stata language does. Brief study of [M-2] subscripts doesn't
suggest any "matrixy" way of coding this.
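
One possible way around the per-return loop, offered only as a sketch: Mata accepts a vector of positions as a subscript, so the whole lookup can be done in one expression. The toy data and the shortened five-element vector are assumptions for illustration; `st_data()`, `st_addvar()`, and `st_store()` are standard Mata functions.

```stata
clear
set obs 5
generate int flpdyr = 1992 + _n        // toy data: years 1993-1997

mata:
// Column vector of exemptions, position 1 = 1993 (values from the post)
exmp = (2350 \ 2450 \ 2500 \ 2550 \ 2650)

// Year minus 1992 gives each return's position in exmp
idx = st_data(., "flpdyr") :- 1992

// Subscripting by a vector of positions performs all lookups at once
st_store(., st_addvar("int", "exemption"), exmp[idx])
end

list, noobs
```

Note that a missing or out-of-range filing year would make the subscript invalid, so real data would need a range check first.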

I expect I am missing something obvious; can someone point me in the
right direction?


The number of years is limited and they're integers, so you could
probably get away with value labels and a one-shot work-up (see below).
This approach might be faster than any -merge- (with its implicit -sort-)
when you have millions of observations in the tax-record dataset.

I'd bet that becoming familiar with Mata's -asarray()- (think: Paul
will be more gratifying in the long run.

Joseph Coveney

P.S.  What does SAS do when you have more index values (18 years) than
values (16 exemptions)?  Does it pad the last value out to the end of
the array, or recycle à la R?

version 11.1

clear *
set more off
set obs 18
generate int year = 1992 + _n

* Begin here
local value_label label define Exemptions
local year 1993
foreach exemption in 2350 2450 2500 2550 2650 ///
    2700 2750 2800 2900 3000 3050 3100 3200 ///
    3300 3400 3500 3550 3600 {
    local value_label `value_label' `year' "`exemption'"
    local ++year
}
* Run the accumulated -label define- command, then attach the labels
`value_label'
label values year Exemptions
decode year, generate(exemption)
_strip_labels year
destring exemption, replace
list, noobs abbreviate(20) separator(0)
