Re: st: Converting a SAS datastep to Stata


From   "Michael N. Mitchell" <Michael.Norman.Mitchell@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Converting a SAS datastep to Stata
Date   Mon, 13 Dec 2010 18:04:14 -0800

Dear Daniel

I am sure others will have other ideas, but the strategy I would use is to create a dataset holding the "array" data and then match-merge the master dataset with that "array" dataset on the tax year. For example:

. input id taxyear income

            id    taxyear     income
  1. 1 1998 50000
  2. 2 1997 34000
  3. 3 1995 44321
  4. end

. save master, replace
file master.dta saved

.
. clear

. input taxyear exmp

       taxyear       exmp
  1. 1993 2350
  2. 1994 2450
  3. 1995 2500
  4. 1996 2550
  5. 1997 2650
  6. 1998 2700
  7. 1999 2750
  8. 2000 2800
  9. 2001 2900
 10. 2002 3000
 11. 2003 3050
 12. 2004 3100
 13. 2005 3200
 14. 2006 3300
 15. 2007 3400
 16. 2008 3500
 17. end

. save array, replace
file array.dta saved

.
. use master

. merge m:1 taxyear using array

    Result                           # of obs.
    -----------------------------------------
    not matched                            13
        from master                         0  (_merge==1)
        from using                         13  (_merge==2)

    matched                                 3  (_merge==3)
    -----------------------------------------

. keep if _merge == 3
(13 observations deleted)

. sort id

. list

     +--------------------------------------------+
     | id   taxyear   income   exmp        _merge |
     |--------------------------------------------|
  1. |  1      1998    50000   2700   matched (3) |
  2. |  2      1997    34000   2650   matched (3) |
  3. |  3      1995    44321   2500   matched (3) |
     +--------------------------------------------+
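
The same steps can be written more compactly in a do-file. This is only a sketch: it assumes master.dta and array.dta were saved as in the log above, and it relies on the keep() and nogenerate options of -merge- (available with the m:1 syntax used here), so no separate -keep if _merge == 3- step is needed.

```stata
* Sketch only: assumes master.dta and array.dta exist as saved in the log above
use master, clear

* keep(match) retains only observations found in both files (_merge==3);
* nogenerate suppresses creation of the _merge variable
merge m:1 taxyear using array, keep(match) nogenerate

sort id
list
```

If returns with a tax year outside the lookup table should be kept with a missing exemption rather than dropped, keep(match master) could be used instead.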

I hope that helps,

Michael N. Mitchell
Data Management Using Stata      - http://www.stata.com/bookstore/dmus.html
A Visual Guide to Stata Graphics - http://www.stata.com/bookstore/vgsg.html
Stata tidbit of the week         - http://www.MichaelNormanMitchell.com



On 2010-12-13 4.51 PM, Daniel Feenberg wrote:
I have written programs to calculate income tax liability in SAS and Fortran. Both those
languages allow tax parameters that vary across years and filing status to be held in
initialized arrays. For example, in SAS one could declare:

array exmp(1993:2010) _temporary_;
retain exmp 2350 2450 2500 2550 2650 2700 2750 2800 2900 3000 3050 3100
3200 3300 3400 3500;

and then assigning the correct value of the personal exemption to every individual record
is just:

exemption = exmp(flpdyr);

where flpdyr is a variable in the data with the filing year. I am at a bit of a loss as to
how to do this in Stata. I don't like:

gen exemption = (flpdyr==1993)*2350 + (flpdyr==1994)*2450...(for 18 subexpressions in all)

or

gen exemption = 2350 if flpdyr==1993
replace exemption = 2450 if flpdyr==1994
...(for 18 lines in all)...

because these require (and execute) so much repetitive code.
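
One compact alternative, offered as a sketch rather than a tested solution, is to hold the parameters in a Stata matrix and subscript it inside -generate-. This assumes that flpdyr is numeric and lies in 1993-2008 (the 16 values listed above), and that a matrix subscript in an ordinary expression is evaluated separately for each observation, as it is in the common idiom -generate x = M[_n,1]-.

```stata
* Sketch only: assumes flpdyr is numeric and takes values 1993-2008
matrix exmp = (2350, 2450, 2500, 2550, 2650, 2700, 2750, 2800, ///
               2900, 3000, 3050, 3100, 3200, 3300, 3400, 3500)

* The column subscript is an expression, so the lookup is done per observation
generate exemption = exmp[1, flpdyr - 1992]
```

This keeps the observation number implicit and needs neither a sort nor a second dataset.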

Another possibility is to create a dataset of parameters by year and filing status, then
sort the tax return data by year and filing status, and finally merge the parameters onto
the tax return data. But that requires a sort and a lot of I/O, which could be slow with
potentially millions of returns. The additional memory required is probably not a big issue.

I don't actually know Mata, but I think I could define a rowvector:

exmp = (2350, 2450, 2500, 2550, 2650, 2700, 2750, 2800, 2900, 3000, 3050, 3100,
3200, 3300, 3400, 3500);

and then loop over all the tax returns executing:

exemption[i] = exmp[flpdyr[i]-1992];

for each return (where i indexes returns). That seems to mean that every variable is going
to have to carry around an [i] subscript and that 1,000 lines of Mata code will be
executed for each return (rather than the preferred 1,000 lines of code for all the
returns together). That is much less attractive than leaving the observation number
implicit, as the regular Stata language does. A brief study of [M-2] subscripts doesn't
suggest any "matrixy" way of coding this.
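
For what it is worth, Mata subscripts do accept a vector of indices, so the lookup can probably be done for all returns at once without an explicit [i]. The following is a sketch under stated assumptions: flpdyr is a numeric variable with values in 1993-2008 and no missing values, the variable exemption does not exist yet, and the names yr and idx are made up for this illustration.

```stata
* Sketch only: assumes flpdyr is in 1993-2008 with no missing values
mata:
// column vector of exemptions for 1993..2008 (values from the post above)
exmp = (2350 \ 2450 \ 2500 \ 2550 \ 2650 \ 2700 \ 2750 \ 2800 \
        2900 \ 3000 \ 3050 \ 3100 \ 3200 \ 3300 \ 3400 \ 3500)

yr  = st_data(., "flpdyr")               // N x 1 vector of filing years
idx = st_addvar("double", "exemption")   // create the new Stata variable

// vector subscript: one lookup per return, no loop over i
st_store(., idx, exmp[yr :- 1992])
end
```

Here exmp is a column vector so that the subscripted result is already N x 1, which is the shape st_store() expects for a single variable.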

I expect I am missing something obvious. Can someone point me in the right direction?

Thanks

Daniel Feenberg
NBER
Cambridge MA
feenberg@nber.org


*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*

