
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: RE: st: Converting a SAS datastep to Stata


From   Nick Cox <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   RE: RE: st: Converting a SAS datastep to Stata
Date   Thu, 16 Dec 2010 17:12:25 +0000

Bill wrote here, among much good stuff, 

       program taxyear2003
               local r = `replace'
              `r' _amt5pc = min(c24533,min(c24532,min(c62700,c24517))) 
              `r' _amt5pc = max(0,_amt5pc) 
              `r' c62747 = .05*_amt5pc 
               ...
       end

He didn't mean that. 

The first line is better as 

local r "replace"

What he wrote instead is legal, but at that point in the program no local macro -replace- is defined, so local r will be born empty, and the rest of the program will fail. 

local r = "replace" 

would work but using an = sign here is a habit to avoid whenever no evaluation of the expression defining the macro is needed. 
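
To see the difference between the three forms, here is a minimal sketch for a do-file. It assumes, as in the program quoted above, that no local macro -replace- has been defined; the -display- lines are added only to make the contents of r visible.

```stata
* Sketch only: run from a do-file in a fresh session, so that
* no local macro named replace exists yet.

local r = `replace'     // `replace' expands to nothing, so r is born empty
display "r is [`r']"

local r = "replace"     // works, but asks Stata to evaluate a string expression
display "r is [`r']"

local r "replace"       // preferred: copies the text, no evaluation needed
display "r is [`r']"
```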

He also wrote twice 

        local `r' = "replace"

and once 

	local R = "replace" 

and those lines are typos for what is above. They should all start 

	local r 

Nick 
[email protected] 

William Gould, StataCorp LP

Concerning SAS code that he is translating to Stata, Daniel Feenberg
<[email protected]> wrote,

> Repeating the if qualifier means repeating a calculation, which is an 
> inefficiency, but it also means repeating the code, which is ugly and 
> distracting. That is why I asked about the possibility of a block level if 
> qualifier. If it doesn't exist, I'll put it in W Gould's suggestion box.

Daniel made the above comments concerning code he is translating from SAS 
to Stata.  The SAS code reads, 

   if FLPDYR eq 2003 then do;
      _amt5pc = min(c24533,min(c24532,min(c62700,c24517)));
      _amt5pc = max(0,_amt5pc);
      c62747 = .05*_amt5pc;
      _line49 = max(0,min(c24532,min(c24517,c62700)) - _amt5pc);
      _line50 = sum(e24583,0);
      _amt8pc = min(_line49,_line50);
      c62749 = .08*_amt8pc;
      _amt10pc = _line49 - _amt8pc;
      c62750 =  .1*_amt10pc;
      _line55 = c24533 - _amt5pc;
      _line56 = min(c24517,c62700) - min(c24532,min(c24517,c62700));
      _amt15pc = min(_line55,_line56);
      c62755 =  .15*_amt15pc;
      _amt20pc = _line56 - _amt15pc;
      c62760 =  .2*_amt20pc;
      _amt25pc = min(c62700,min(c24517+e24515,c24516))-min(c62700,c24517);
      c62770 =  .25*_amt25pc;
      _tamt2  = c62747 + c62749 + c62750 + c62755 + c62760 + c62770;
    end;

The above is the code for one of the years, and Daniel has a lot more 
code for each of the other years.

The problem is that Stata puts if qualifiers on the end of lines whereas 
SAS puts them out front.  In this case, the resulting SAS code is easier 
to read, and to write.


Solution 1
----------

My first solution addresses the readability issue and allows Daniel to
translate the code with easy-to-apply global edits:

      local R = "replace"

      local if = "if FLPDYR==2003"
      `r' _amt5pc = min(c24533,min(c24532,min(c62700,c24517))) `if'
      `r' _amt5pc = max(0,_amt5pc) `if'
      `r' c62747 = .05*_amt5pc `if' 
      `r' _line49 = max(0,min(c24532,min(c24517,c62700)) - _amt5pc) `if'
      ...
      `r' _tamt2  = c62747 + c62749 + c62750 + c62755 + c62760 + c62770 `if'

      local if = "if FLPDYR==2004"
      `r' ... `if'
      ...

What I did was add `r' to the front of each of the original SAS lines, 
and replace the semicolon at the end with `if'.

This solution does not address Daniel's comment about code efficiency (the
reinterpretation of the `if' line by line by line), but does address 
the problem with "ugly and distracting".  

By the way, concerning efficiency, while I agree that reevaluating the if line
by line by line is inefficient, that does not imply that the above Stata code
runs more slowly than the original SAS code.  Stata keeps the data in memory,
and all the rest of Stata has been optimized for that. SAS reads data from
disk, and all the rest of SAS has been optimized for that.  I do not know
which package will be faster in this case.  All I know for sure is that, as
dataset size grows, the SAS code will slow down less than will the Stata code
I just suggested.


Solution 2 
----------

Let's get rid of the `if' on the end.  The solution below might be more
efficient, but I don't guarantee it.  I'm about to substitute disk I/O for
re-evaluation of the if, and thus make Stata more closely mimic how SAS
operates.  I don't guarantee this solution is faster because, as previously
stated, Stata is very fast at re-evaluating if statements, and because I will
end up substituting more I/O than SAS performs in this case executing its
code.

I have other reasons for suggesting this solution, which reasons will become
obvious in the telling.

The solution is, 

        local `r' = "replace"
        forvalues yr=1980(1)2008 {
           save hold, replace
           keep if FLPDYR == `yr' 

           if `yr'==1980 {
                ...
           }
           else if `yr'==1981 {
                ...
           }
           ...
           else if `yr'==2003 {
              `r' _amt5pc = min(c24533,min(c24532,min(c62700,c24517))) 
              `r' _amt5pc = max(0,_amt5pc) 
              `r' c62747 = .05*_amt5pc 
               ...
           }
           else ...

           save result, replace emptyok
           use hold
           drop if FLPDYR==`yr'
           append using result
        }

Here's an even more readable version of this solution:

        local `r' = "replace"
        forvalues yr=1980(1)2008 {
           save hold, replace
           keep if FLPDYR == `yr' 
           taxyear`yr'
           save result, replace emptyok
           use hold
           drop if FLPDYR==`yr'
           append using result
        }

Note the line -taxyear`yr'-.  If `yr'==2003, then that will execute 
the subroutine -taxyear2003-.  Cute, huh?

Then I write subroutines for each of the tax years, such as 

       program taxyear2003
               local r = `replace'
              `r' _amt5pc = min(c24533,min(c24532,min(c62700,c24517))) 
              `r' _amt5pc = max(0,_amt5pc) 
              `r' c62747 = .05*_amt5pc 
               ...
       end


What I like about this solution is that the resulting code is very readable --
perhaps even more readable than the original SAS code -- and it does not
require changing the original SAS code much.
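
For anyone who wants to try the pattern end to end, here is a minimal sketch, not a definitive implementation. The toy data and its values are made up, only the first three lines of the 2003 block are included, the loop covers a single year, and -tempfile- is swapped in for the named files hold and result so that reruns do not collide with files left on disk. The result variables are created up front because -replace- requires them to exist.

```stata
* Sketch under assumptions: toy data, one tax year, and only the first
* three lines of the 2003 block. Variable values are made up.
clear
set obs 4
generate FLPDYR = cond(_n <= 2, 2003, 2004)
generate c24533 = 100*_n
generate c24532 = 150
generate c62700 = 400
generate c24517 = 250
generate _amt5pc = .        // replace needs the variables to exist already
generate c62747  = .

capture program drop taxyear2003
program taxyear2003
        local r "replace"   // plain copy, no = sign
        `r' _amt5pc = min(c24533,min(c24532,min(c62700,c24517)))
        `r' _amt5pc = max(0,_amt5pc)
        `r' c62747 = .05*_amt5pc
end

tempfile hold result        // temporary files instead of hold.dta, result.dta
forvalues yr = 2003/2003 {
        save `hold', replace            // full dataset
        keep if FLPDYR == `yr'          // work on one year only
        taxyear`yr'                     // runs -taxyear2003- when yr is 2003
        save `result', replace emptyok  // processed subset (may be empty)
        use `hold', clear
        drop if FLPDYR == `yr'
        append using `result'           // put the processed year back
}
list FLPDYR _amt5pc c62747
```

Each subroutine must be defined before the loop calls it, which is why the -program- block comes first in the do-file.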


Other solutions
---------------

Daniel could use Mata.  That would address both the readability and efficiency
issues.  If I were writing this code for the first time, that is what I would
do, probably.  With Mata, I can go through the observations one at a time just
as SAS does.
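
A rough sketch of what the Mata route might look like for the first three lines of the 2003 block follows. The function name taxyear2003_mata is hypothetical, and the code assumes that the listed variables are in memory and that _amt5pc and c62747 already exist as float or double variables.

```stata
* Sketch only: assumes the variables below exist in the data in memory
* and that _amt5pc and c62747 were already created as numeric variables.
mata:
void taxyear2003_mata()
{
        real matrix X
        real scalar i

        // X is a view onto the data: assignments write through to Stata
        st_view(X, ., ("FLPDYR", "c24533", "c24532", "c62700", "c24517", "_amt5pc", "c62747"))

        // go through the observations one at a time
        for (i=1; i<=rows(X); i++) {
                if (X[i,1] == 2003) {
                        X[i,6] = max((0, min(X[i, (2..5)])))
                        X[i,7] = .05*X[i,6]
                }
        }
}
end

mata: taxyear2003_mata()
```

The year test is made once per observation rather than once per statement, which is the block-level behavior of the SAS code.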

But if I had code already written in SAS, I would use solution 2, version 2.
The changes required by that solution are minimal and I will spend less 
time debugging and convincing myself that I had the same answers as 
previously, than if I started all over again.



