Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

re: st: Use of matrix values in generate statements

From	Christopher Baum <[email protected]>
To	"[email protected]" <[email protected]>
Subject	re: st: Use of matrix values in generate statements
Date	Sat, 26 Mar 2011 19:55:14 -0400

<>
Dan says

I continue to work on a tax calculator for Stata.

I am at the point of calculating the standard deduction for each taxpayer. 
There are 6 possible filing statuses and 24 years of tax law, so there are 
144 possible values for the deduction. In SAS, fortran, PL/1, C or any 
other language I know of, the calculation would be some form of:

    stded = stdvalues(year,filestat)

and the processor would index into the 24x6 array of stdvalues to obtain 
the value for each taxpayer. As I understand it, Stata matrices can't be 
used in -generate- statements, though, so I can't do something like:

    matrix input stdvalues (3700 6200...\3800 6350...\...
    generate stded = stdvalues[year-1992,filestat]

(Here and below, ... is meant to conceal a lot of typing on my part but 
3700 is the deduction in 1993 for a single taxpayer, 6350 is the deduction 
in 1994 for a joint return, etc). The most straightforward way I can see 
to calculate the deduction in Stata would be:

   generate   stded = 3700 if year == 1993 & filestat == 1
   replace    stded = 6200 if year == 1993 & filestat == 2
   ...

and so forth, for 144 lines. I have millions of observations, and will 
make thousands of runs, so I am looking for a more efficient solution. My 
next thought is:

   generate stded = (year==1993&filestat==1)*3700+(year==1993&filestat==2)*6200...

which would be one very long line of code once all 144 terms were written 
out, and still quite a bit of wasted arithmetic.  Still a third 
possibility would be -recode-:

   gen filestatyear = year*10+filestat
   recode filestatyear (19931 = 3700)(19932 = 6200)...

but looking at the -recode- .ado file suggests that this is not an 
efficiency gain.

I take it I am supposed to -sort- the data by year and filestat, and then 
-merge- onto a file of parameter values by year and filestat:

   sort year filestat
   merge m:1 year filestat using params

where params is a dataset with the deduction amount for each year and 
filestat. This is a reasonable amount of code (even including the code 
necessary to create params), but it is not space efficient, and it strikes 
me as odd that a large dataset needs to be sorted, just to make some 
simple recodes. Is that right? Am I missing something?
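
The -merge- approach described above can be sketched as follows. This is only an illustration under stated assumptions: the lookup table holds just the four deduction values quoted earlier (1993 and 1994, filing statuses 1 and 2), the three taxpayer records and the variable id are hypothetical, and a tempfile stands in for the params dataset.

```stata
* Lookup table: only the four deduction values quoted in this post
clear
input year filestat stded
1993 1 3700
1993 2 6200
1994 1 3800
1994 2 6350
end
tempfile params
save `params'

* Hypothetical taxpayer records (illustrative only, not real data)
clear
input id year filestat
1 1994 2
2 1993 1
3 1993 2
end

* Attach the deduction to each taxpayer by year and filing status
* (m:1 syntax requires Stata 11 or later)
merge m:1 year filestat using `params', keep(master match) nogenerate
sort id
list
```

The keep(master match) option drops lookup rows that no taxpayer uses, so the result has one row per taxpayer.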

I note that the -egen- command -mtr- must address this same question, but 
it is not very fast - about 1,000 observations/minute on our hardware.

Oddly enough, although one cannot index into a Stata matrix, it is 
possible to index into the observations of a variable:

     generate stded = stdvalues[filestatyear-199200]

is very fast, but doesn't address the problem of filling stdvalues in a 
not too hackish manner (especially if there are fewer than 144 taxpayers 
in the dataset).



The following code will do 1 million table lookups in 8 or 9 seconds on my laptop:

---------------------------------
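clear all
// fake data for lookup table: 24 years x 6 filing statuses
mata: sdlookup = 100*runiform(24,6) :+ 3200

// ten sample (year, filing status) pairs; -input- adds the two new
// variables to the 10 observations created by -set obs-
set obs 10
input year fs
1994 1
1998 2
1999 1
2000 6
2000 5
2005 3
2004 4
1996 2
2008 5
2007 3
// replicate the ten pairs to 1 million observations
expand 100000
// row index into the lookup table (1993 -> 1, ..., 2016 -> 24)
g byte yrind = year - 1992
g stded = .
// report elapsed time for each command
set rmsg on
mata
// views onto the two index variables and the result variable
st_view(yrfs=., ., ("yrind","fs"))
st_view(stded=., . , "stded")
// look up each observation's deduction by (year index, filing status)
for(i=1; i<=rows(stded); i++) {
	stded[i] = sdlookup[yrfs[i,1], yrfs[i,2]]
}
end
su stded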
---------------------------------
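
The explicit loop above could possibly be replaced by a single vectorized lookup. The sketch below is a variant, not the code that was timed: it flattens the table with vec() and computes a linear index for every observation at once. The 2 x 3 table lut, its values, and the four observations are hypothetical, chosen so the result is easy to check by hand. No timing is claimed for it.

```stata
clear all
// hypothetical 2 x 3 lookup table (rows: year index, columns: filing status)
mata: lut = (10, 20, 30 \ 40, 50, 60)

// hypothetical (year index, filing status) pairs
input yrind fs
1 1
2 3
1 2
2 1
end
g stded = .

mata
// copy the two index variables into a Mata matrix
yrfs = st_data(., ("yrind", "fs"))
// vec() stacks columns, so element (r,c) sits at position (c-1)*rows + r
flat = vec(lut)
idx = (yrfs[,2] :- 1) :* rows(lut) + yrfs[,1]
// write all looked-up values back to the Stata variable in one call
st_store(., "stded", flat[idx])
end
list
```

Unlike the st_view() version, st_data() copies the index variables into Mata, so this variant trades some memory for the absence of a loop.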

Kit

Kit Baum   |   Boston College Economics & DIW Berlin   |   http://ideas.repec.org/e/pba1.html
                              An Introduction to Stata Programming  |   http://www.stata-press.com/books/isp.html
   An Introduction to Modern Econometrics Using Stata  |   http://www.stata-press.com/books/imeus.html



