st: Use of matrix values in generate statements
From: Daniel Feenberg <[email protected]>
To: [email protected]
Subject: st: Use of matrix values in generate statements
Date: Sat, 26 Mar 2011 16:50:10 -0400 (EDT)
I continue to work on a tax calculator for Stata.
I am at the point of calculating the standard deduction for each taxpayer.
There are 6 possible filing statuses and 24 years of tax law, so there are
144 possible values for the deduction. In SAS, Fortran, PL/1, C or any
other language I know of, the calculation would be some form of:
stded = stdvalues(year,filestat)
and the processor would index into the 24x6 array of stdvalues to obtain
the value for each taxpayer. As I understand it, Stata matrices can't be
used in -generate- statements, though, so I can't do something like:
matrix input stdvalues = (3700 6200...\3800 6350...\...
generate stded = stdvalues[year-1992,filestat]
(Here and below, ... is meant to conceal a lot of typing on my part but
3700 is the deduction in 1993 for a single taxpayer, 6350 is the deduction
in 1994 for a joint return, etc). The most straightforward way I can see
to calculate the deduction in Stata would be:
generate stded = 3700 if year == 1993 & filestat == 1
replace stded = 6200 if year == 1993 & filestat == 2
...
and so forth, for 144 lines. I have millions of observations, and will
make thousands of runs, so I am looking for a more efficient solution. My
next thought is:
generate stded = (year==1993&filestat==1)*3700+(year==1993&filestat==2)*6200...
which would be one very long line of code once all 144 terms were written
out, and would still involve quite a bit of wasted arithmetic. A third
possibility would be -recode-:
gen filestatyear = year*10+filestat
recode filestatyear (19931 = 3700)(19932 = 6200)...
but looking at the -recode- .ado file suggests that this is not an
efficiency gain.
I take it I am supposed to -sort- the data by year and filestat, and then
-merge- onto a file of parameter values by year and filestat:
sort year filestat
merge m:1 year filestat using params
where params is a dataset with the deduction amount for each year and
filestat. This is a reasonable amount of code (even including the code
necessary to create params), but it is not space efficient, and it strikes
me as odd that a large dataset needs to be sorted just to make some
simple recodes. Is that right? Am I missing something?
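
For concreteness, the -merge- approach above might be sketched as follows. This is only a toy illustration under stated assumptions: the parameter table is cut down to a 2x2 slice (1993-1994, filestat 1-2) using the four values quoted earlier, and the taxpayer observations and the id variable are made up. The keep() and nogenerate options are the standard -merge- options in Stata 11 and later.

```stata
* Toy parameter table: only the four values quoted in the post
* (a 2x2 slice of the full 24x6 table).
clear
input year filestat stded
1993 1 3700
1993 2 6200
1994 1 3800
1994 2 6350
end
tempfile params
save `params'

* Hypothetical taxpayer data; id exists only for this illustration.
clear
input id year filestat
1 1994 2
2 1993 1
3 1993 2
4 1994 1
end

* Attach the deduction to each taxpayer by year and filing status.
* keep(master match) drops parameter rows that match no taxpayer.
merge m:1 year filestat using `params', keep(master match) nogenerate
list id year filestat stded
```

The parameter file holds one row per year and filing status, so the lookup table itself stays small however large the taxpayer file is.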
I note that the -egen- function -mtr()- must address this same question, but
it is not very fast - about 1,000 observations/minute on our hardware.
Oddly enough, although one cannot index into a Stata matrix, it is
possible to index into the observations of a variable:
generate stded = stdvalues[filestatyear-199200]
is very fast, but doesn't address the problem of filling stdvalues in a
not too hackish manner (especially if there are fewer than 144 taxpayers
in the dataset).
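
To make the hackishness concrete, here is one way the lookup variable could be filled, again as a toy sketch rather than a recommendation. It assumes the same index scheme as above (year*10 + filestat - 199200), uses only the four values quoted earlier, and pads the dataset with empty observations when it is too short to hold the table; the taxpayer data and the names id, n0 and maxidx are hypothetical.

```stata
* Hypothetical taxpayer data, deliberately shorter than the lookup table.
clear
input id year filestat
1 1994 2
2 1993 1
3 1993 2
end

* With the index year*10 + filestat - 199200, 1993/filestat 1 lands in
* observation 11 and 1994/filestat 2 in observation 22.
local n0 = _N
local maxidx = 22
if _N < `maxidx' {
    * Side effect: appends all-missing observations to make room.
    set obs `maxidx'
}

* Store the toy table in the rows that the index will point at.
generate stdvalues = .
replace stdvalues = 3700 in 11
replace stdvalues = 6200 in 12
replace stdvalues = 3800 in 21
replace stdvalues = 6350 in 22

* Look up each taxpayer's deduction by explicit subscripting.
generate filestatyear = year*10 + filestat
generate stded = stdvalues[filestatyear - 199200]

* Remove the padding and the helper variable again.
keep in 1/`n0'
drop stdvalues
list id year filestat stded
```

The padding and cleanup steps are exactly the part that feels awkward: the lookup table has to live in the same observations as the taxpayer data.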
Daniel Feenberg
NBER