Note: This FAQ is for users of Stata 6, an older version of Stata.
It is not relevant for more recent versions.
Stata 6: Why were the timings in the American Statistician (August 1997)
review of the svy commands so slow?
Title    Stata 6: Survey and robust estimators
Author   Bill Sribney, StataCorp
Date     October 1997
Here is Table 4 from the AmStat review [Cohen, S.B. (1997) “An Evaluation of
Alternative PC-Based Software Packages Developed for the Analysis of Complex
Survey Data.” American Statistician 51(3): 285-292.]:
Table 4. Approximate Execution Times for PC Software Packages to Produce
Required Output
-------------------------------------------------------------------------------
Type of statistic          Stata          SUDAAN         WesVarPC
-------------------------------------------------------------------------------
Means  (n = 34,459)        18 minutes     <1 minute      15 minutes
Totals (n = 34,459)        20 minutes     a              a
Means  (n = 28,704)        13 minutes     <1 minute      12 minutes
Totals (n = 28,704)        14 minutes     a              a
Ratios (n = 34,459)        64 minutes     <1 minute      33 minutes
Totals (n = 34,459)        23 minutes     b              b
-------------------------------------------------------------------------------
(a) Included as output with mean estimates.
(b) Included as output with ratio estimates.
Explanation
By default, the svy commands compute the covariance for all combinations of
variables and subgroups. If there are several variables and a lot of
subgroups, this can be a sizable computation.
If the commands are run differently (one variable at a time), one can get
the same output in less time. I estimate that each of the runs in Table 4
of the article could have been done in 10 minutes or less (on the reviewer's
computer), rather than the 13-64 minutes cited in the table.
Usually one wants only the covariances between different subgroups for the
same variable. Estimating one variable per command computes only these
covariances, and it is significantly faster when there are many subgroups
(more than about 10).
For example, consider the command:
. svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)
The variable age has 5 categories, race3 has 3, and smpsexr has 2. Thus there
are a total of 5 x 3 x 2 = 30 subgroups. There are 3 variables, so svymean
estimates 3 x 30 = 90 means and, by default, also computes their 90 x 90
covariance matrix (90*91/2 = 4095 distinct elements).
The covariances are useful if you then want to estimate the subgroup
differences using the svylc command. However, it is unlikely that
you want to estimate the difference of totalexp and totalsp1 in different
subgroups; rather you only want to estimate the differences of totalexp (and
separately totalsp1 and totalsp2) in different subgroups. Thus you can run
the three commands
. svymean totalexp, by(age race3 smpsexr)
. svymean totalsp1, by(age race3 smpsexr)
. svymean totalsp2, by(age race3 smpsexr)
Since each of these commands computes only a 30 x 30 covariance matrix, the
three runs together are faster: only 3 x (30*31/2) = 3 x 465 = 1395 elements
are computed, compared with 4095 for the single command.
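The element counts above follow from the fact that a symmetric k x k
covariance matrix has k(k+1)/2 distinct elements. The short Python sketch
below (not Stata code; the function names are made up for illustration)
reproduces the two counts for the example with 3 variables and 30 subgroups.
The count is only a rough proxy for the amount of work, not a model of how
svymean actually spends its time.

```python
def cov_elements(k):
    """Distinct elements in a symmetric k x k covariance matrix."""
    return k * (k + 1) // 2


def joint_vs_split(n_vars, n_subgroups):
    """Return (joint, split) element counts.

    joint: all variables in one command, giving one square matrix of
           dimension n_vars * n_subgroups.
    split: one command per variable, giving n_vars matrices of
           dimension n_subgroups.
    """
    joint = cov_elements(n_vars * n_subgroups)
    split = n_vars * cov_elements(n_subgroups)
    return joint, split


# 3 variables and 5 x 3 x 2 = 30 subgroups, as in the svymean example.
joint, split = joint_vs_split(n_vars=3, n_subgroups=5 * 3 * 2)
print(joint, split)  # 4095 1395
```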
If you don’t care about the covariances (and don’t intend to
estimate subgroup differences), you can simply use the available
option.
. svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr) available
The available option automatically does computations one variable at
a time. It is equivalent to running the three commands above, with one
variable per command (in fact, this is what the code actually does).
The ratio timings in the review were particularly slow since 6 ratios for
each subgroup were computed in each command. For example, for the svyratio
command run by(age race3 smpsexr), 6 x 30 = 180 ratios were computed, along
with their 180 x 180 covariance matrix.
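To see why adding ratios to a single command is costly, the following Python
sketch (illustrative only; the names are hypothetical and nothing here is
Stata code) counts distinct covariance elements for 1 to 6 ratios over the 30
subgroups, comparing one joint command with one command per ratio. It assumes
that the number of covariance elements is a reasonable stand-in for the size
of the computation, which is a simplification.

```python
N_SUBGROUPS = 30  # by(age race3 smpsexr): 5 x 3 x 2


def cov_elements(k):
    """Distinct elements in a symmetric k x k covariance matrix."""
    return k * (k + 1) // 2


def element_counts(n_ratios, n_subgroups=N_SUBGROUPS):
    """Return (joint, split) covariance element counts for n_ratios ratios."""
    joint = cov_elements(n_ratios * n_subgroups)    # one command, all ratios
    split = n_ratios * cov_elements(n_subgroups)    # one command per ratio
    return joint, split


for n_ratios in range(1, 7):
    joint, split = element_counts(n_ratios)
    print(f"{n_ratios} ratios: joint={joint:6d}  split={split:5d}  "
          f"joint/split={joint / split:.2f}")
```

With 6 ratios the joint command involves 180*181/2 = 16290 elements, whereas
six separate commands involve 6 x 465 = 2790.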
Duplicating the AmStat review timings
I have attempted to duplicate two of the timings from Table 4 (p. 289) of the
AmStat review:
- the timing for means
- the timing for ratios
I duplicated the runs using the same commands that the reviewer used, and
also using a set of commands that produced the same output much faster.
I used simulated data based on the description of the data in the review, so
the number of observations and number of subgroups were the same. These are
the main factors that affect the timings, so there should not be much
difference due to the different data sets.
Table I: My timings compared to reviewer's timings
-------------------------------------------------------------------------------
                                      My timings(2)
                          -----------------------------------
                          Duplication          Faster set
Statistic   Reviewer(1)   of reviewer's runs   of commands      fast/slow
                 A                B                 C              C/B
-------------------------------------------------------------------------------
Means         18 min.       2.80 min.(3)       1.87 min.(4)       67 %
Ratios        64 min.       62.1 min.(5)        8.9 min.(6)       14 %
-------------------------------------------------------------------------------
Notes:
(1) Reviewer used a 75-MHz Pentium, 16M RAM. I believe the OS was
    Windows 3.1; it did not say in the article. Virtual memory is NOT an
    issue. The data set fit within available RAM.
(2) I used a 233-MHz Pentium, 64M RAM, running Windows 95.
(3) See Table III below for the commands used; these commands for "means" were
    given explicitly in the review (p. 288).
(4) See Table IV below for the commands used.
(5) See Table V below for the commands used. The commands for the "ratios"
    were inferred from the article. Estimates were obtained for the same
    subgroups for the "ratios" as for the "means", as the article stated.
(6) See Table VI below for the commands used.
Comments:
I am baffled as to why the reviewer’s “ratios” run and my
duplication of it had about the same timings, whereas our
“means” runs were so different. I'd expect my runs to be about
three times faster given my faster machine.
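The "about three times faster" expectation can be checked against the Table I
figures with a few lines of Python (an illustrative sketch, not Stata code).
It assumes, crudely, that run time scales with clock speed alone; memory, disk,
and operating system differences are ignored, and the reviewer's timings are
themselves approximate.

```python
REVIEWER_MHZ = 75   # reviewer's 75-MHz Pentium
MY_MHZ = 233        # my 233-MHz Pentium

# (reviewer minutes, my duplication minutes): columns A and B of Table I
timings = {"means": (18.0, 2.80), "ratios": (64.0, 62.1)}


def observed_speedup(reviewer_min, my_min):
    """How many times faster my duplication ran than the reviewer's run."""
    return reviewer_min / my_min


clock_ratio = MY_MHZ / REVIEWER_MHZ
print(f"clock-speed ratio: {clock_ratio:.2f}")
for name, (reviewer_min, my_min) in timings.items():
    print(f"{name}: observed speedup {observed_speedup(reviewer_min, my_min):.2f}")
```

The clock-speed ratio is about 3.1, while the observed speedups are roughly
6.4 for means and 1.03 for ratios, so neither run matches the naive
expectation.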
Table II: Some command-by-command comparisons
-------------------------------------------------------------------------------
Command                                                          Time
-------------------------------------------------------------------------------
svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)        66.5 sec.

svymean totalexp, by(age race3 smpsexr)                           8.8 sec.
svymean totalsp1, by(age race3 smpsexr)                           8.1 sec.
svymean totalsp2, by(age race3 smpsexr)                           8.0 sec.
                                                                 ---------
                                                          total  24.9 sec.
-------------------------------------------------------------------------------
svyratio totalsp1/totalexp totalsp2/totalexp totalsp3/totalexp
         totalsp4/totalexp totalsp5/totalexp totalsp6/totalexp,
         by(age race3 smpsexr)                                   35.6 min.

svyratio totalsp1/totalexp, by(age race3 smpsexr)                0.49 min.
svyratio totalsp2/totalexp, by(age race3 smpsexr)                0.49 min.
svyratio totalsp3/totalexp, by(age race3 smpsexr)                0.48 min.
svyratio totalsp4/totalexp, by(age race3 smpsexr)                0.48 min.
svyratio totalsp5/totalexp, by(age race3 smpsexr)                0.48 min.
svyratio totalsp6/totalexp, by(age race3 smpsexr)                0.53 min.
                                                                 ---------
                                                          total  2.95 min.
-------------------------------------------------------------------------------
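The speedups implied by Table II can be computed directly. The Python sketch
below (illustrative only, with made-up variable names) sums the
one-variable-per-command timings and divides the single-command timing by that
sum; the numbers are copied from the table.

```python
# svymean timings in seconds, svyratio timings in minutes (Table II).
svymean_joint = 66.5
svymean_split = [8.8, 8.1, 8.0]

svyratio_joint = 35.6
svyratio_split = [0.49, 0.49, 0.48, 0.48, 0.48, 0.53]


def speedup(joint, split):
    """Single-command time divided by the total of the separate commands."""
    return joint / sum(split)


print(f"svymean:  total {sum(svymean_split):.1f} sec., "
      f"speedup {speedup(svymean_joint, svymean_split):.2f}x")
print(f"svyratio: total {sum(svyratio_split):.2f} min., "
      f"speedup {speedup(svyratio_joint, svyratio_split):.2f}x")
```

Splitting the commands is roughly 2.7 times faster for the three means and
roughly 12 times faster for the six ratios.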
Table III: Commands used by reviewer for means (from p. 288 of review)
-------------------------------------------------------------------------------
     Command                                                Number of subgroups
-------------------------------------------------------------------------------
 1.  svymean totalexp totalsp1 totalsp2                             0
 2.  svymean totalexp totalsp1 totalsp2, by(age)                    5
 3.  svymean totalexp totalsp1 totalsp2, by(smpsexr)                2
 4.  svymean totalexp totalsp1 totalsp2, by(race3)                  3
 5.  svymean totalexp totalsp1 totalsp2, by(povstal)                5
 6.  svymean totalexp totalsp1 totalsp2, by(ratehlth)               4
 7.  svymean totalexp totalsp1 totalsp2, by(ssmsa)                  4
 8.  svymean totalexp totalsp1 totalsp2, by(sregion)                4
 9.  svymean totalexp totalsp1 totalsp2, by(cendiv)                 9
10.  svymean totalexp totalsp1 totalsp2, by(povstal ratehlth)       5 x 4 = 20
11.  svymean totalexp totalsp1 totalsp2, by(age race3)              5 x 3 = 15
12.  svymean totalexp totalsp1 totalsp2, by(age smpsexr)            5 x 2 = 10
13.  svymean totalexp totalsp1 totalsp2, by(race3 smpsexr)          3 x 2 = 6
14.  svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)      5 x 3 x 2 = 30
-------------------------------------------------------------------------------
Table IV: Faster way to get the same output as Table III commands
-------------------------------------------------------------------------------
  1-9.  (use same commands as Table III)
   10.  svymean totalexp, by(povstal ratehlth)
        svymean totalsp1, by(povstal ratehlth)
        svymean totalsp2, by(povstal ratehlth)
   11.  svymean totalexp, by(age race3)
        svymean totalsp1, by(age race3)
        svymean totalsp2, by(age race3)
12-13.  (use same commands as Table III)
   14.  svymean totalexp, by(age race3 smpsexr)
        svymean totalsp1, by(age race3 smpsexr)
        svymean totalsp2, by(age race3 smpsexr)
-------------------------------------------------------------------------------
Table V: Commands used by reviewer for ratios (as implied by text)
-------------------------------------------------------------------------------
 1.  svyratio totalsp1/totalexp totalsp2/totalexp totalsp3/totalexp
              totalsp4/totalexp totalsp5/totalexp totalsp6/totalexp
 2.  svyratio     "     , by(age)
 3.  svyratio     "     , by(smpsexr)
 4.  svyratio     "     , by(race3)
 5.  svyratio     "     , by(povstal)
 6.  svyratio     "     , by(ratehlth)
 7.  svyratio     "     , by(ssmsa)
 8.  svyratio     "     , by(sregion)
 9.  svyratio     "     , by(cendiv)
10.  svyratio     "     , by(povstal ratehlth)
11.  svyratio     "     , by(age race3)
12.  svyratio     "     , by(age smpsexr)
13.  svyratio     "     , by(race3 smpsexr)
14.  svyratio     "     , by(age race3 smpsexr)
-------------------------------------------------------------------------------
Table VI: Faster way to get the same output as Table V commands
-------------------------------------------------------------------------------
1-8.  (use same commands as Table V)
  9.  svyratio totalsp1/totalexp, by(cendiv)
      svyratio totalsp2/totalexp, by(cendiv)
      svyratio totalsp3/totalexp, by(cendiv)
      svyratio totalsp4/totalexp, by(cendiv)
      svyratio totalsp5/totalexp, by(cendiv)
      svyratio totalsp6/totalexp, by(cendiv)
 10.  svyratio totalsp1/totalexp, by(povstal ratehlth)
      svyratio totalsp2/totalexp, by(povstal ratehlth)
      ...
 11.  svyratio totalsp1/totalexp, by(age race3)
      svyratio totalsp2/totalexp, by(age race3)
      ...
 12.  svyratio totalsp1/totalexp, by(age smpsexr)
      svyratio totalsp2/totalexp, by(age smpsexr)
      ...
 13.  svyratio totalsp1/totalexp, by(race3 smpsexr)
      svyratio totalsp2/totalexp, by(race3 smpsexr)
      ...
 14.  svyratio totalsp1/totalexp, by(age race3 smpsexr)
      svyratio totalsp2/totalexp, by(age race3 smpsexr)
      ...
-------------------------------------------------------------------------------
The bottom line
Even if you run the commands as suggested above (Tables IV and VI),
Stata’s svy commands are still slower than the equivalent
SUDAAN runs. Unless there are dozens of subgroups, this difference should be
only a few minutes.
This is because of the following:
(1) The svy commands always compute the covariance (even the available
    option computes the covariances for each variable, only they are not
    saved).
(2) For subgroups defined by multiple variables (e.g., group1 x group2 x
    group3), you must run multiple svy commands to duplicate all the SUDAAN
    output. There are a few seconds of overhead with each command (sorting
    data, etc.), so this adds a minute or so to the timings.
Item (1) will be changed in the next release of Stata so that the available
option computes no covariances whatsoever.
Note
I, not the reviewer, was responsible for the way the commands were run in
the review. I told the reviewer to put more than one variable in each
command, since he was counting the total number of commands as a measure of
“ease of application”.
As shown above, this is a slightly more efficient way to run the commands
when there are only a few subgroups (< ∼10), but not when there are
lots of subgroups. I overlooked this fact when advising the reviewer.