Note: This FAQ is for users of Stata 6. It is not relevant for more recent versions.

Stata 6: Why were the timings in the American Statistician (August 1997) review of the svy commands so slow?

Title   Stata 6: Survey and robust estimators
Author Bill Sribney, StataCorp

Here is Table 4 from the AmStat review [Cohen, S.B. (1997) “An Evaluation of Alternative PC-Based Software Packages Developed for the Analysis of Complex Survey Data.” American Statistician 51(3): 285-292.]:

 Table 4.  Approximate Execution Times for PC Software Packages to Produce
 Required Output
 -------------------------------------------------------------------------------
 Type of statistic         Stata           SUDAAN           WesVarPC
 -------------------------------------------------------------------------------
 
 Means (n = 34,459)      18 minutes      <1 minute         15 minutes
 Totals (n = 34,459)     20 minutes           a                 a
 
 Means (n = 28,704)      13 minutes      <1 minute         12 minutes
 Totals (n = 28,704)     14 minutes           a                 a
 
 Ratios (n = 34,459)     64 minutes      <1 minute         33 minutes
 Totals (n = 34,459)     23 minutes           b                 b
 
 -------------------------------------------------------------------------------
 (a) Included as output with mean estimates.
 (b) Included as output with ratio estimates.

Explanation

By default, the svy commands compute the covariances for all combinations of variables and subgroups. With several variables and many subgroups, this can be a sizable computation.

If the commands are run differently (one variable at a time), one can get the same output in less time. I estimate that each of the runs in Table 4 of the article could have been done in 10 minutes or less (on the reviewer's computer), rather than the 13-64 minutes cited in the table.

Usually, one wants only the covariances between different subgroups for the same variable. Estimating one variable per command computes only these covariances, which is significantly faster when there are many subgroups (more than about 10).

For example, consider the command:

    . svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)

The variable age has 5 categories, race3 has 3, and smpsexr has 2, for a total of 5 x 3 x 2 = 30 subgroups. With 3 variables, svymean estimates 3 x 30 = 90 means and, by default, also computes their 90 x 90 covariance matrix, which has 90*91/2 = 4095 distinct elements (the matrix is symmetric).

The covariances are useful if you then want to estimate subgroup differences using the svylc command. However, it is unlikely that you want to estimate the difference between totalexp in one subgroup and totalsp1 in another; rather, you want to estimate only the differences of totalexp (and, separately, of totalsp1 and totalsp2) between subgroups. Thus you can run the three commands:

    . svymean totalexp, by(age race3 smpsexr)
    . svymean totalsp1, by(age race3 smpsexr)
    . svymean totalsp2, by(age race3 smpsexr)

Each of these commands computes only a 30 x 30 covariance matrix, so the three together are faster than the single command: only 30*31/2 + 30*31/2 + 30*31/2 = 465 + 465 + 465 = 1395 elements are computed, about a third of the 4095 computed by the single command.
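The element counts above can be checked with a short sketch. It is written in Python rather than Stata purely for illustration, and the names cov_elements, joint, and separate are hypothetical, not part of any Stata command. It counts distinct covariance elements only, which is a proxy for the work involved and not a timing model.

```python
def cov_elements(n_estimates):
    """Distinct elements (diagonal plus one triangle) of a symmetric n x n matrix."""
    return n_estimates * (n_estimates + 1) // 2

n_subgroups = 5 * 3 * 2   # age x race3 x smpsexr
n_variables = 3           # totalexp, totalsp1, totalsp2

# One command with all three variables: a single 90 x 90 matrix.
joint = cov_elements(n_variables * n_subgroups)

# One command per variable: three separate 30 x 30 matrices.
separate = n_variables * cov_elements(n_subgroups)

print(f"joint: {joint}, separate: {separate}")
```

Under these assumptions the sketch reproduces the 4095 and 1395 figures given in the text.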

If you don’t care about the covariances (and don’t intend to estimate subgroup differences), you can simply use the available option.

    . svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr) available

The available option automatically does computations one variable at a time. It is equivalent to running the three commands above, with one variable per command (in fact, this is what the code actually does).

The ratio timings in the review were particularly slow since 6 ratios for each subgroup were computed in each command. For example, for the svyratio command run by(age race3 smpsexr), 6 x 30 = 180 ratios were computed, along with their 180 x 180 covariance matrix.

Duplicating the AmStat review timings

I have attempted to duplicate two of the timings from Table 4 (p. 289) of the AmStat review.

  1. the timing for means
  2. the timing for ratios

I duplicated the runs using the same commands the reviewer used and also using a set of commands that produced the same output much faster.

I used simulated data based on the description of the data in the review, so the number of observations and number of subgroups were the same. These are the main factors that affect the timings, so there should not be much difference due to the different data sets.

 Table I:  My timings compared to reviewer's timings
 -------------------------------------------------------------------------------
 
                                    My timings(2)
                          ---------------------------------
                               Duplication         Faster set
 Statistic   Reviewer(1)  of reviewer's runs    of commands   fast/slow
                 A                B                  C           C/B
 ----------------------------------------------------------------------
 
   Means      18 min.        2.80 min.(3)       1.87 min.(4)     67 %
 
   Ratios     64 min.       62.1  min.(5)       8.9  min.(6)     14 %
 
 -------------------------------------------------------------------------------
Notes:
  1. Reviewer used a 75-MHz Pentium with 16 MB of RAM. I believe the OS was Windows 3.1; the article did not say. Virtual memory is NOT an issue: the data set fit within available RAM.
  2. I used a 233-MHz Pentium with 64 MB of RAM, running Windows 95.
  3. See Table III below for the commands used; these commands for "means" were given explicitly in the review (p. 288).
  4. See Table IV below for the commands used.
  5. See Table V below for the commands used. The commands for the "ratios" were inferred from the article. Estimates were obtained for the same subgroups for the "ratios" as for the "means", as the article stated.
  6. See Table VI below for the commands used.

Comments:

I am baffled as to why the reviewer’s “ratios” run and my duplication of it had about the same timings, whereas our “means” runs were so different. I'd expect my runs to be about three times faster given my faster machine.
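The puzzle can be put in numbers with a rough sketch (Python, for illustration only). It assumes that the ratio of clock speeds (233 MHz versus 75 MHz, from the notes to Table I) is a fair guess at the expected speedup, which is a crude assumption; the variable names are hypothetical.

```python
# Clock speeds from the notes to Table I.
reviewer_mhz = 75
author_mhz = 233
expected_speedup = author_mhz / reviewer_mhz

# (reviewer's time, my duplication of the reviewer's run), in minutes,
# taken from columns A and B of Table I.
timings = {"means": (18.0, 2.80), "ratios": (64.0, 62.1)}

observed = {}
for name, (reviewer_min, duplicate_min) in timings.items():
    observed[name] = reviewer_min / duplicate_min
    print(f"{name}: observed {observed[name]:.2f}x, "
          f"expected about {expected_speedup:.1f}x")
```

On these figures the means run sped up by about twice the clock-speed ratio, while the ratios run barely sped up at all.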

 Table II:  Some command-by-command comparisons
 -------------------------------------------------------------------------------
 
              Command                                                  Time
 ------------------------------------------------------------------------------
 
 svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)            66.5 sec.
 
 svymean totalexp, by(age race3 smpsexr)                               8.8 sec.
 svymean totalsp1, by(age race3 smpsexr)                               8.1 sec.
 svymean totalsp2, by(age race3 smpsexr)                               8.0 sec.
                                                                      ---------
                                                               total  24.9 sec.
 ------------------------------------------------------------------------------
 
 svyratio totalsp1/totalexp totalsp2/totalexp totalsp3/totalexp
          totalsp4/totalexp totalsp5/totalexp totalsp6/totalexp,
   by(age race3 smpsexr)                                             35.6  min.
 
 svyratio totalsp1/totalexp, by(age race3 smpsexr)                    0.49 min.
 svyratio totalsp2/totalexp, by(age race3 smpsexr)                    0.49 min.
 svyratio totalsp3/totalexp, by(age race3 smpsexr)                    0.48 min.
 svyratio totalsp4/totalexp, by(age race3 smpsexr)                    0.48 min.
 svyratio totalsp5/totalexp, by(age race3 smpsexr)                    0.48 min.
 svyratio totalsp6/totalexp, by(age race3 smpsexr)                    0.53 min.
                                                                     ----------
                                                               total  2.95 min.
 -------------------------------------------------------------------------------
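As a quick check of Table II, the following sketch (Python, for illustration only; variable names are hypothetical) adds up the one-variable-per-command timings and compares them with the single-command timings.

```python
# Timings copied from Table II.
means_joint_sec = 66.5
means_split_sec = [8.8, 8.1, 8.0]
ratios_joint_min = 35.6
ratios_split_min = [0.49, 0.49, 0.48, 0.48, 0.48, 0.53]

# Totals for the split runs, rounded to the precision used in the table.
means_total = round(sum(means_split_sec), 1)
ratios_total = round(sum(ratios_split_min), 2)

# How many times faster the split runs are than the single command.
means_speedup = means_joint_sec / means_total
ratios_speedup = ratios_joint_min / ratios_total

print(f"means:  {means_total} sec. total, {means_speedup:.1f}x faster")
print(f"ratios: {ratios_total} min. total, {ratios_speedup:.1f}x faster")
```

The totals match those printed in the table, and the split runs come out roughly 2.7 times faster for the means and roughly 12 times faster for the ratios.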
 Table III:  Commands used by reviewer for means (from p. 288 of review)
 -------------------------------------------------------------------------------
 
      Command                                       Number of subgroups
 -----------------------------------------------------------------------------
 
 1.  svymean totalexp totalsp1 totalsp2                                      0
 2.  svymean totalexp totalsp1 totalsp2, by(age)                             5
 3.  svymean totalexp totalsp1 totalsp2, by(smpsexr)                         2
 4.  svymean totalexp totalsp1 totalsp2, by(race3)                           3
 5.  svymean totalexp totalsp1 totalsp2, by(povstal)                         5
 6.  svymean totalexp totalsp1 totalsp2, by(ratehlth)                        4
 7.  svymean totalexp totalsp1 totalsp2, by(ssmsa)                           4
 8.  svymean totalexp totalsp1 totalsp2, by(sregion)                         4
 9.  svymean totalexp totalsp1 totalsp2, by(cendiv)                          9
 10. svymean totalexp totalsp1 totalsp2, by(povstal ratehlth)       5 x 4 = 20
 11. svymean totalexp totalsp1 totalsp2, by(age race3)              5 x 3 = 15
 12. svymean totalexp totalsp1 totalsp2, by(age smpsexr)            5 x 2 = 10
 13. svymean totalexp totalsp1 totalsp2, by(race3 smpsexr)          3 x 2 =  6
 14. svymean totalexp totalsp1 totalsp2, by(age race3 smpsexr)  5 x 3 x 2 = 30
 -------------------------------------------------------------------------------

 Table IV:  Faster way to get the same output as Table III commands
 -------------------------------------------------------------------------------
 
 1-9.    (use same commands as Table III)
 
 10.     svymean totalexp, by(povstal ratehlth)
         svymean totalsp1, by(povstal ratehlth)
         svymean totalsp2, by(povstal ratehlth)
 
 11.     svymean totalexp, by(age race3)
         svymean totalsp1, by(age race3)
         svymean totalsp2, by(age race3)
 
 12-13.  (use same commands as Table III)
 
 14.     svymean totalexp, by(age race3 smpsexr)
         svymean totalsp1, by(age race3 smpsexr)
         svymean totalsp2, by(age race3 smpsexr)
 -------------------------------------------------------------------------------

 Table V:  Commands used by reviewer for ratios (as implied by text)
 -------------------------------------------------------------------------------
 
 1.  svyratio totalsp1/totalexp totalsp2/totalexp totalsp3/totalexp
              totalsp4/totalexp totalsp5/totalexp totalsp6/totalexp
 
 2.  svyratio                       "                    , by(age)
 3.  svyratio                       "                    , by(smpsexr)
 4.  svyratio                       "                    , by(race3)
 5.  svyratio                       "                    , by(povstal)
 6.  svyratio                       "                    , by(ratehlth)
 7.  svyratio                       "                    , by(ssmsa)
 8.  svyratio                       "                    , by(sregion)
 9.  svyratio                       "                    , by(cendiv)
 10. svyratio                       "                    , by(povstal ratehlth)
 11. svyratio                       "                    , by(age race3)
 12. svyratio                       "                    , by(age smpsexr)
 13. svyratio                       "                    , by(race3 smpsexr)
 14. svyratio                       "                    , by(age race3 smpsexr)
 -------------------------------------------------------------------------------

 Table VI:  Faster way to get the same output as Table V commands
 -------------------------------------------------------------------------------
 
 1-8.    (use same commands as Table V)
 
 9.      svyratio totalsp1/totalexp, by(cendiv)
         svyratio totalsp2/totalexp, by(cendiv)
         svyratio totalsp3/totalexp, by(cendiv)
         svyratio totalsp4/totalexp, by(cendiv)
         svyratio totalsp5/totalexp, by(cendiv)
         svyratio totalsp6/totalexp, by(cendiv)
 
 10.     svyratio totalsp1/totalexp, by(povstal ratehlth)
         svyratio totalsp2/totalexp, by(povstal ratehlth)
         ...
        
 11.     svyratio totalsp1/totalexp, by(age race3)
         svyratio totalsp2/totalexp, by(age race3)
         ...
    
 12.     svyratio totalsp1/totalexp, by(age smpsexr)
         svyratio totalsp2/totalexp, by(age smpsexr)
         ...
    
 13.     svyratio totalsp1/totalexp, by(race3 smpsexr)
         svyratio totalsp2/totalexp, by(race3 smpsexr)
         ...
    
 14.     svyratio totalsp1/totalexp, by(age race3 smpsexr)
         svyratio totalsp2/totalexp, by(age race3 smpsexr)
         ...
 -------------------------------------------------------------------------------
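To see why splitting pays off mainly for the later commands, the sketch below (Python, for illustration only) counts the distinct covariance elements for each of the 14 ratio commands, run jointly with all 6 ratios and run as 6 separate commands. The subgroup counts are taken from Table III; treating command 1, which has no by() option, as a single group is an assumption, and the function and variable names are hypothetical. Element counts are only a proxy for run time, and the per-command overhead is not modeled.

```python
def cov_elements(n_estimates):
    """Distinct elements of a symmetric n x n covariance matrix."""
    return n_estimates * (n_estimates + 1) // 2

# Subgroups per command (commands 1-14), from Table III.
# Command 1 has no by() option; it is counted as one group (assumption).
subgroups = [1, 5, 2, 3, 5, 4, 4, 4, 9, 20, 15, 10, 6, 30]
n_ratios = 6

rows = []
print("cmd  subgroups   joint  separate")
for cmd, n_groups in enumerate(subgroups, start=1):
    joint = cov_elements(n_ratios * n_groups)      # all 6 ratios in one command
    separate = n_ratios * cov_elements(n_groups)   # one ratio per command
    rows.append((cmd, n_groups, joint, separate))
    print(f"{cmd:>3}  {n_groups:>9}  {joint:>6}  {separate:>8}")
```

For command 14, for example, the joint run involves 16290 distinct elements and the six separate runs involve 2790 in total.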

The bottom line

Even if you run the commands as suggested above (Tables IV and VI), Stata’s svy commands are still slower than the equivalent SUDAAN runs. Unless there are dozens of subgroups, this difference should be only a few minutes.

This is because of the following:

  1. The svy commands always compute the covariances. Even with the available option, the covariances for each variable are computed; they are just not saved.
  2. For subgroups defined by multiple variables (e.g. group1 x group2 x group3), you must run multiple svy commands to duplicate all the SUDAAN output. There are a few seconds of overhead with each command (sorting data, etc.), so this adds a minute or so to the timings.

Point 1 will change in the next release of Stata: the available option will then compute no covariances whatsoever.

Note

I, not the reviewer, was responsible for the way the commands were run in the review. I told the reviewer to put more than one variable in each command, since he was counting the total number of commands as a measure of “ease of application”.

As shown above, this is a slightly more efficient way to run the commands when there are only a few subgroups (fewer than about 10), but not when there are many subgroups. I overlooked this fact when advising the reviewer.