
Re: st: Teach an old dog new tricks

From   "Joseph Coveney"
To   "Statalist"
Subject   Re: st: Teach an old dog new tricks
Date   Wed, 14 May 2008 02:07:48 +0900

Austin Nichols wrote:
As for the sparse matrix problem in (A), you can generate a new
variable with all distinct concatenations of rowvar and colvar, then
cycle over the values of that, thereby ignoring the empty cells.

On Tue, May 13, 2008 at 10:18 AM, Sergiy Radyakin wrote:
Thank you all, who responded to my request regarding obtaining a
matrix of means. Besides the answers posted in this thread I have
received a couple of suggestions privately. To summarize and close the
thread, the suggestions can be divided roughly into two groups:

 A. Obtaining all possible levels of the by-variables, then cycling
through these values and computing means for each subgroup. This can
be quite slow, especially in case of "sparse" matrices, where only a
few non-empty cells exist (for a 50x50 matrix -summarize- must be
called 2500 times).

 B. Using other Stata commands which can produce matrix of means as a
by-product. Unfortunately none of them is fast enough either. In
particular, Joseph Coveney suggested using xi to automatically create
all combinations of values and then estimating a univariate
regression. Although this is a very short code, it is perhaps the
slowest, and demands large amounts of memory.

Sergiy, it'll help us to help you better if you're more specific about the
scope of your problem up front. Austin's original reply suggesting -tabmat-
seemed ideal to me given what you gave the list to go on, and my suggestion
works well for the example that you gave in your post, which I took to be
illustrative of the scope of the individual summarization that you want to
repeat many times (and for which you therefore want to avoid -preserve-s, etc.).

Austin's point above about concatenating applies to sparse matrix problems
in (B), too:  see below for timing of a (B)-approach compared to -table ,
contents(mean  )-, which is the benchmark you give in your original post.
Note that -anova , noconstant category()- is used in lieu of -xi: regress ,
noconstant-, because it's more efficient here.
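
To make the reasoning concrete: a model with one indicator per non-empty cell
and no constant returns the cell means as its coefficients. The toy data and
the indicator names cell1-cell3 below are assumptions for illustration only,
and -tabulate , generate()- followed by -regress , noconstant- stands in for
the -anova- call used in the benchmark; this is a sketch, not the timed code.

```stata
clear
input byte a byte b float c
1 1 2
1 1 4
1 2 10
2 2 1
2 2 3
2 2 5
end
* Concatenate the two by-variables; only non-empty cells get a value
* (101, 102 and 202 here; cell (2,1) has no observations)
generate int ab = 100 * a + b
* One indicator variable per distinct value of ab: cell1 cell2 cell3
quietly tabulate ab, generate(cell)
* No-constant regression on the indicators: each coefficient is a cell mean
regress c cell1-cell3, noconstant
matrix list e(b)
```

The coefficient vector e(b) holds 3, 10 and 3, the means of c in cells 101,
102 and 202. Creating one indicator per cell is what makes this kind of
approach memory-hungry on large tables, so it is shown only to illustrate why
the coefficients are the means.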

Joseph Coveney

clear *
set matsize 800 // Nothing extraordinary
set memory 10M // Nothing extraordinary
set obs 250000 // I don't know how many you have--is this in the ballpark?
/* A 50 X 50 matrix */
generate byte a = mod(_n, 50)
sort a
generate byte b = mod(_n, 50)
generate float c = uniform()
/* Make that sparse */
foreach var of varlist a b {
    replace `var' = 0 if !inrange(`var', 20, 30)
}
/* Benchmark: -table- with cell means, 10 repetitions */
timer clear 1
quietly forvalues i = 1/10 {
    timer on 1
    table a b, contents(mean c)
    timer off 1
}
/* (B)-approach on the concatenated variable, 10 repetitions */
timer clear 2
quietly forvalues i = 1/10 {
    timer on 2
    generate int ab = 100 *a + b // Concatenation
    anova c ab, noconstant category(ab)
    timer off 2
    drop ab
}
timer list

. timer list
  1:     24.29 /       10 =       2.4295
  2:      7.62 /       10 =       0.7621

. exit

end of do-file
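
For comparison, the same concatenation trick applied to an (A)-style loop
might look like the following minimal sketch. The toy data, the matrix name M
and the column names are assumptions for illustration; it loops only over the
values of ab that actually occur, so empty cells cost nothing.

```stata
clear
set obs 12
generate byte a = mod(_n, 3)          // levels 0, 1, 2
generate byte b = mod(_n, 2)          // levels 0, 1
generate float c = _n
drop if a == 2 & b == 1               // empty one cell to mimic sparsity
generate int ab = 100 * a + b         // concatenation, as in the benchmark
quietly levelsof ab, local(cells)     // distinct non-empty cells only
local k : word count `cells'
matrix M = J(`k', 2, .)               // one row per non-empty cell
matrix colnames M = ab mean
local i = 0
foreach cell of local cells {
    quietly summarize c if ab == `cell', meanonly
    local ++i
    matrix M[`i', 1] = `cell'
    matrix M[`i', 2] = r(mean)
}
matrix list M
```

Here M has five rows rather than six, because -summarize- is never called for
the empty cell (a = 2, b = 1).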
