Statalist



Re: st: Teach an old dog new tricks


From   "Joseph Coveney" <jcoveney@bigplanet.com>
To   "Statalist" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Teach an old dog new tricks
Date   Wed, 14 May 2008 02:07:48 +0900

Austin Nichols <austinnichols@gmail.com>:
As for the sparse matrix problem in (A), you can generate a new
variable with all distinct concatenations of rowvar and colvar, then
cycle over the values of that, thereby ignoring the empty cells.
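
A minimal sketch of that idea, not code from the thread: it assumes toy data with variables named a, b and c as in the example further down, builds the concatenated key as 100*a + b, and loops only over the keys that actually occur. The matrix name M and the local macro names are illustrative.

```stata
* Sketch only: toy data; a, b, c mirror the example later in this thread
clear
set seed 12345
set obs 5000
generate byte a = mod(_n, 50)
sort a
generate byte b = mod(_n, 50)
generate float c = runiform()
* Collapse most levels to 0 so that only a few distinct cells remain
foreach var of varlist a b {
    quietly replace `var' = 0 if !inrange(`var', 20, 30)
}
* Concatenated key: one distinct value per non-empty (a, b) cell
generate int ab = 100 * a + b
quietly levelsof ab, local(cells)
local k : word count `cells'
* One row per non-empty cell: a, b, mean of c
matrix M = J(`k', 3, .)
matrix colnames M = a b mean
local i = 0
foreach v of local cells {
    local ++i
    quietly summarize c if ab == `v', meanonly
    matrix M[`i', 1] = floor(`v' / 100)
    matrix M[`i', 2] = mod(`v', 100)
    matrix M[`i', 3] = r(mean)
}
matrix list M
```

The loop calls -summarize- once per non-empty cell rather than once per possible (a, b) pair, which is where the saving for a sparse table comes from.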

On Tue, May 13, 2008 at 10:18 AM, Sergiy Radyakin
<serjradyakin@gmail.com> wrote:
Thank you all, who responded to my request regarding obtaining a
matrix of means. Besides the answers posted in this thread I have
received a couple of suggestions privately. To summarize and close the
thread, the suggestions can be divided roughly into two groups:

 A. Obtaining all possible levels of the by-variables, then cycling
through these values and computing means for each subgroup. This can
be quite slow, especially in case of "sparse" matrices, where only a
few non-empty cells exist (for a 50x50 matrix -summarize- must be
called 2500 times).

 B. Using other Stata commands which can produce matrix of means as a
by-product. Unfortunately none of them is fast enough either. In
particular, Joseph Coveney suggested using xi to automatically create
all combinations of values and then estimating a univariate
regression. Although this is a very short code, it is perhaps the
slowest, and demands large amounts of memory.
--------------------------------------------------------------------------------

Sergiy, it will help us to help you if you are more specific about the
scope of your problem up front. Austin's original reply suggesting -tabmat-
seemed ideal to me given what you gave the list to go on, and my suggestion
works well for the example that you gave in your post. I took that example
to be illustrative of the scope of an individual summarization that you
want to repeat many times, which is why you want to avoid -preserve-s, etc.

Austin's point above about concatenating applies to sparse matrix problems
in (B), too:  see below for timing of a (B)-approach compared to -table ,
contents(mean  )-, which is the benchmark you give in your original post.
Note that -anova , noconstant category()- is used in lieu of -xi: regress ,
noconstant-, because it's more efficient here.

Joseph Coveney

clear *
set matsize 800 // Nothing extraordinary
set memory 10M // Nothing extraordinary
set obs 250000 // I don't know how many you have--is this in the ballpark?
/* A 50 X 50 matrix */
generate byte a = mod(_n, 50)
sort a
generate byte b = mod(_n, 50)
generate float c = uniform()
/* Make that sparse */
foreach var of varlist a b {
    replace `var' = 0 if !inrange(`var', 20, 30)
}
*
timer clear 1
quietly forvalues i = 1/10 {
    timer on 1
    table a b, contents(mean c)
    timer off 1
}
timer clear 2
quietly forvalues i = 1/10 {
    timer on 2
    generate int ab = 100 * a + b // Concatenation
    anova c ab, noconstant category(ab)
    timer off 2
    drop ab
}
timer list
exit


Results:
. timer list
  1:     24.29 /       10 =       2.4295
  2:      7.62 /       10 =       0.7621

. exit

end of do-file
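
The do-file above uses the -anova- syntax current when this was posted (2008). As far as I know, the category() option is not accepted by -anova- from Stata 11 onward except under version control. A hedged sketch of the same cell-means idea for Stata 11 or later uses factor-variable notation with -regress- instead. The toy data and the matrix name means are assumptions for illustration, and no timings are claimed for this variant.

```stata
* Sketch for Stata 11 or later (factor-variable notation); toy data, names assumed
clear
set seed 2008
set obs 5000
generate byte a = mod(_n, 50)
sort a
generate byte b = mod(_n, 50)
generate float c = runiform()
foreach var of varlist a b {
    quietly replace `var' = 0 if !inrange(`var', 20, 30)
}
generate int ab = 100 * a + b // Concatenation, as in the do-file above
* ibn. keeps every level (no base category), so with noconstant
* each coefficient is the mean of c within that cell
quietly regress c ibn.ab, noconstant
matrix means = e(b)
* Spot-check one cell against -summarize-
quietly summarize c if ab == 2025, meanonly
display "regress: " _b[2025.ab] "   summarize: " r(mean)
```

The row vector means then holds one cell mean per non-empty (a, b) combination, with column names of the form 2025.ab identifying the cell.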




