Re: st: multiply two data sets

From	[email protected] (Jeff Pitblado, Stata Corp.)
To	[email protected]
Subject	Re: st: multiply two data sets
Date	Tue, 24 Jun 2003 10:01:33 -0500

Natalie Karavarsamis <[email protected]> asks about multiplying two
datasets, treating them as if they were matrices, where one of the datasets is
too big to convert to a matrix:

> I have two data sets; a data file, call this A, which is 41000 rows x 130
> columns, and another file, call this B, 130 rows by 50 columns.
> 
> I want to multiply A and B (C=AxB). It would be ideal to treat A and B as
> matrices and use matrix multiplication but the maximum matrix size is 11000
> x 11000 (we run Stata 7.0 SE). Is there a way around this? If not, are there
> any suggestions of how else to do this? I don't want to cut matrix A (or B)
> into smaller data sets (matrices).  

If you have enough memory to hold both A and C, which I estimate to be just
under 60m (41,000 observations x 180 double variables x 8 bytes each), I would
suggest using -matrix score-.  -matrix score- generates a new variable from the
linear combination of the elements of a row vector and the variables in memory.
See [P] matrix score.
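
Written out, and assuming the observations of A are indexed by i and the
columns of B by j (notation added here for illustration, not part of the
-matrix score- documentation), the product amounts to one linear combination
per column of B, so each new variable can be built from a single row vector of
coefficients:

```latex
c_{ij} \;=\; \sum_{k=1}^{130} a_{ik}\, b_{kj},
\qquad i = 1,\dots,41000, \quad j = 1,\dots,50 .
```

For a fixed j, the coefficients b_{1j}, ..., b_{130,j} are the j-th column of
B, and the variables a_1, ..., a_130 are the columns of A.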

To illustrate, the following do-file generates two datasets, a.dta and b.dta,
according to the sizes Natalie indicates:

***** BEGIN: genab.do
* generate some data
clear
set mem 50m
set obs 41000
forval i = 1/130 {
	di as txt "generating a`i'"
	gen double a`i' = uniform()
}
save a, replace

clear
set obs 130
forval i = 1/50 {
	di as txt "generating b`i'"
	gen double b`i' = uniform()
}
save b, replace
exit
***** END: genab.do

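In genc.do, prepare for the product by setting the memory to be large enough.
Then put the data from b.dta into a matrix -b- using -mkmat-.  Notice the trick
I use to get a list of all the variable names into the -`varlist'- macro:
clearing the local macro -`0'- and then calling -syntax [varlist]- makes the
optional varlist default to every variable in the dataset.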
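
The macro trick can be tried on its own.  The following is a minimal sketch,
assuming b.dta from genab.do is in the current directory; the macro name
-allvars- is made up for illustration, and -unab- is shown only as another way
to get the same list, not as something the do-files below rely on.

```stata
* show two ways of collecting every variable name into a local macro
use b, clear

* way 1: empty `0' so that -syntax- has nothing to parse;
* the optional varlist then defaults to all variables
local 0
syntax [varlist]
di as txt "`varlist'"

* way 2: let -unab- expand _all into the full list of names
unab allvars : _all
di as txt "`allvars'"
```

Both -di- commands should list b1 through b50.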

Then use the data in a.dta and loop over the columns of matrix -b-, generating
each column of the new dataset/matrix C using -matrix score-.  Note that each
column grabbed from the matrix -b- must be turned into a row vector and given
the variable names from dataset a.dta as its column names, because
-matrix score- uses the column names to find the variables.  After that,
-matrix score- does all the work of multiplying.

***** BEGIN: genc.do
* take matrix product of datasets a.dta and b.dta
* make the matrix from b.dta (the smaller dataset)
clear
set mem 60m
use b
local 0
syntax [varlist]
mkmat `varlist', matrix(b)

* use -matrix score- to compute the linear combinations of the variables in
* a.dta, where the coefficients are from the columns of b.dta
use a, clear
local 0
syntax [varlist]
local k = colsof(b)
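forval i = 1/`k' {
	di as txt "generating c`i'"
	* column i of b, transposed into a row vector of coefficients
	matrix bi = b[1...,`i']'
	* name the coefficients after the variables in a.dta
	matrix colnames bi = `varlist'
	matrix score double c`i' = bi
}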
keep c*
save c, replace
exit
***** END: genc.do
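
One way to spot-check the result is to compare a few rows of c.dta against an
ordinary matrix product, which is small enough to fit in a matrix.  The sketch
below is not one of the tested do-files; the name checkc.do and the matrix
names -a5-, -c5-, and -chk- are made up for illustration, and it assumes
a.dta, b.dta, and c.dta exist as produced by genab.do and genc.do.

```stata
***** BEGIN: checkc.do (hypothetical, for illustration only)
* compare the first 5 rows of c.dta with a5*b computed as matrices
clear
set mem 60m
use b
mkmat b1-b50, matrix(b)

* first 5 observations of a.dta as a 5 x 130 matrix
use a, clear
mkmat a1-a130 in 1/5, matrix(a5)
matrix c5 = a5*b

* the same 5 rows as generated by -matrix score-
use c, clear
mkmat c1-c50 in 1/5, matrix(chk)

* largest relative difference between the two 5 x 50 matrices
di as txt "relative difference: " as res mreldif(c5, chk)
exit
***** END: checkc.do
```

The reported difference should be zero or very close to it; small nonzero
values can arise from rounding in double precision.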

I tested the above do-files using Stata/SE 7.0 and Stata/SE 8.0.

--Jeff
[email protected]