

From: Austin Nichols <austinnichols@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: Memory requirements for factor variables

Date: Mon, 3 May 2010 11:01:14 -0400

Partha--
So you have a problem like below, but with many millions of
observations, so the Mata solution proposed by Federico (and adapted
below) will result in a "unable to allocate real <tmp>[many millions,
501]" error, and you can't use -regress- either?  I.e. let's make sure
we have all the design details before trying to code a solution--are
there many other RHS variables (represented by x below)?  Or one?  Or
none?  About how many obs do you have?

    clear all
    set matastrict off
    set mem 10m
    set more off
    set seed 123456
    loc N 10000
    set obs `N'
    mata
    void factor_reg(r,c,d1,d2,d3,d4,d5,d6,d7,d8,d9,x,y) {
     D=J(r,c,0)
     for(i=1;i<=c;i++) {
      for(j=1;j<=r;j++) {
       if (d1[j]==i|d2[j]==i|d3[j]==i|d4[j]==i) D[j,i]=1
       if (d5[j]==i|d6[j]==i|d7[j]==i|d8[j]==i) D[j,i]=1
       if (d9[j]==i) D[j,i]=1
      }
     }
     X=x,D,J(r,1,1)
     invsym(quadcross(X,X))*quadcross(X,y)
    }
    end
    gen x=rnormal()
    forv i=1/9 {
     gen int d`i'=ceil(runiform()*500)
    }
    gen y=x + rnormal()
    tomata
    mata: factor_reg(`N',500,d1,d2,d3,d4,d5,d6,d7,d8,d9,x,y)
    qui forv i=1/500 {
     gen byte Id`i'=(d1==`i')
     forv j=2/9{
      replace Id`i'=1 if (d`j'==`i')
     }
    }
    regress y x Id*

On Mon, May 3, 2010 at 9:40 AM, Partha Deb <partha.deb@hunter.cuny.edu> wrote:
> Austin,
>
> I'll look at -fese-, etc. but to answer your question - yes, I do mean to
> create dummies based on an OR condition, over 9 categorical variables to be
> precise.  Each is a variable that contains codes for disease categories, and
> an individual may present with more than one disease.  Also, I need the
> coefficients on those dummies (a la the coefficients in a hedonic
> regression) so I can't partial them out.
>
> cheers.
>
> Partha
>
> Austin Nichols wrote:
>>
>> Partha--
>> I think you want to model your code on -fese- (ssc desc fese) or
>> -felsdvreg- or -felsdvregdm- (findit felsdvreg).  But can you give a
>> more germane example?  Do you really mean to create dummies based on
>> an OR condition over 4 categorical variables (testing whether any of
>> the four is a given level)?  Do you need estimates for your 500
>> dummies, or do you just want to partial them out of the regression?
>> The second is much easier than the first.
>>
>> forvalues i=1/100 {
>>   gen byte ID`i' = (D1==`i' | D2==`i' | D3==`i' | D4==`i')
>> }
>>
>>
>> On Mon, May 3, 2010 at 9:23 AM, Partha Deb <partha.deb@hunter.cuny.edu>
>> wrote:
>>
>>>
>>> Federico - that is definitely a solution I hadn't thought of.  But, I do
>>> worry that the "simple" formula for the OLS estimate may not be optimal
>>> given the size of the dataset and potential scaling issues.  I'm still
>>> holding out for a slick answer from the Stata gurus, but I might end up
>>> using yours.  Thanks.
>>>
>>> Partha
>>>
>>>
>>> Federico Belotti wrote:
>>>
>>>>
>>>> Partha,
>>>>
>>>> I think there is no way to do that in stata. An alternative could be
>>>> mata.
>>>> Clearly, you have to write down the ado for your econometric model. An
>>>> example using OLS is below.
>>>>
>>>> HTH
>>>>
>>>> Federico
>>>>
>>>>
>>>> ****** do *******
>>>> clear all
>>>> set mem 10m
>>>> set more off
>>>>
>>>> set seed 123456
>>>>
>>>> set obs 100000
>>>>
>>>> mata
>>>> real matrix factor_reg(rows,cols,d1,d2,d3,d4,x,y) {
>>>>
>>>>   D = J(rows,cols,0)
>>>>   for(i=1;i<=cols;i++) {
>>>>     for(j=1;j<=rows;j++) {
>>>>       if (d1[j]==i | d2[j]==i | d3[j]==i | d4[j]==i)
>>>>         D[j,i]=1
>>>>     }
>>>>   }
>>>>   X = x,D,J(100000,1,1)
>>>>   Y = y
>>>>   beta = invsym(X'X)*(X'Y)
>>>>   beta
>>>> }
>>>> end
>>>>
>>>> gen x = rnormal()
>>>> gen u = rnormal()
>>>> gen int d = int(_n/1000)
>>>> gen int d1 = int(_n/1100)
>>>> gen int d2 = int(_n/1200)
>>>> gen int d3 = int(_n/1300)
>>>> gen int d4 = int(_n/1400)
>>>>
>>>> sum
>>>>
>>>> gen y = x + u
>>>>
>>>> describe,s
>>>>
>>>> regress y x i.d
>>>>
>>>> sum d
>>>>
>>>> tomata
>>>> mata: factor_reg(100000,100,d1,d2,d3,d4,x,y)
>>>>
>>>> forvalues i=1/`r(max)' {
>>>>
>>>>   gen byte Id`i' = (d1==`i' | d2==`i' | d3==`i' | d4==`i')
>>>> }
>>>>
>>>> describe,s
>>>>
>>>> regress y x Id*
>>>>
>>>>
>>>> exit
>>>>
>>>>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
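
One possible way around the "unable to allocate" error discussed in the message above is to never hold the full N x 501 matrix in memory at once. Because X'X and X'y are sums over observations, they can be accumulated over blocks of rows. What follows is a minimal sketch of that idea and is not code from the thread: the function name block_factor_reg, its arguments, and the block size of 1000 are all hypothetical, and it assumes every code in d1-d9 is a nonmissing integer between 1 and c. It reads the data through st_view(), so it should not need -tomata-.

```stata
clear all
set seed 123456
set obs 10000
gen x = rnormal()
forvalues i = 1/9 {
    gen int d`i' = ceil(runiform()*500)
}
gen y = x + rnormal()

mata:
// Hypothetical helper (not from the thread): OLS of y on x, c "OR" dummies,
// and a constant, accumulating X'X and X'y over blocks of bsize rows so that
// only a bsize x (c+2) design matrix exists at any one time.
// Assumes all codes in dvars are nonmissing integers in 1..c.
real colvector block_factor_reg(string scalar dvars, string scalar xvar,
    string scalar yvar, real scalar c, real scalar bsize)
{
    real matrix d, x, y, D, X, XX, Xy
    real scalar n, k, lo, hi, j, m

    st_view(d, ., tokens(dvars))    // views, not copies, of the Stata data
    st_view(x, ., xvar)
    st_view(y, ., yvar)
    n  = rows(d)
    k  = c + 2                      // x, c dummies, constant
    XX = J(k, k, 0)
    Xy = J(k, 1, 0)
    for (lo = 1; lo <= n; lo = lo + bsize) {
        hi = min((lo + bsize - 1, n))
        D  = J(hi - lo + 1, c, 0)
        for (j = lo; j <= hi; j++) {
            for (m = 1; m <= cols(d); m++) {
                // dummy is 1 if ANY of the code variables equals that level
                D[j - lo + 1, d[j, m]] = 1
            }
        }
        X  = x[|lo, 1 \ hi, 1|], D, J(hi - lo + 1, 1, 1)
        XX = XX + quadcross(X, X)
        Xy = Xy + quadcross(X, y[|lo, 1 \ hi, 1|])
    }
    return(invsym(XX) * Xy)
}
end

mata: b = block_factor_reg("d1 d2 d3 d4 d5 d6 d7 d8 d9", "x", "y", 500, 1000)
mata: b[1]      // coefficient on x
```

The coefficient order matches X=x,D,J(r,1,1) in the message above: x first, then the 500 dummies, then the constant. Summing block cross products in ordinary double precision may give up some of the accuracy of a single quadcross() call over all rows, so the scaling concern Partha raises would still need checking on the real data.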

