Re: st: Memory requirements for factor variables


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Memory requirements for factor variables
Date   Mon, 3 May 2010 11:01:14 -0400

Partha--
So you have a problem like the one below, but with many millions of
observations, so the Mata solution proposed by Federico (and adapted
below) will fail with an "unable to allocate real <tmp>[many millions,
501]" error, and you can't use -regress- either?  Let's make sure we
have all the design details before trying to code a solution: are
there many other RHS variables (represented by x below)?  Or one?  Or
none?  About how many obs do you have?

clear all
set matastrict off
set mem 10m
set more off
set seed 123456
loc N 10000
set obs `N'
mata
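// Build the r x c dummy matrix D, where D[j,i]=1 if any of d1-d9 equals i
// for obs j, then display the OLS coefficients of y on x, D, and a constant.
void factor_reg(r,c,d1,d2,d3,d4,d5,d6,d7,d8,d9,x,y) {
 D=J(r,c,0)
 for(i=1;i<=c;i++) {
  for(j=1;j<=r;j++) {
   if (d1[j]==i|d2[j]==i|d3[j]==i|d4[j]==i) D[j,i]=1
   if (d5[j]==i|d6[j]==i|d7[j]==i|d8[j]==i) D[j,i]=1
   if (d9[j]==i) D[j,i]=1
  }
 }
 X=x,D,J(r,1,1)
 invsym(quadcross(X,X))*quadcross(X,y)
}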
end
gen x=rnormal()
forv i=1/9 {
 gen int d`i'=ceil(runiform()*500)
}
gen y=x + rnormal()
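// -tomata- is user-written (ssc install tomata); it makes each variable
// available as a Mata vector of the same name
tomata
mata: factor_reg(`N',500,d1,d2,d3,d4,d5,d6,d7,d8,d9,x,y)
// Same dummies built as Stata variables, for comparison with -regress-
qui forv i=1/500 {
 gen byte Id`i'=(d1==`i')
 forv j=2/9 {
  replace Id`i'=1 if (d`j'==`i')
 }
}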
regress y x Id*
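
If the allocation error is the binding constraint, one possible workaround is to accumulate X'X and X'y over blocks of rows, so that only a block of the dummy matrix is in memory at any time. What follows is only a sketch under assumptions, not a tested solution for your data: the function name factor_reg_blocks, the block-size argument bsize, and passing the nine code variables as a single matrix d are all my inventions for illustration, and it assumes the codes are integers in 1..c (anything else is ignored).

```stata
mata
// Hypothetical helper (not from the thread above): OLS of y on x, c dummies,
// and a constant, accumulating cross products block by block so the full
// r x c dummy matrix is never allocated.
real colvector factor_reg_blocks(real matrix d, real colvector x,
                                 real colvector y, real scalar c,
                                 real scalar bsize)
{
    real scalar    r, k, lo, hi, n, j, m, v
    real matrix    XX, Xb, Db
    real colvector Xy

    r  = rows(d)
    k  = c + 2                          // x, c dummies, constant
    XX = J(k, k, 0)
    Xy = J(k, 1, 0)
    for (lo=1; lo<=r; lo=lo+bsize) {
        hi = min((lo+bsize-1, r))       // last row of this block
        n  = hi - lo + 1
        Db = J(n, c, 0)
        for (j=1; j<=n; j++) {
            for (m=1; m<=cols(d); m++) {
                v = d[lo+j-1, m]
                // skip codes outside 1..c (this also skips missing values)
                if (v>=1 & v<=c) Db[j, v] = 1
            }
        }
        Xb = x[|lo \ hi|], Db, J(n, 1, 1)
        XX = XX + quadcross(Xb, Xb)
        Xy = Xy + quadcross(Xb, y[|lo \ hi|])
    }
    return(invsym(XX)*Xy)
}
end
mata: factor_reg_blocks(st_data(., "d1 d2 d3 d4 d5 d6 d7 d8 d9"), st_data(., "x"), st_data(., "y"), 500, 1000)
```

The coefficients come out in the same order as in factor_reg above (x, then the dummies, then the constant) and should agree with it up to rounding. Memory use is driven by the block size (here 1000 rows by 502 columns) plus the 502 x 502 cross-product matrix, rather than by the number of observations times 501.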

On Mon, May 3, 2010 at 9:40 AM, Partha Deb <partha.deb@hunter.cuny.edu> wrote:
> Austin,
>
> I'll look at -fese-, etc. but to answer your question - yes, I do mean to
> create dummies based on an OR condition, over 9 categorical variables to be
> precise.  Each is a variable that contains codes for disease categories, and
> an individual may present with more than one disease.  Also, I need the
> coefficients on those dummies (a la the coefficients in a hedonic
> regression) so I can't partial them out.
>
> cheers.
>
> Partha
>
> Austin Nichols wrote:
>>
>> Partha--
>> I think you want to model your code on -fese- (ssc desc fese) or
>> -felsdvreg- or -felsdvregdm- (findit felsdvreg).  But can you give a
>> more germane example?  Do you really mean to create dummies based on
>> an OR condition over 4 categorical variables (testing whether any of
>> the four is a given level)?  Do you need estimates for your 500
>> dummies, or do you just want to partial them out of the regression?
>> The second is much easier than the first.
>>
>> forvalues i=1/100 {
>>  gen byte ID`i' = (D1==`i' | D2==`i' | D3==`i' | D4==`i')
>> }
>>
>>
>> On Mon, May 3, 2010 at 9:23 AM, Partha Deb <partha.deb@hunter.cuny.edu>
>> wrote:
>>
>>>
>>> Federico - that is definitely a solution I hadn't thought of.  But, I do
>>> worry that the "simple" formula for the OLS estimate may not be optimal
>>> given the size of the dataset and potential scaling issues.  I'm still
>>> holding out for a slick answer from the Stata gurus, but I might end up
>>> using yours.  Thanks.
>>>
>>> Partha
>>>
>>>
>>> Federico Belotti wrote:
>>>
>>>>
>>>> Partha,
>>>>
>>>> I think there is no way to do that in Stata. An alternative could be
>>>> Mata.
>>>> Of course, you would have to write the ado for your econometric model
>>>> yourself. An example using OLS is below.
>>>>
>>>> HTH
>>>>
>>>> Federico
>>>>
>>>>
>>>> ******  do *******
>>>> clear all
>>>> set mem 10m
>>>> set more off
>>>>
>>>> set seed 123456
>>>>
>>>> set obs 100000
>>>>
>>>> mata
>>>> real matrix factor_reg(rows,cols,d1,d2,d3,d4,x,y) {
>>>>
>>>>       D = J(rows,cols,0)
>>>>       for(i=1;i<=cols;i++) {
>>>>               for(j=1;j<=rows;j++) {
>>>>                       if (d1[j]==i | d2[j]==i | d3[j]==i | d4[j]==i) D[j,i]=1
>>>>               }
>>>>       }
>>>>       X = x,D,J(rows,1,1)
>>>>       Y = y
>>>>       beta = invsym(X'X)*(X'Y)
>>>>       return(beta)
>>>> }
>>>> end
>>>>
>>>> gen x = rnormal()
>>>> gen u = rnormal()
>>>> gen int d = int(_n/1000)
>>>> gen int d1 = int(_n/1100)
>>>> gen int d2 = int(_n/1200)
>>>> gen int d3 = int(_n/1300)
>>>> gen int d4 = int(_n/1400)
>>>>
>>>> sum
>>>>
>>>> gen y = x + u
>>>>
>>>> describe,s
>>>>
>>>> regress y x i.d
>>>>
>>>> sum d
>>>> local dmax = r(max)
>>>>
>>>> tomata
>>>> mata: factor_reg(100000,100,d1,d2,d3,d4,x,y)
>>>>
>>>> forvalues i=1/`dmax' {
>>>>
>>>> gen byte Id`i' = (d1==`i' | d2==`i' | d3==`i' | d4==`i')
>>>> }
>>>>
>>>> describe,s
>>>>
>>>> regress y x Id*
>>>>
>>>>
>>>> exit
>>>>
>>>>
>>>>
>>>>
