Dear All,
I am trying to estimate a three-level discrete-time survival model on
a large data set. Ideally, I would create person-year data by
stsplitting or expanding the data and then use GLLAMM to estimate the
model. However, my data set is very large (more than 100M before
stsplitting) and my observation time span is long (60 years maximum),
which together will produce a huge data set that my computer cannot
handle (I have 2 GB of RAM), not to mention how many weeks it would
take GLLAMM to estimate the model.
Since most of my covariates are categorical, I am thinking about
converting all the rest into categorical covariates and creating a
compact data set by aggregating the original data. I got a good tip
from Dan Powers's web site
(http://www.la.utexas.edu/course-materials/sociology/soc386L/) that
this can be done using stsplit and strate. So I did something like the
following:
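* declare the survival data: duration in years, failure = marry
stset durayear, f(marry) id(id)
* split each record into one-year episodes
stsplit tcat, every(1)
egen rcat = group(tcat)
* tabulate events and person-time by period and covariates
qui strate rcat X Y Z, output(mar, replace)
use mar, clear
list _D _Y _Rate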
...
If I understand correctly, here _D represents the number of events
(failures) during each time period (rcat) for each combination of X,
Y, and Z; _Y is the total person-years during the same period and for
the same combination of X, Y, and Z; and _Rate = _D/_Y. If this is
true, then intuitively _D cannot be greater than _Y, but I have quite
a few cases in which _D is in fact greater than _Y.
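To make the problem concrete, this is roughly how the offending cells
can be listed. It is only a sketch: it assumes the file mar written by
strate above still contains rcat, X, Y, and Z alongside _D and _Y, and
the flag name d_gt_y is just something I made up for illustration.

```stata
use mar, clear
* flag cells where the event count exceeds the person-time
* (d_gt_y is a hypothetical helper variable)
gen byte d_gt_y = (_D > _Y) & !missing(_D)
* how many cells are affected
count if d_gt_y
* show the affected cells, grouped by time period
list rcat X Y Z _D _Y _Rate if d_gt_y, sepby(rcat)
```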
My questions are:
1) Is it true that _D cannot be greater than _Y?
2) Has anyone had the same problem before? More importantly,
3) What should I do?
Thank you very much!
Best,
Shige