st: how to evaluate predictions with balanced panel data


From   "Dimitriy V. Masterov" <dvmaster@gmail.com>
To   Statalist <statalist@hsphsun2.harvard.edu>
Subject   st: how to evaluate predictions with balanced panel data
Date   Mon, 3 Jun 2013 19:52:20 -0700

I would like to evaluate several different models that provide
predictions of behavior at a monthly level. The panel is balanced,
with n=100,000 and T=12. The outcome is the number of concerts a
person attends in a given month; it is zero for ~80% of the people
in any month, but there's a long right tail of heavy users. The
predictions I have do not seem to respect the count nature of the
outcome: fractional concerts are prevalent.

I don't know anything about the models. I only observe 5 different
black-box predictions yhat1,...,yhat5 for each person per month. I do
have an extra year of data that the model builders did not have for
the estimation (though the concert goers are the same), and I would
like to gauge where each model performs well (in terms of accuracy and
precision). For instance, does some model predict well for frequent
concert goers but fail for the couch potatoes? Is the prediction for
January better than the prediction for December? Alternatively, it
would be nice to know whether the predictions allow me to rank people
correctly in terms of the actuals, even if the exact magnitude cannot
be trusted.
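
To make the ranking question concrete, here is a minimal sketch of a
month-by-month Spearman rank correlation between actuals and one
model's predictions. It is written in plain Python rather than Stata,
purely for illustration. The layout (one dict per person-month with
keys "month", "y", and "yhat1") is an assumption, not something given
in the data description. With ~80% zeros the actuals are heavily tied,
so tied values get their average rank.

```python
from collections import defaultdict
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side has no variation
    return cov / sqrt(vx * vy)

def spearman_by_month(rows, model):
    """Return {month: Spearman correlation of rows['y'] with rows[model]}."""
    by_month = defaultdict(lambda: ([], []))
    for r in rows:
        actual, pred = by_month[r["month"]]
        actual.append(r["y"])
        pred.append(r[model])
    return {m: spearman(a, p) for m, (a, p) in sorted(by_month.items())}
```

Calling spearman_by_month(rows, "yhat1") once per model gives a
12-entry table per model, which shows whether the ranking quality
drifts over the year.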

My first thought was to run fixed-effects regressions of actual on
predicted and time dummies and compare the RMSEs across models. But
that does not tell me where each model does well or whether the
differences are significant. The distribution of the outcome also
worries me with this approach.
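
One way to see where a model does well is to skip the regression and
compute the raw prediction RMSE within groups. This is a simpler,
swapped-in diagnostic, not the fixed-effects approach described above,
and it says nothing about significance. The sketch below is in plain
Python for illustration only; the row layout (dicts with keys "month",
"y", "yhat1") and the grouping functions are assumptions.

```python
from collections import defaultdict
from math import sqrt

def rmse_by_group(rows, model, key):
    """RMSE of rows[model] against rows['y'], separately per key(row)."""
    sq_err = defaultdict(float)
    count = defaultdict(int)
    for r in rows:
        g = key(r)
        sq_err[g] += (r[model] - r["y"]) ** 2
        count[g] += 1
    return {g: sqrt(sq_err[g] / count[g]) for g in sorted(sq_err)}

def by_month(row):
    """Hypothetical grouping: calendar month of the observation."""
    return row["month"]

def by_usage(row):
    """Hypothetical grouping: zero versus positive actual attendance."""
    return "positive" if row["y"] > 0 else "zero"
```

For example, rmse_by_group(rows, "yhat1", by_month) compares January
with December, and rmse_by_group(rows, "yhat1", by_usage) compares the
couch potatoes with everyone else.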

My second idea was to bin the outcome into 0, 1-3, and 4+, and
calculate the confusion matrix, but this ignores the time dimension
unless I make 12 of these. It's also pretty coarse.
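
A minimal sketch of the binned comparison, with one confusion table
per month, is below (plain Python, for illustration only). It rests on
several assumptions: the top bin is taken to mean 4 or more concerts,
fractional predictions are rounded half-up to the nearest whole
concert before binning, negative predictions are treated as zero, and
rows are dicts with keys "month", "y", and "yhat1".

```python
from collections import Counter, defaultdict
from math import floor

def bin_count(value):
    """Map a (possibly fractional) concert count to '0', '1-3', or '4+'."""
    n = max(0, floor(value + 0.5))  # round half-up, clamp negatives to 0
    if n == 0:
        return "0"
    if n <= 3:
        return "1-3"
    return "4+"

def confusion_by_month(rows, model):
    """Return {month: Counter keyed by (actual_bin, predicted_bin)}."""
    tables = defaultdict(Counter)
    for r in rows:
        cell = (bin_count(r["y"]), bin_count(r[model]))
        tables[r["month"]][cell] += 1
    return tables
```

The diagonal cells, such as ("0", "0"), count the person-months where
the prediction lands in the right bin; the rounding rule matters a lot
here, since it decides how many small fractional predictions count as
zero.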

Previous questioners were pointed towards -concord- by N. Cox & T.
Steichen (which has the by() option, but that would require collapsing
the data to annual totals) and Harrell's c (calculated through
-somersd- by R. Newson), which has the cluster() option, but I am not
sure that would allow me to deal with the panel data.

How would you tackle this problem with Stata?

DVM