



Re: st: Inconsistent results with rocfit


From   Ronan Conroy <rconroy@rcsi.ie>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Inconsistent results with rocfit
Date   Tue, 2 Mar 2010 11:09:59 +0000


On 25 Feb 2010, at 18:30, Paul Seed wrote:

Dear Statalist,

An odd problem has come up.
I have two versions of the same predictor
(as measured & logged), and one binary outcome.

When I use -roctab-, I get identical estimates of the ROC area.
When I use -rocfit-, I do not.

The problem is reproducible. Using a dataset I'm currently working on, and a similar setup to Paul's, with

. rocfit diagnosis logbnp1 , cont(5)

I get an ROC area of 0.738, very similar to the 0.724 obtained from -roctab-.

However,

. rocfit diagnosis bnp1, cont(5)

gives an ROC area of 0.358! -roctab- reports the same area as before, 0.724.

It seems to me that the problem is that the -continuous()- option divides the range of the data into intervals of more or less equal width, rather than into quantiles. As a result, when the variable is very skewed, almost all the observations fall into the first one or two groups. Here are the frequency distributions of the grouping variables produced by -cont(5)-:


-> tabulation of cut_bnp1

   cut_bnp1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        109       83.85       83.85
          2 |         15       11.54       95.38
          3 |          3        2.31       97.69
          4 |          2        1.54       99.23
          5 |          1        0.77      100.00
------------+-----------------------------------
      Total |        130      100.00

-> tabulation of cut_logbnp1

cut_logbnp1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         24       18.46       18.46
          2 |         46       35.38       53.85
          3 |         54       41.54       95.38
          4 |          6        4.62      100.00
------------+-----------------------------------
      Total |        130      100.00


As you can see, logbnp1 ended up in four groups, three of which had adequate numbers, while bnp1 had almost no observations in three of its five groups. This is what we used to call a misfeature: something that works as described in the manual, but does something that may not be in the user's best interests. I'd suggest the addition of a -group- option that allowed -continuous()- to produce n groups of more or less equal size.

The more alert (or anyone still reading this) will also note that -cont(5)- produced five groups for bnp1 but only four for logbnp1. This seems to me like a bug.
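
In the meantime, a possible workaround is to build the quantile groups by hand and pass the grouped variable to -rocfit-. What follows is only a sketch, not something I have run on Paul's data: the names eqw_bnp1 and q_bnp1 are made up for illustration, and it assumes that -rocfit- accepts a classification variable with 20 or fewer distinct values without -continuous()-.

```stata
* Sketch only: diagnosis and bnp1 are the variables from the post;
* eqw_bnp1 and q_bnp1 are illustrative names for new variables.

* Equal-width grouping as done by rocfit; generate() saves the groups
rocfit diagnosis bnp1, continuous(5) generate(eqw_bnp1)
tabulate eqw_bnp1

* Quantile-based grouping: five groups of roughly equal size
xtile q_bnp1 = bnp1, nquantiles(5)
tabulate q_bnp1

* Fit the binormal model to the quantile groups (no continuous() needed)
rocfit diagnosis q_bnp1
```

Because quantile groups depend only on the rank order of the observations, the same recipe applied to logbnp1 should give the same groups, and so the same fitted ROC area, as for bnp1.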

This email has been cc'd to tech support!

Ronan Conroy
=================================

rconroy@rcsi.ie
Royal College of Surgeons in Ireland
Epidemiology Department,
Beaux Lane House, Dublin 2, Ireland
+353 (0)1 402 2431
+353 (0)87 799 97 95
+353 (0)1 402 2764 (Fax - remember them?)
http://rcsi.academia.edu/RonanConroy

Before printing, think about the environment




*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

