Statalist The Stata Listserver



st: "logistic scores"


From   Marcello Pagano <pagano@hsph.harvard.edu>
From   Nick.Cox@hsphsun2.harvard.edu
To   statalist@hsphsun2.harvard.edu
To   'statalist@hsphsun2.harvard.edu'
Subject   st: "logistic scores"
Subject   "logistic scores"
Date   Sun, 18 Mar 2007 16:55:33 -0400


My questions come at the end.
It's a habit of mine to revisit my favourite books. Looking again at

Mosteller, F. and Tukey, J.W. 1977. Data analysis and regression.
Reading, MA: Addison-Wesley. Chs 5F, 5H, 11F, 11G.

I found a very Tukeyish way of mapping the frequencies of a set of
ordered categories (grades) to numerical scores. Each category is
treated as a slice from a standard logistic distribution, and what is
returned is a centre of gravity for that slice. The recipe is first to
calculate cumulative probabilities p for less than each grade and
cumulative probabilities P for less than or equal to each grade and
then, defining
phi(p) = p ln p + (1 - p) ln (1 - p),
to calculate scores that are

(phi(P) - phi(p)) / (P - p).
(I've not re-created the derivation for myself.)
I call these "logistic scores".
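
A sketch of where the formula comes from, assuming that "centre of gravity" means the mean of a standard logistic variable X, with distribution function F, conditional on X falling in the slice between the quantiles at p and P. Substituting u = F(x), so that x = ln(u / (1 - u)), the mean over the slice is an integral of the logit, and phi is an antiderivative of the logit:

```latex
\[
E\bigl[X \mid p < F(X) \le P\bigr]
  = \frac{1}{P - p} \int_{p}^{P} \ln\frac{u}{1-u}\, du
  = \frac{\phi(P) - \phi(p)}{P - p},
\qquad
\frac{d}{du}\Bigl[u \ln u + (1-u)\ln(1-u)\Bigr] = \ln\frac{u}{1-u}.
\]
```

This is only a check on the recipe as stated above, not a reconstruction of Mosteller and Tukey's own argument.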
The logistic is justified by Mosteller and Tukey as convenient to work
with, and as giving similar results to Gaussian and Cauchy
alternatives anyway. Computational ease is naturally less compelling
in 2007 than it was in 1977, but simple and useful still wins every
time in the absence of better alternatives.
This kind of thing goes nicely in Mata and here
is a function to do it:
mata:

// NJC 16 March 2007
// cf. Mosteller, F. and Tukey, J.W. 1977. Data analysis and regression.
// Reading, MA: Addison-Wesley. Chs 5F, 5H, 11F, 11G.

real colvector logistic_scores(real colvector freq)
{
    real colvector P, p, zero, z
    real scalar k, i

    // P: cumulative proportions for less than or equal to each grade
    k = rows(freq)
    P = freq
    for(i = 2; i <= k; i++) {
        P[i] = P[i - 1] + P[i]
    }

    P = P / P[k]
    zero = J(k, 1, 0)
    // rowmin() against zero maps a missing phi (from ln(0)) to 0
    z = rowmin((zero, P :* ln(P) + (1 :- P) :* ln(1 :- P)))
    // p: cumulative proportions for less than each grade
    p = 0 \ P[1..k-1]
    z = z - rowmin((zero, p :* ln(p) + (1 :- p) :* ln(1 :- p)))
    z = z :/ (P - p)
    return(z)
}

end

A detail that requires care is handling terms like p ln p when p is zero
and its logarithm would thus be indeterminate. It is natural
mathematically to regard the overall product as zero, but you have
to spell that out to Mata. The ? : construct seems less useful here
than comparing directly with a vector of zeros.
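
A minimal sketch of another way to spell that out, assuming a separate helper for phi. The name mt_phi() is hypothetical and is not part of the function above. It relies on ln(0) being missing in Mata and uses editmissing() to set each missing term to 0, its limiting value:

```mata
mata:

// mt_phi() is a hypothetical helper name used for illustration only
real colvector mt_phi(real colvector p)
{
    // ln(0) is missing, so p = 0 makes the first term missing and
    // p = 1 makes the second term missing; editmissing() replaces
    // each missing term by 0
    return(editmissing(p :* ln(p), 0) + editmissing((1 :- p) :* ln(1 :- p), 0))
}

// endpoints give 0; the middle value is ln(0.5)
mt_phi((0 \ 0.5 \ 1))

end
```

One caveat with this variant: editmissing() would also turn a genuinely missing input into 0, so it suits only cumulative proportions known to be nonmissing.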
Anyway, using the example in Mosteller and Tukey (1977, p.106)
of grades A .. E, we type in a vector of frequencies and
get scores:
: freq = (127\497\3243\231\74)

: logistic_scores(freq)
                  1
    +----------------+
  1 |  -4.476586375  |
  2 |   -2.39817005  |
  3 |    .206295676  |
  4 |   3.115523631  |
  5 |   5.023164169  |
    +----------------+
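
One quick check on these numbers, offered as a sketch rather than anything from Mosteller and Tukey: each score is (phi(P) - phi(p)) / (P - p) and each grade has proportion P - p, so the frequency-weighted sum of the scores telescopes to phi(1) - phi(0) = 0. The scores below are typed in from the output above, so the result is zero only up to rounding:

```mata
mata:

freq   = (127 \ 497 \ 3243 \ 231 \ 74)
scores = (-4.476586375 \ -2.39817005 \ .206295676 \ 3.115523631 \ 5.023164169)

// frequency-weighted mean of the scores; should be 0 up to rounding
sum(freq :* scores) / sum(freq)

end
```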

My questions:
1. My impression is that there is a tenuous connection here with what
ordered logit does, but I don't think the latter is quite equivalent,
even indirectly, because it works with cutpoints between grades, not
the grades themselves. Someone well into that and similar models may
care to comment.

By the way, I am pretty clear (perhaps wrongly) that I am not asking
about correspondence analysis here, which I think requires a two-way
table to do its magic. I am only interested for the moment in recipes
for single variables.

2. I have a hard time finding examples of this device of Mosteller and
Tukey ever being used, apart from a couple of instances in educational
statistics. Others may exist, and I may simply be looking in the wrong
places. If anyone, especially on the biostatistical side, recognises
this as a standard tool, or can say what people do instead, please
signal.
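
On question 1, a minimal sketch of one way to see the connection, under an assumption that goes beyond the post: an ordered logit with no covariates, whose fitted cutpoints are the logits of the cumulative proportions P. The scores are again typed in from the output above. Each cutpoint should then fall between the scores of the two grades it separates:

```mata
mata:

freq   = (127 \ 497 \ 3243 \ 231 \ 74)
scores = (-4.476586375 \ -2.39817005 \ .206295676 \ 3.115523631 \ 5.023164169)

// cumulative proportions for less than or equal to each grade
P = runningsum(freq) / sum(freq)

// assumed cutpoints of a covariate-free ordered logit: logit(P) at
// the four boundaries between adjacent grades
cut = logit(P[1..4])

// columns: score below, cutpoint, score above
(scores[1..4], cut, scores[2..5])

end
```

This says nothing about models with covariates, where the cutpoints are estimated jointly with the coefficients.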
Nick
n.j.cox@durham.ac.uk
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/



