Find below some code that reproduces the precision problem
I was talking about yesterday. The function -cdf(x,w)-
computes the empirical cumulative distribution function of
"x", where "w" contains case weights. Note that equal
x-values get the same CDF value. Repeated application of
-cdf()- yields slightly different results unless quad
precision is used computing the running sum of "w". Note
that the problem only arises if variable "w" is double
(opposed to float).
The problem is related to the unstable sort order of "x"
(i.e. -order(x,1)-) if "x" contains ties. Thus, one
solution is to use stable sort order (e.g.
-order((x,range(1,rows(x),1)),(1,2))-). However, this does
not change the fact that the non-quad running sum computed
within -cdf()- is not accurate. I still think it would be
beneficial to have a fast -quadrunsum()- function in Mata.
I imagine that this could be easily done by Stata Corp.
modifying the C code for -quadsum()-.
ben
. clear
. set obs 1000
. gen x = 1 + int((_n-1)/100)
. gen double w = exp(invnormal(uniform()))
. mata:
: function cdf(x, w, quadpr)
> {
> p = order(x, 1)
> if (quadpr) {
> cdf = J(rows(w),1,.)
> for (i=1; i<=rows(cdf); i++) {
> cdf[i] = quadsum(w[p[|1 \ i|]])
> }
> }
> else {
> cdf = w[p]
> for (i=2; i<=rows(cdf); i++) {
> cdf[i] = cdf[i-1] + cdf[i]
> }
> }
> cdf = cdf / cdf[rows(cdf)]
> cdf = cdf[invorder(p)]
> for(i=rows(p)-1; i>=1; i--) {
> if (x[p[i]]==x[p[i+1]]) {
> cdf[p[i]] = cdf[p[i+1]]
> }
> }
> return(cdf)
> }
: x = st_data(.,"x")
: w = st_data(.,"w")
: sum(abs(cdf(x, w, 0)-cdf(x, w, 0)))
6.93889e-15
: sum(abs(cdf(x, w, 1)-cdf(x, w, 1)))
0
: end
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/