[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: calculating the whiskers on a boxplot using -twoway-
Nick Cox <email@example.com>
Re: st: calculating the whiskers on a boxplot using -twoway-
Thu, 21 Mar 2013 11:39:36 +0000
Sheena has unearthed an error in my code given in the article. Thanks
and also sorry about that. The problems are centred on p.484, which is
acccessible to all via a .pdf.
Let us focus on the end of the upper whisker; the case of the lower
whisker follows similar logic, but taking a minimum wherever we take a
maximum, and vice versa.
Given interquartile range IQR, the position (not length) of the end of
the upper whisker is that of the largest value not greater than (upper
quartile +1.5 IQR).
My code works correctly if there are no values above the position of
the end of the upper whisker.
Otherwise, it yields (upper quartile + 1.5 IQR) as the position, but
only exceptionally if there are values equal to that will this
position be correct. Commonly that will be too high.
To see what is going on, the article's example starts with
sysuse lifeexp, clear
egen upq = pctile(lexp), by(region) p(75)
egen loq = pctile(lexp), by(region) p(25)
gen iqr = upq - loq
and that holds good. Then we have the problem lines:
egen upper = max(min(lexp, upq + 1.5 * iqr)), by(region)
egen lower = min(max(lexp, loq - 1.5 * iqr)), by(region)
More careful code might start with Sheena's own code, or alternatively by
egen upper2 = max(lexp / (lexp < upq + 1.5 * iqr)), by(region)
egen lower2 = min(lexp / (lexp > loq - 1.5 * iqr)), by(region)
That division / may look odd if you have not seen it before. But it is
very like the sort of conditional statement often seen
max(argument | condition)
min(argument | condition)
the connection being given in this way. Divide an argument by a
logical expression that evaluates to 1 or 0. The argument is unchanged
on division by 1 but evaluates as missing on division by 0. In any
context where Stata ignores missings, that is what is wanted.
Furthermore: this code avoids the problem Sheena encountered, that
using -if- leaves missings whenever the -if- condition is not
The "divide by zero" trick appears not to be widely known. There is
some publicity within
SJ-11-2 dm0055 . . . . . . . . . . . . . . Speaking Stata: Compared with ...
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q2/11 SJ 11(2):305--314 (no commands)
reviews techniques for relating values to values in other
Back to the box plots: Now see what the difference is in our example:
. tabdisp region, c(upper upper2 lower lower2)
Region | upper upper2 lower lower2
Eur & C.Asia | 79 79 65 65
N.A. | 79 79 58.5 64
S.A. | 75 75 63 67
The results can be the same, but need not be!
Checking Stata's own box plot
. graph box lexp , over(region) yli(75 79 64 65 67)
shows consistency with the revised rule.
Ironically, or otherwise, the same mistake appeared in -stripplot-
(SSC) before it was corrected.
The hubris here (all mine) was being too clever by half, i.e.
combining a Stata function call with an -egen- function call in one
line and not checking hard enough that the logic was sound in all
I'll publish a correction in a future issue of the Stata Journal.
The code in -stripplot- (SSC) remains a valid alternative. With a bit
of work, you can get data display PLUS boxes PLUS means out of
-stripplot-, although whether the result is a mess is part judgement,
part a question about your data.
On Wed, Mar 20, 2013 at 11:40 PM, Sheena Sullivan <firstname.lastname@example.org> wrote:
> I am creating boxplots using -twoway- rather than -graph boxplot- because I
> want to display the mean values and it was suggested in a previous Statalist
> post to follow the (great) guidance provided in N.J. Cox "Speaking Stata:
> Creating and varying box plots." SJ 9(3):478-496.
> However, I'm a little confused about the guidance for calculating the
> whiskers. The Cox article suggests that the length of the whiskers can be
> generated using the following (syntax for upper whisker only is shown):
> . bysort year: egen upper=max(min(logIC50, upq+1.5*iqr)) ///(eq1)
> logIC50 is my variable of interest, upq is its upper quartile, loq is its
> lower quartile and iqr is its inter-quartile range.
> The upper limit, upper, should be the largest value (of an observation
> within the dataset) not greater than upq + 1.5*iqr
> and the lower limit, lower, should be the smallest value not less than loq -
> But they aren't.
> I noticed the problem because if I draw my plots using the -twoway- commands
> in the Cox article (see end of email) the whiskers are longer than when
> drawn using -graph box-. So I did a manual inspection of the data and
> sometimes the values generated by eq1 represent a value in the dataset and
> at other times they do not.
> If I instead use the following syntax:
> . gen upper1=upq+1.5*iqr
> . bysort year: egen upper2=max(logIC50) if logIC50<upper1 ///(eq3)
> I can generate a graph using the -twoway- commands in the Cox article with
> whiskers that look the same as when I draw it using -graph box-
> However, to get my outliers to appear, I need to include a -scatter- and
> using the upper2 and lower2 variables doesn't work because the if statement
> in eq4 means no value is generated for lower2 and upper2 for those
> observations outside the range. A work around is to fill the gaps by
> creating another variable of the mean (or min or max because they are all
> the same) of the other observations within that by group:
> . bysort year: egen upper3=mean(upper2)
> . bysort year: egen lower3=mean(lower2)
> But this getting to be an awful lot of syntax just to add a mean to a
> So, is there a better way to generate the whiskers? And can anyone help me
> understand why eq1 sometimes produces results that do not correspond to a
> value in the dataset?
> Syntax for drawing the boxplot using -twoway-
> . twoway rbar med upq year, pstyle(p1) blc(gs4) bfc(gs9) blw(vthin)
> barw(0.35) || /// upper half of box
>> rbar med loq year, pstyle(p1) blc(gs4) bfc(gs9) blw(vthin)
> barw(0.35) || /// lower half of box
>> rspike upq upper year, pstyle(p1) || /// upper whisker
>> rspike loq lower year, pstyle(p1) || /// lower whisker
>> rcap upper upper year, pstyle(p1) msize(*2) || /// cap the
> upper whisker
>> rcap lower lower year, pstyle(p1) msize(*2) || /// cap the
> lower whisker
>> scatter mean year, ms(+) msize(1) mlw(vvthin) || /// mean
>> scatter logIC50 year if !inrange(logIC50, lower, upper),
> ms(O) mc(gs4) msize(1) mlw(vthin) /// outliers
>> legend(off) ///
>> yla(, ang(h)) ytitle(Log IC50) xtitle("") ///
>> yscale(range(-2 2.5)) ylabel(-2 -1 0 1 2) title(B)
> b1title(Year) ///
>> name(p_B, replace) yline(.53 1.53, lcolor(black)
> lpattern(dash)) ///
> Syntax for -graph box-
> . graph box logIC50, over(year) ///
>> name(pB, replace) yline(.53 1.53, lcolor(black) lpattern(dash))
>> yscale(range(-2 2.5)) ylabel(-2 -1 0 1 2) title(B) b1title(Year)
* For searches and help try: