Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Nick Cox <njcoxstata@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: calculating the whiskers on a boxplot using -twoway- |

Date |
Thu, 21 Mar 2013 11:39:36 +0000 |

Sheena has unearthed an error in my code given in the article. Thanks and also sorry about that. The problems are centred on p.484, which is acccessible to all via a .pdf. Let us focus on the end of the upper whisker; the case of the lower whisker follows similar logic, but taking a minimum wherever we take a maximum, and vice versa. Given interquartile range IQR, the position (not length) of the end of the upper whisker is that of the largest value not greater than (upper quartile +1.5 IQR). My code works correctly if there are no values above the position of the end of the upper whisker. Otherwise, it yields (upper quartile + 1.5 IQR) as the position, but only exceptionally if there are values equal to that will this position be correct. Commonly that will be too high. To see what is going on, the article's example starts with sysuse lifeexp, clear egen upq = pctile(lexp), by(region) p(75) egen loq = pctile(lexp), by(region) p(25) gen iqr = upq - loq and that holds good. Then we have the problem lines: egen upper = max(min(lexp, upq + 1.5 * iqr)), by(region) egen lower = min(max(lexp, loq - 1.5 * iqr)), by(region) More careful code might start with Sheena's own code, or alternatively by egen upper2 = max(lexp / (lexp < upq + 1.5 * iqr)), by(region) egen lower2 = min(lexp / (lexp > loq - 1.5 * iqr)), by(region) That division / may look odd if you have not seen it before. But it is very like the sort of conditional statement often seen max(argument | condition) min(argument | condition) the connection being given in this way. Divide an argument by a logical expression that evaluates to 1 or 0. The argument is unchanged on division by 1 but evaluates as missing on division by 0. In any context where Stata ignores missings, that is what is wanted. Furthermore: this code avoids the problem Sheena encountered, that using -if- leaves missings whenever the -if- condition is not satisfied. The "divide by zero" trick appears not to be widely known. There is some publicity within SJ-11-2 dm0055 . . . . . . . . . . . . . . Speaking Stata: Compared with ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox Q2/11 SJ 11(2):305--314 (no commands) reviews techniques for relating values to values in other observations Back to the box plots: Now see what the difference is in our example: . tabdisp region, c(upper upper2 lower lower2) ------------------------------------------------------------- Region | upper upper2 lower lower2 -------------+----------------------------------------------- Eur & C.Asia | 79 79 65 65 N.A. | 79 79 58.5 64 S.A. | 75 75 63 67 ------------------------------------------------------------- The results can be the same, but need not be! Checking Stata's own box plot . graph box lexp , over(region) yli(75 79 64 65 67) shows consistency with the revised rule. Ironically, or otherwise, the same mistake appeared in -stripplot- (SSC) before it was corrected. The hubris here (all mine) was being too clever by half, i.e. combining a Stata function call with an -egen- function call in one line and not checking hard enough that the logic was sound in all cases. I'll publish a correction in a future issue of the Stata Journal. The code in -stripplot- (SSC) remains a valid alternative. With a bit of work, you can get data display PLUS boxes PLUS means out of -stripplot-, although whether the result is a mess is part judgement, part a question about your data. Nick On Wed, Mar 20, 2013 at 11:40 PM, Sheena Sullivan <sgsullivan@ucla.edu> wrote: > > I am creating boxplots using -twoway- rather than -graph boxplot- because I > want to display the mean values and it was suggested in a previous Statalist > post to follow the (great) guidance provided in N.J. Cox "Speaking Stata: > Creating and varying box plots." SJ 9(3):478-496. > > However, I'm a little confused about the guidance for calculating the > whiskers. The Cox article suggests that the length of the whiskers can be > generated using the following (syntax for upper whisker only is shown): > . bysort year: egen upper=max(min(logIC50, upq+1.5*iqr)) ///(eq1) > > logIC50 is my variable of interest, upq is its upper quartile, loq is its > lower quartile and iqr is its inter-quartile range. > > The upper limit, upper, should be the largest value (of an observation > within the dataset) not greater than upq + 1.5*iqr > and the lower limit, lower, should be the smallest value not less than loq - > 1.5*iqr > > But they aren't. > > I noticed the problem because if I draw my plots using the -twoway- commands > in the Cox article (see end of email) the whiskers are longer than when > drawn using -graph box-. So I did a manual inspection of the data and > sometimes the values generated by eq1 represent a value in the dataset and > at other times they do not. > > If I instead use the following syntax: > . gen upper1=upq+1.5*iqr > ///(eq2) > . bysort year: egen upper2=max(logIC50) if logIC50<upper1 ///(eq3) > > I can generate a graph using the -twoway- commands in the Cox article with > whiskers that look the same as when I draw it using -graph box- > > However, to get my outliers to appear, I need to include a -scatter- and > using the upper2 and lower2 variables doesn't work because the if statement > in eq4 means no value is generated for lower2 and upper2 for those > observations outside the range. A work around is to fill the gaps by > creating another variable of the mean (or min or max because they are all > the same) of the other observations within that by group: > > . bysort year: egen upper3=mean(upper2) > . bysort year: egen lower3=mean(lower2) > > But this getting to be an awful lot of syntax just to add a mean to a > boxplot! > > So, is there a better way to generate the whiskers? And can anyone help me > understand why eq1 sometimes produces results that do not correspond to a > value in the dataset? > > Thanks, > Sheena > > > Syntax for drawing the boxplot using -twoway- > . twoway rbar med upq year, pstyle(p1) blc(gs4) bfc(gs9) blw(vthin) > barw(0.35) || /// upper half of box >> rbar med loq year, pstyle(p1) blc(gs4) bfc(gs9) blw(vthin) > barw(0.35) || /// lower half of box >> rspike upq upper year, pstyle(p1) || /// upper whisker >> rspike loq lower year, pstyle(p1) || /// lower whisker >> rcap upper upper year, pstyle(p1) msize(*2) || /// cap the > upper whisker >> rcap lower lower year, pstyle(p1) msize(*2) || /// cap the > lower whisker >> scatter mean year, ms(+) msize(1) mlw(vvthin) || /// mean >> scatter logIC50 year if !inrange(logIC50, lower, upper), > ms(O) mc(gs4) msize(1) mlw(vthin) /// outliers >> legend(off) /// >> yla(, ang(h)) ytitle(Log IC50) xtitle("") /// >> yscale(range(-2 2.5)) ylabel(-2 -1 0 1 2) title(B) > b1title(Year) /// >> name(p_B, replace) yline(.53 1.53, lcolor(black) > lpattern(dash)) /// >> scheme(sj) > > Syntax for -graph box- > . graph box logIC50, over(year) /// >> name(pB, replace) yline(.53 1.53, lcolor(black) lpattern(dash)) > /// >> yscale(range(-2 2.5)) ylabel(-2 -1 0 1 2) title(B) b1title(Year) * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/

**References**:**st: calculating the whiskers on a boxplot using -twoway-***From:*"Sheena Sullivan" <sgsullivan@ucla.edu>

- Prev by Date:
**R: st: Interpretation of Two-sample t test with equal variances?** - Next by Date:
**Re: st: While/if for individual observations** - Previous by thread:
**st: calculating the whiskers on a boxplot using -twoway-** - Next by thread:
**st: Testing for serial correlation in panel data when using IV methods** - Index(es):