
# Re: st: Scale break in box plot

 From Nick Cox To "statalist@hsphsun2.harvard.edu" Subject Re: st: Scale break in box plot Date Tue, 17 Dec 2013 11:35:35 +0000

```
I think I follow David on fundamentals here.

Scale breaks are at best the lesser of two evils and the FAQ cited at
http://www.stata.com/support/faqs/graphics/scale-breaks/ does give
more than a hint that there are other ways to tackle the perceived
problem.

I would say that 90% of the scale breaks I see don't serve readers
well in the sense that it would be much better to use a transformed
scale.

A particular detail with box plots as drawn by Stata is that not only
the median and quartiles but also which values are more than 1.5 IQR
from the nearer quartile depend on the calculation of summary values.
So that calculation needs to be done before any outliers are omitted
or changed for graphical purposes.

I have taught box plots, including points being plotted separately
when at least 1.5 IQR from the nearer quartile, for 36 years or so
(can't match David). My view is that students will see box plots and
deserve an explanation of what they are.

But I much prefer other strategies.

One is quantile plots showing all the ordered values. -qplot- from SJ
is quite flexible. A singular merit of quantiles is that they mesh
well with transformations likely to be of interest in this context.

Another is to keep the box plot idea but to define whiskers as
extending to paired quantiles in the tails, e.g. the 5% and 95% or the
1% and 99% points. That version of the box plot goes back at least to 1985
and is easier to explain in my view. Naturally the aim is _not_ quite
the same as with any rule of thumb to identify possible outliers (or
extreme points labelled using some other term) but in my experience it
is as or more useful. -stripplot- from SSC supports this variant.

Nick
njcoxstata@gmail.com

On 17 December 2013 03:37, David Hoaglin <dchoaglin@gmail.com> wrote:
> Dear Rakesh,
>
> If you would like to insert a break in the scale, my reaction (based
> on more than 40 years of experience with boxplots) is that the data
> may be suggesting that you do something different.
>
> Observations that are plotted individually at the ends of a boxplot
> are not necessarily "outliers."  In samples of well-behaved data
> (i.e., from a normal distribution), the standard definition of the
> boxplot causes observations to be plotted individually more often than
> if they were truly outliers.  Hoaglin et al. (1986) and Hoaglin and
> Iglewicz (1987) give some further information.  In Exploratory Data
> Analysis such observations are simply referred to as "outside."  The
> idea is to give them special attention, to see whether some reason
> accounts for their being "outside."
>
> It would be helpful to know more about the data on which your boxplot is based:
> What is the variable?
> How many observations?
> How many observations are "outside" at the low end?
> How many observations are "outside" at the high end?
>
> If, for example, all the "outside" observations are at the high end,
> and they seem to be part of a skewed pattern, you may want to consider
> applying a transformation, such as the logarithm or the square root.
>
> I hope this information is helpful.
>
> David Hoaglin
>
> Hoaglin DC, Iglewicz B, Tukey JW (1986).  Performance of some
> resistant rules for outlier labeling.  Journal of the American
> Statistical Association 81:991-999.
>
> Hoaglin DC, Iglewicz B (1987).  Fine-tuning some resistant rules for
> outlier labeling.  Journal of the American Statistical Association
> 82:1147-1149.
>
>
> On Mon, Dec 16, 2013 at 2:24 PM, Rakesh Ghosh <rakeshgh@usc.edu> wrote:
>>>>> Dear Stata list members
>>>>>
>>>>> I have a box plot with many outliers. I would like to insert a scale break to increase the box size and reduce the span of the outliers. I tried both of the options in this Stata scale break link (http://www.stata.com/support/faqs/graphics/scale-breaks/). While inserting a line will not work in my case because I have no break in data points, the second option does work when I create a box plot and a scatter plot and then combine them together.
>>>>
>>>>> -graph box trafficdensity if trafficdensity>0 & trafficdensity<=125, over(county)-
>>>>>
>>>>> However, the median, p25 and p75 are underestimated because I restrict the upper limit of the box plot, so it is not good for me. I will have to restrict the upper limit otherwise I will not get the plot of desirable size. Is there any way you can think how I can insert a break on the y axis?
>>>>>
>>>>> Thanks for any suggestion.
>>>>>
>>>>> Rakesh Ghosh
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

```
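Nick's point about calculating summary values before anything is omitted, and the advice from both replies to prefer a transformed scale, might be sketched as follows. This is only an illustration under stated assumptions: it uses Stata's shipped `auto` data, with `price` and `foreign` standing in for the poster's `trafficdensity` and `county`, and the variable names `p25`, `p50`, `p75`, `iqr`, `outside` and `log_price` are made up for the example.

```stata
* Stand-in data: price (right-skewed) plays the role of trafficdensity,
* foreign plays the role of county
sysuse auto, clear

* Quartiles from ALL observations in each group, before any trimming
egen p25 = pctile(price), p(25) by(foreign)
egen p50 = pctile(price), p(50) by(foreign)
egen p75 = pctile(price), p(75) by(foreign)
generate iqr = p75 - p25

* Flag values more than 1.5 IQR beyond the nearer quartile
generate byte outside = (price > p75 + 1.5*iqr | price < p25 - 1.5*iqr) if !missing(price)
tabulate foreign outside

* A transformed scale instead of a scale break; note that graph box now
* applies its 1.5 IQR rule to the logged values, not the raw ones
generate log_price = log10(price)
graph box log_price, over(foreign) ytitle("log10(price)")
```

Because the quartiles are stored as variables first, any later restriction for display (such as the `<=125` condition in the original command) would leave them unchanged.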
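The percentile-whisker variant Nick describes could look something like the sketch below. It again assumes the `auto` data as a stand-in, and `p05` and `p95` are hypothetical variable names. The `stripplot` call reflects my reading of that command's help file (`box`, `pctile()` and `vertical` options), so check `help stripplot` after installing it from SSC. The call is guarded so the rest runs even if `stripplot` is not installed.

```stata
* Stand-in data, as an assumption: price for trafficdensity, foreign for county
sysuse auto, clear

* Candidate whisker ends: 5% and 95% points within each group
egen p05 = pctile(price), p(5) by(foreign)
egen p95 = pctile(price), p(95) by(foreign)
tabstat price, by(foreign) statistics(p5 p25 p50 p75 p95)

* Draw the variant only if stripplot (SSC) is available
capture which stripplot
if _rc == 0 {
    stripplot price, over(foreign) box pctile(5) vertical
}
else {
    display as text "stripplot not found; try: ssc install stripplot"
}
```

With whiskers defined this way, roughly 5% of the values in each tail lie beyond the whisker ends by construction, which is why the aim differs from a rule of thumb for flagging possible outliers.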