
st: RE: graphs, outliers, labels

From   "Nick Cox"
Subject   st: RE: graphs, outliers, labels
Date   Sun, 16 May 2004 18:59:03 +0100

Daphna Bassok started a thread by asking various 
questions on box plots. Here I edit slightly, also 
numbering the questions, DB1 ... DB4.

DB1. Is there any way I can get labels on my box plot graphs?

Guilherme Silva answered 

>> Supposing the variable of interest is named "xvar", 
>> the variable of identification - "case",  and that you 
>> have seen just 4 outliers in a previous screening ... then to 
>> identify outliers (outsides in the box plot) you may type: 
>> . graph box xvar, medtype(line) mark(1,mlabel(case)) ... 

and he pointed out that the rule is a separate -mark(,)- option 
for each y variable. 

DB2. I would like to see the values of the median, 25th percentile, 
75th percentile, etc.

I answered 

>> Use -summarize, detail- to see the median and quartiles. 
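
A minimal sketch of how those values can be picked up afterwards,
assuming the auto data shipped with Stata and -mpg- as the variable
(both chosen only for illustration): -summarize, detail- leaves the
quartiles behind as r(p25), r(p50) and r(p75).

```stata
* illustration only: the auto data and mpg are assumptions,
* not the data from the original question
sysuse auto, clear
quietly summarize mpg, detail

* quartiles and median are left behind as r-class results
display "25th percentile: " r(p25)
display "median:          " r(p50)
display "75th percentile: " r(p75)
display "iqr:             " (r(p75) - r(p25))
```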

DB3. I want to see/know the values of the top and bottom cut off lines.  
How do I find these values?

I answered 

>> The adjacent values are the extreme data points within 
>> 1.5 iqr of the nearer quartile. I think you might have
>> to re-create those for yourself, as -graph box- doesn't 
>> seem to leave them in memory. Nor should it really, 
>> as there could be lots of them. 
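
For a single variable, the re-creation could look something like
the sketch below. This is not the -adjacent- program, just the
definition applied by hand; it assumes the auto data and -mpg-, 
takes the quartiles from -summarize, detail-, and uses scalar 
names (loFence, upFence) invented here.

```stata
* illustration only: auto data, mpg and the scalar names are assumptions
sysuse auto, clear
quietly summarize mpg, detail

* fences at 1.5 iqr beyond each quartile
scalar loFence = r(p25) - 1.5 * (r(p75) - r(p25))
scalar upFence = r(p75) + 1.5 * (r(p75) - r(p25))

* adjacent values: the most extreme data points inside the fences
* (missing values fail the upper condition, so they drop out)
quietly summarize mpg if mpg >= scalar(loFence) & mpg <= scalar(upFence)
display "lower adjacent value: " r(min)
display "upper adjacent value: " r(max)
```

Whether this agrees exactly with what -graph box- draws depends on
both using the same quartile definition, so it is worth checking 
against the graph.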

I also posted the code of a program -adjacent- to 
calculate these, and commented 

>> I seem to get the same values as do the box
>> plot routines. Note that adjacent values 
>> need not be unique. More testing advisable. 

Ric Uslaner wrote 

>> I copied -adjacent- into the do file editor and 
>> tried to run it ... and this is what I got:
>> you must specify the lname() option
>> r(198);

whereas Clive Nicholas reported no problem. 
He suggested -update q-. 

The message Ric was seeing was coming from 
official Stata -egen, group()-, which is 
called by -adjacent-. I am not 
clear why he's getting it. As far as I can 
see it shouldn't happen. If it persists, 
do flag that privately. 

Daphna also asked privately, and I take
the liberty of echoing the question 
here as others may be interested: 

>> I am not sure I follow why the lower 
>> and upper adjacent values are not 
>> unique for a given population

What I meant was that there could 
be ties for an adjacent value. Naturally, 
there could also be ties even for the
most extreme outliers. 
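
One way to check for such ties, sketched by hand under the same 
kind of assumptions (auto data, -mpg-, quartiles from -summarize, 
detail-, scalar names invented for the example), is to count how 
many observations share the upper adjacent value:

```stata
* illustration only: auto data, mpg and the scalar names are assumptions
sysuse auto, clear
quietly summarize mpg, detail
scalar upFence = r(p75) + 1.5 * (r(p75) - r(p25))

* upper adjacent value: largest data point not beyond the upper fence
quietly summarize mpg if mpg <= scalar(upFence)
scalar upAdj = r(max)

* a count above 1 means the upper adjacent value is tied
count if mpg == scalar(upAdj)
```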

I have now extended -adjacent- so that 
it supports multiple variables in 
the varlist and also frequency and 
analytic weights. I'll send the files
to Kit Baum for posting on SSC. 

DB4. I am interested in analyzing the outliers or outside values, 
but I am not able to see what the specific lower and upper cut off 
values are.

Another program which may be of interest here 
is -extremes- from SSC. With the -iqr- option, 
or with -iqr(1.5)- you can see which observations
are more than 1.5 iqr from the nearer quartile: 

. extremes mpg, iqr 

  | obs:    iqr:   mpg |
  |  59.   2.286    41 |

What's often more useful is to specify 
other variables which are included
in the listing as context: 

. extremes mpg make, iqr 

  | obs:    iqr:   mpg   make      |
  |  59.   2.286    41   VW Diesel |
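
For anyone without -extremes- installed, a rough by-hand 
equivalent is sketched below. It assumes, without having checked 
the -extremes- code, that the iqr column is the distance beyond 
the nearer quartile in units of iqr; that reading would give 
(41 - 25) / 7 = 2.286 if the quartiles of mpg are 18 and 25. 
The variable name iqrdist and the scalars Q1 and Q3 are made up 
for the example.

```stata
* illustration only: a hand-rolled stand-in, not the -extremes- code
sysuse auto, clear
quietly summarize mpg, detail
scalar Q1 = r(p25)
scalar Q3 = r(p75)

* distance beyond the nearer quartile, in iqr units
generate double iqrdist = (mpg - scalar(Q3)) / (scalar(Q3) - scalar(Q1)) ///
    if mpg > scalar(Q3) & !missing(mpg)
replace iqrdist = (scalar(Q1) - mpg) / (scalar(Q3) - scalar(Q1)) ///
    if mpg < scalar(Q1)

* observations more than 1.5 iqr from the nearer quartile, with context
list make mpg iqrdist if iqrdist > 1.5 & !missing(iqrdist)
```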

Just added to -extremes-, but not yet
in the version on SSC, is support for -by:-. 

. bysort foreign : extremes mpg make, iqr 

-> foreign = Domestic

  | obs:    iqr:   mpg   make        |
  |  23.   2.182    34   Plym. Champ |

-> foreign = Foreign

  | obs:    iqr:   mpg   make      |
  |  71.   1.857    41   VW Diesel |
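
Until a version with -by:- is available, something similar can 
be sketched with official -egen-, computing the quartiles within 
groups. As before this is a stand-in rather than the -extremes- 
code: the variable names q1, q3 and iqrdist are invented here, 
only the high side is shown, and agreement with -extremes- 
depends on both using the same quartile definition.

```stata
* illustration only: by-group quartiles via -egen-, not -extremes- itself
sysuse auto, clear
egen double q1 = pctile(mpg), p(25) by(foreign)
egen double q3 = pctile(mpg), p(75) by(foreign)

* distance above the upper quartile, in units of the group's iqr
generate double iqrdist = (mpg - q3) / (q3 - q1) ///
    if mpg > q3 & !missing(mpg)

* high outside values within each group
bysort foreign: list make mpg iqrdist if iqrdist > 1.5 & !missing(iqrdist)
```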

