Note: This FAQ is relevant for users of releases prior to Stata 8.
How do I connect points only within groups?
Title: Connecting points within groups
Author: Nicholas J. Cox, Durham University, UK
Date: March 1998; revised October 1999
To connect points with straight lines on a two-way graph, specify
graph’s connect() option:
. graph y x, connect(l)
The l inside the parentheses is called the connect style, and
connect(l) is probably the most common. It connects all the points
shown on the graph, joining them according to the current sort order of the
data. The connect() option may be abbreviated all the way down to
c(), so most people would type
. graph y x, c(l)
Understand that c(l) connects the points in the order of the data.
If the data are time series in time order, this gives a line graph showing
successive changes, say, from year to year. If the data are in some other
order, c(l) may be useful for showing trajectories in the space
defined by any two variables.
Another connect style is c(L): it joins two successive points if and only
if the value of the x variable (on the horizontal axis) for the second point
is not less than that for the first. That is, ascending order here includes
ties, in which the x values are equal. This may sound like a rather special
case, but c(L) can be very useful for ensuring that points are joined only
within groups.
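The rule can be checked without drawing anything. The following is a minimal sketch, assuming only the rule as stated above: it marks which successive pairs of observations c(L) would join. The data values and the variable name joined are hypothetical.

```stata
* Toy data (hypothetical values), deliberately not sorted on x
clear
input x y
1 10
2 12
2 15
1 11
3 14
end

* joined is 1 when a line would run from the previous observation to this
* one under the c(L) rule as described above: x has not decreased
generate byte joined = (x >= x[_n-1]) if _n > 1
list
```

Here joined is missing for the first observation (there is no previous point), 1 for the second, third, and fifth, and 0 for the fourth, where x falls from 2 to 1.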
The general recipe for connecting points within groups consists of three
steps:
- Make sure the groups are identifiable to Stata. If a categorical
variable defining the groups does not already exist, use generate or
egen to produce it.
- Sort the observations into the order needed, using either
sort or gsort. The principle is often that
the last shall be first, and the first shall be last.
- Use graph (or some other graph command) with c(L).
Say you wish to plot y versus x, connecting the points, by group. That is,
you want (x,y) plotted and the points connected for group 1, (x,y) plotted
and the points connected for group 2, etc., but you do not want the points
in one group connected to the points in another. Were you to type
. graph y x, c(l)
you would obtain a graph with all the points connected. Were you to type
. graph y x, c(l) by(group)
you would obtain nearly what you want, but you would obtain separate graphs
for each group. Let's assume you want all the points in one graph. Type
. sort group x
. graph y x, c(L)
Below we explain why this often works and why it sometimes does not, and we
show why
. egen xmin = min(x), by(group)
. egen xmax = max(x), by(group)
. gsort -xmin -xmax group x
. graph y x, c(L)
is a better solution.
We also show how to draw other graphs with distinct line segments.
Let’s look at some examples, modeled closely on questions that arose
on Statalist.
Connecting points within groups
Question:
I have panel data on the weights of several hundred babies at different
ages. I want a plot in which each individual is represented by a
distinct connected line.
Here are my data
age (weeks)   baby_id   weight (kg)
     2          123         20
     3          123         24
     4          123         28
    ...         ...         ...
     2          654         19
     3          654         23
     4          654         27
    ...         ...         ...
(yes, these are hefty babies).
To reiterate, I want to plot weight against age, by baby_id, connecting
the points for each baby.
Answer:
If you are lucky, you will need to type no more than
. sort baby_id age
. graph weight age, c(L)
We are putting the data in order of babies and, within each baby, in order
of age. Then we are connecting the points from left to right. This will work
if the youngest age recorded for each baby is less than the oldest age
recorded for the baby that precedes it in the sorted data, because
- Stata will start with the first baby, plot its points, and connect
them (the points will be connected because they proceed from left
to right as we sorted the points in order of age within baby).
- Stata will then proceed to the second baby, repeating the process.
Stata will not connect the last point of the first baby to the
first point of the second baby as long as the first age of the second
baby is less than the last age of the first because that would
violate the left-to-right rule.
In real data, however, there might be some problems if babies drop out of or
enter a study in the middle. Suppose that, after sorting our data, the last
observation on one baby (baby 888) and the first observation on the next
(baby 889) are
age (weeks)   baby_id   weight (kg)
    21          888         45
    24          889         34
Baby 889 is older (24 weeks) at its youngest than the preceding baby at its
oldest (21 weeks). Stata will draw a line connecting these two babies
because variable age is increasing.
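Whether the simple sort is enough can be checked before drawing the graph. This is a minimal sketch, assuming the c(L) rule as described in this FAQ; the data are just the rows quoted so far, with babies 888 and 889 reduced to the single observations shown.

```stata
* Rows quoted in this FAQ; babies 888 and 889 reduced to one row each
clear
input age baby_id weight
 2 123 20
 3 123 24
 4 123 28
 2 654 19
 3 654 23
 4 654 27
21 888 45
24 889 34
end

sort baby_id age

* A boundary is a problem when a new baby starts and age has not decreased,
* because c(L) would then join the two babies
count if _n > 1 & baby_id != baby_id[_n-1] & age >= age[_n-1]
```

With these rows the count is 2: the step from baby 654 to baby 888 and the step from baby 888 to baby 889. A count of 0 would mean that the simple sort is safe.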
The way around this is to order the babies so that this does not happen.
Let’s call age0 the youngest age at which each baby is observed.
Then we want to order the babies so that the babies with the largest values
of age0 occur first in the data. Doing that will ensure that, as we proceed
from one baby to the next, age decreases, which in turn will prevent
c(L) from connecting points that belong to different babies.
Obtaining the earliest age (minimum value of x) for each baby is easy:
. egen age0 = min(age), by(baby_id)
Putting the data in order so that the babies with the largest age0 occur
first is also easy:
. gsort -age0 baby_id age
gsort is a variation on Stata’s sort command; it allows
us to put the data in ascending or descending order. We specify
-age0 to obtain descending order on age0.
Now we are ready to draw our graph. Putting this all together, we type
. egen age0 = min(age), by(baby_id)
. gsort -age0 baby_id age
. graph weight age, c(L)
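In other cases, we might need to sort even more carefully using both the
minimum and the maximum age recorded for each baby. For example, if two
babies share the same youngest age and one of them was observed only once,
that single point must come after the other baby's points; otherwise, the
tie in age would let c(L) join it to the next baby's first point.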
. egen age0 = min(age), by(baby_id)
. egen agex = max(age), by(baby_id)
. gsort -age0 -agex baby_id age
. graph weight age, c(L)
For the general problem, if we want to graph y versus x, connecting the
points within the group, the solution is
. egen xmin = min(x), by(group)
. egen xmax = max(x), by(group)
. gsort -xmin -xmax group x
. graph y x, c(L)
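After sorting, it is possible to confirm that no group boundary would be joined. This is a minimal sketch with hypothetical toy data, assuming the c(L) rule as described in this FAQ; the assert line is an addition for checking and is not part of the recipe itself.

```stata
* Hypothetical toy data: groups 3 and 4 have a single observation each
clear
input group x y
1  2 20
1  3 24
1  4 28
2  2 19
2  3 23
2  4 27
3 21 45
4 24 34
end

egen xmin = min(x), by(group)
egen xmax = max(x), by(group)
gsort -xmin -xmax group x

* Stops with an error if x fails to decrease at any group boundary,
* that is, if c(L) would join two different groups
assert x < x[_n-1] if _n > 1 & group != group[_n-1]
```

If the assert passes silently, as it does here, the data are in a safe order for c(L).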
Unfortunately, even this code is not bulletproof in the following
situation, illustrated yet again by baby weights.
- Several babies appear in the study just once, all at the same age.
- For each of those babies, the minimum and the maximum of the x variable
are therefore the same value. Sorting makes no difference: however we
shuffle those babies, their data points still share one x value, and
because c(L) treats equal x values as ascending, it joins them.
The way to fix this is to add some random noise to the x variable so
that the ages differ. First find the minimum gap between distinct x values;
in our example with time in weeks, suppose it is one week. We want to
add random noise that is small relative to that. uniform( )
generates uniform random numbers between 0 and 1, so we try
. gen x2 = x + 0.01 * (uniform( ) - 0.5)
. egen xmin = min(x2), by(group)
. egen xmax = max(x2), by(group)
. gsort -xmin -xmax group x2
. graph y x2, c(L)
The random noise will be at most 0.005 and at least -0.005. For
presentation purposes, we would also need to work on the axis titles,
because the variable plotted is now x2 rather than x.
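The effect of the noise can be seen on a small example. This is a minimal sketch with hypothetical data in which three groups each have one observation at the same x value; the seed is arbitrary, and the counts use the c(L) rule as described in this FAQ.

```stata
* Hypothetical data: three groups, one observation each, same x
clear
set seed 12345
input group x y
1 10 5
2 10 6
3 10 7
end

* Before adding noise: equal x counts as ascending, so both boundaries
* would be joined
count if _n > 1 & group != group[_n-1] & x >= x[_n-1]

generate x2 = x + 0.01 * (uniform() - 0.5)
egen xmin = min(x2), by(group)
egen xmax = max(x2), by(group)
gsort -xmin -xmax group x2

* After adding noise: no boundary is joined unless two of the new values
* happen to coincide exactly
count if _n > 1 & group != group[_n-1] & x2 >= x2[_n-1]
```

The first count is 2. The second should be 0 except in the unlikely event of an exact tie among the values of x2.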
Skipping gaps
Question:
I have time data with gaps. Data should have been measured regularly, but
there are some observations with missing values (somebody was sick, we
lost the record, whatever).
time   var1   var2
  1     3      4
  2     4      5
  3     5      6
  4     .      .
  5     6      7
  6     6      6
  7     .      .
  8     5      5
  9     4.5    5.2
graph var1 var2 time, c(l) draws lines boldly jumping across the
gaps. Instead, I want an honest graph showing breaks.
Answer:
If we could define a group variable that tied together contiguous
observations, this would be the same problem as the one we just handled.
Here is how we make that variable:
Define the groups. We can set up a counter
. gen block = sum(var1 == .)
var1 == . is 1 if var1 is missing and 0 if var1 is present. With
generate, sum() returns the running sum of those values, so we get
time   var1   var2   block
  1     3      4       0
  2     4      5       0
  3     5      6       0
  4     .      .       1
  5     6      7       1
  6     6      6       1
  7     .      .       2
  8     5      5       2
  9     4.5    5.2     2
Every time we find a new missing value, the counter jumps by 1. But notice
that for this to work properly, it is essential that the data be sorted by
time. Our data were already sorted, but to be safe, we would type
. sort time
. gen block = sum(var1 == .)
Now that we have a group variable, we follow our generic solution. The
blocks here cannot overlap in time, so sorting on the minimum alone is
enough:
. egen xmin = min(x), by(group)
. gsort -xmin group x
. graph y x, c(L)
In this case, group=block, x=time, and we have two y variables, var1 and
var2. Substituting, we would type
. egen tmin = min(time), by(block)
. gsort -tmin block time
. graph var1 var2 time, c(L)
So, the complete solution is
. sort time
. gen block = sum(var1 == .)
. egen tmin = min(time), by(block)
. gsort -tmin block time
. graph var1 var2 time, c(L)
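The data steps of the complete solution can be tried on the observations from the question. This is a minimal sketch that stops short of the graph command; the assert line is an addition for checking, based on the c(L) rule as described in this FAQ.

```stata
* Observations from the question
clear
input time var1 var2
1 3   4
2 4   5
3 5   6
4 .   .
5 6   7
6 6   6
7 .   .
8 5   5
9 4.5 5.2
end

sort time
generate block = sum(var1 == .)
egen tmin = min(time), by(block)
gsort -tmin block time

* Stops with an error if a line could run from one block into the next
assert time < time[_n-1] if _n > 1 & block != block[_n-1]
list time var1 var2 block
```

The listing shows block 2 first (times 7 to 9), then block 1 (times 4 to 6), then block 0 (times 1 to 3), so time falls at every block boundary.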