Title: Connecting points within groups

Author: Nicholas J. Cox, Durham University, UK

To connect points with straight lines on a two-way graph, specify
**graph**’s **connect()** option:
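As a minimal sketch, assuming two numeric variables with the placeholder names y and x and the older **graph** syntax used throughout this FAQ (later Stata releases keep that old-style command available as **graph7**), such a command might look like this:

```stata
* y and x are placeholder variable names
* older-style syntax: plot y against x, joining points in current data order
graph y x, connect(l)
```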

The **l** inside the parentheses is called the connect style, and
**connect(l)** is probably the most common. It connects all the points
shown on the graph, joining them according to the current sort order of the
data. The **connect()** option may be abbreviated all the way down to
**c()**, so most people would simply type **c(l)**.

Understand that **c(l)** connects the points in the order of the data.
If the data are time series in time order, this gives a line graph showing
successive changes, say, from year to year. If the data are in some other
order, **c(l)** may be useful for showing trajectories in the space
defined by any two variables.

Another connect style is **c(L)**: it joins two successive points if and only if
their values of the x variable (on the horizontal axis) are in
ascending order. To be precise, this does include cases where successive values of the
x variable are equal. This may sound like a rather special case, but
**c(L)** can be very useful for ensuring that points are joined only within
groups.
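To see the rule at work, here is a small made-up dataset (the values and the names x and y are purely illustrative), again in the older **graph** syntax. Following the rule just stated, we would expect the first three points to be joined to one another, no line from the third point to the fourth (x falls from 3 to 1), and the last two points to be joined.

```stata
* illustrative data only: x rises, drops back, then rises again
clear
input x y
1 1
2 2
3 3
1 4
2 5
end
* L joins successive points only while x is ascending
graph y x, c(L)
```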

The general recipe for connecting points within groups consists of three steps:

- Make sure the groups are identifiable to Stata. If a categorical variable defining the groups does not already exist, use **generate** or **egen** to produce it.
- Sort the observations into the order needed, using either **sort** or **gsort**. The principle is often that the last shall be first, and the first shall be last.
- Use **graph** (or some other graph command) with **c(L)**.

Say you wish to plot y versus x, connecting the points, by group. That is, you want (x,y) plotted and the points connected for group 1, (x,y) plotted and the points connected for group 2, etc., but you do not want the points in one group connected to the points in another. Were you to graph y against x with **c(l)** alone, you would obtain a graph with all the points connected. Were you to add the **by()** option, naming the group variable, you would obtain nearly what you want, but you would obtain separate graphs for each group.

Let's assume you want all the points in one graph. The simple approach is to sort the data by group and by x within group, and then graph with **c(L)**. Below we explain why this often works and why it sometimes does not, and we show why first ordering the groups so that those with the largest minimum x come first is a better solution.
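The first three approaches discussed in this passage can be sketched as follows. This assumes variables named y, x, and group (placeholders, as in the text) and the older **graph** syntax; the sort before **by()** is included on the assumption that the data must be sorted by the by-variable.

```stata
* 1. all points connected, in current data order
graph y x, c(l)

* 2. points connected within group, but one graph per group
sort group x
graph y x, c(l) by(group)

* 3. one graph; points usually, but not always, joined only within group
sort group x
graph y x, c(L)
```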

We also show how to draw other graphs with distinct line segments.

Let’s look at some examples, modeled closely on questions that arose on Statalist.

I have panel data on the weights of several hundred babies at different ages. I want a plot in which each individual is represented by a distinct connected line.

Here are my data:

| age (weeks) | baby_id | weight (kg) |
|---|---|---|
| 2 | 123 | 20 |
| 3 | 123 | 24 |
| 4 | 123 | 28 |
| ... | ... | ... |
| 2 | 654 | 19 |
| 3 | 654 | 23 |
| 4 | 654 | 27 |
| ... | ... | ... |

(yes, these are hefty babies).

To reiterate, I want to plot weight against age, by baby_id, connecting the points for each baby.

If you are lucky, you will need to do no more than sort the data by baby_id and age and then graph weight against age with **c(L)**.

We are putting the data in order of babies and, within each baby, the age. Then we are connecting the points from left to right. This will work if the youngest age of each baby is younger than the oldest age of the baby that precedes it because

- Stata will start with the first baby, plot its points, and connect them (the points will be connected because they proceed from left to right as we sorted the points in order of age within baby).
- Stata will then proceed to the second baby, repeating the process. Stata will not connect the last point of the first baby to the first point of the second baby as long as the first age of the second baby is less than the last age of the first because that would violate the left-to-right rule.
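The lucky case can be sketched as follows, using the variable names from the question (age, baby_id, weight) and the older **graph** syntax. Treat it as an illustration rather than a guaranteed fix.

```stata
* order by baby and, within baby, by age
sort baby_id age
* join successive points only where age is ascending
graph weight age, c(L)
```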

In real data, however, there might be some problems if babies drop out of or enter a study in the middle. Suppose that, after sorting our data, the last observation on one baby (baby 888) and the first observation on the next (baby 889) are

| age (weeks) | baby_id | weight (kg) |
|---|---|---|
| 21 | 888 | 45 |
| 24 | 889 | 34 |

Baby 889 is older (24 weeks) at its youngest than the preceding baby at its oldest (21 weeks). Stata will draw a line connecting these two babies because variable age is increasing.

The way around this is to order the babies so that this does not happen.
Let’s call age0 the youngest age at which each baby is observed.
Then we want to order the babies so that the babies with the largest values
of age0 occur first in the data. Doing that will ensure that, when we proceed
from one baby to the next, age decreases, which in turn will prevent
**c(L)** from connecting the points between.

Obtaining the earliest age (the minimum value of x) for each baby is easy with **egen**. Putting the data in order so that the oldest babies occur first is just as easy with **gsort**.

**gsort** is a variation on Stata’s **sort** command; it allows
us to put the data in ascending or descending order. We specify
**-age0** to obtain descending order on age0.

Now we are ready to draw our graph. Putting this all together, we create age0 with **egen**, sort with **gsort**, and graph weight against age with **c(L)**.
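A sketch of the full sequence for the baby data, assuming the variable names from the question and the older **graph** syntax (age0 is the name the text gives to each baby's youngest observed age):

```stata
* youngest observed age for each baby
egen age0 = min(age), by(baby_id)
* babies with the largest age0 first; ascending age within baby
gsort -age0 baby_id age
* join successive points only where age is ascending
graph weight age, c(L)
```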

In other cases, we might need to sort even more carefully using both the minimum and the maximum age recorded for each baby.

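For the general problem, if we want to graph y versus x, connecting the points within the group, the solution is the same: find the minimum of x within each group, sort the groups in descending order of that minimum (and by x within group), and graph with **c(L)**.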
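In generic form, this might be sketched as below. The names y, x, and group are placeholders for your own variables, and x0 is a hypothetical name for the new variable holding each group's minimum x.

```stata
* minimum of x within each group
egen x0 = min(x), by(group)
* groups with the largest minimum first; ascending x within group
gsort -x0 group x
* one graph, points joined within groups
graph y x, c(L)
```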

Unfortunately, even this code is not bulletproof if we have the following situation, illustrated yet again by baby weights.

- Some babies are observed just once, and all of them at the same age.
- Therefore, for those babies the minimum and maximum values of the x variable all coincide, and sorting makes no difference: however we shuffle those babies, the corresponding data points still have the same x value.

The way to fix this is to add some random noise to the x variable so that the ages differ. First find the minimum gap between x values; in our example with time in weeks, suppose it is one week. We want to add random noise that is small relative to that. **uniform()** generates uniform random numbers between 0 and 1, so we try adding 0.01 times (**uniform()** - 0.5) to age.

The random noise will be at most 0.005 and at least -0.005. For presentation purposes, we need to work on the axis titles as well.
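One way this might look is sketched below, under a few assumptions: age2 is a hypothetical name for the jittered variable, the seed value is arbitrary, and relabeling the variable is just one simple way to keep a sensible axis title.

```stata
* set a seed so that the jitter is reproducible (value is arbitrary)
set seed 12345
* noise lies between -0.005 and 0.005, small relative to a one-week gap
generate age2 = age + 0.01 * (uniform() - 0.5)
* keep a readable axis title on the jittered variable
label variable age2 "age (weeks)"
* same recipe as before, applied to the jittered ages
egen age0 = min(age2), by(baby_id)
gsort -age0 baby_id age2
graph weight age2, c(L)
```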

I have time data with gaps. Data should have been measured regularly, but there are some observations with missing values (somebody was sick, we lost the record, whatever).

| time | var1 | var2 |
|---|---|---|
| 1 | 3 | 4 |
| 2 | 4 | 5 |
| 3 | 5 | 6 |
| 4 | . | . |
| 5 | 6 | 7 |
| 6 | 6 | 6 |
| 7 | . | . |
| 8 | 5 | 5 |
| 9 | 4.5 | 5.2 |

**graph var1 var2 time, c(l)** draws lines boldly jumping across the
gaps. Instead, I want an honest graph showing breaks.

If we could define a group variable that tied together contiguous observations, this would be the same problem as the one we just handled.

Here is how we make that variable. To define the groups, we can set up a counter, block, as the running sum of **var1 == .** over the observations.

**var1 == .** is 1 if var1 is missing and 0 if var1 is present. As we sum
them, we get

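| time | var1 | var2 | block |
|---|---|---|---|
| 1 | 3 | 4 | 0 |
| 2 | 4 | 5 | 0 |
| 3 | 5 | 6 | 0 |
| 4 | . | . | 1 |
| 5 | 6 | 7 | 1 |
| 6 | 6 | 6 | 1 |
| 7 | . | . | 2 |
| 8 | 5 | 5 | 2 |
| 9 | 4.5 | 5.2 | 2 |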

Every time we find a new missing value, the counter jumps by 1. But notice that for this to work properly, it is essential that the data are sorted by time. Our data were sorted, but to be safe about it, we would have typed **sort time** first.
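A sketch of the counter, assuming the variable names shown in the table (time, var1, block):

```stata
* the running sum only makes sense in time order
sort time
* (var1 == .) is 1 where var1 is missing and 0 otherwise; sum() is a running sum
generate block = sum(var1 == .)
* inspect the result
list time var1 var2 block
```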

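Now that we have a group variable, we follow our generic solution: find the minimum of x within each group, sort the groups in descending order of that minimum, and graph with **c(L)**.

In this case, group=block, x=time, and we have two y variables, var1 and var2, which we substitute into the generic commands.

So, the complete solution is to sort by time, generate block, and then apply the generic solution with those substitutions.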
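Put together, the steps might be sketched as follows. The name time0 is hypothetical, and the connect string **LL** assumes the older **graph** syntax, in which one connect style is given per y variable.

```stata
* build the block counter in time order
sort time
generate block = sum(var1 == .)
* generic solution with group = block and x = time
egen time0 = min(time), by(block)
gsort -time0 block time
* one connect style per y variable (assumed: L for var1, L for var2)
graph var1 var2 time, c(LL)
```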