Note: This FAQ is relevant for users of releases prior to Stata 8.

How do I connect points only within groups?

Title   Connecting points within groups
Author Nicholas J. Cox, Durham University, UK

To connect points with straight lines on a two-way graph, specify graph’s connect() option:

. graph y x, connect(l)

The l inside the parentheses is called the connect style, and connect(l) is probably the most common. It connects all the points shown on the graph, joining them according to the current sort order of the data. The connect() option may be abbreviated all the way down to c(), so most people would type

. graph y x, c(l)

Understand that c(l) connects the points in the order of the data. If the data are time series in time order, this gives a line graph showing successive changes, say, from year to year. If the data are in some other order, c(l) may be useful for showing trajectories in the space defined by any two variables.

Another connect style is c(L): it joins two successive points if and only if the x value (on the horizontal axis) of the second is greater than or equal to that of the first. That is, points are joined while x is ascending or constant, and the line is broken wherever x decreases. This may sound like a rather special case, but c(L) can be very useful for ensuring that points are joined only within groups.
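
As a small illustration (the numbers here are made up for this sketch), suppose the data are, in their current order,

         x       y
         1       1
         2       3
         3       2
         1       4
         2       5

Typing

. graph y x, c(L)

joins (1,1) to (2,3) to (3,2), because x ascends from 1 to 3. It does not join (3,2) to (1,4), because x falls from 3 to 1, and it then joins (1,4) to (2,5). The result is two separate lines, whereas c(l) would draw a single line that doubles back from (3,2) to (1,4).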

The general recipe for connecting points within groups consists of three steps:

  1. Make sure the groups are identifiable to Stata. If a categorical variable defining the groups does not already exist, use generate or egen to produce it.
  2. Sort the observations into the order needed, using either sort or gsort. The principle is often that the last shall be first, and the first shall be last.
  3. Use graph (or some other graph command) with c(L).

Say you wish to plot y versus x, connecting the points, by group. That is, you want (x,y) plotted and the points connected for group 1, (x,y) plotted and the points connected for group 2, etc., but you do not want the points in one group connected to the points in another. Were you to type

. graph y x, c(l)

you would obtain a graph with all the points connected. Were you to type

. graph y x, c(l) by(group)

you would obtain nearly what you want, but you would obtain separate graphs for each group. Let's assume you want all the points in one graph. Type

. sort group x
. graph y x, c(L)

Below we explain why this often works and why it sometimes does not, and we show why

. egen xmin = min(x), by(group)
. egen xmax = max(x), by(group)
. gsort -xmin -xmax group x
. graph y x, c(L)

is a better solution.

We also show how to draw other graphs with distinct line segments.

Let’s look at some examples, modeled closely on questions that arose on Statalist.

Connecting points within groups

Question:

I have panel data on the weights of several hundred babies at different ages. I want a plot in which each individual is represented by a distinct connected line.

Here are my data

      age (weeks)     baby_id     weight (kg)
          2             123           20
          3             123           24
          4             123           28
         ...            ...          ...
          2             654           19
          3             654           23
          4             654           27
         ...            ...          ...

(yes, these are hefty babies).

To reiterate, I want to plot weight against age, by baby_id, connecting the points for each baby.

Answer:

If you are lucky, you will need to type no more than

. sort baby_id age
. graph weight age, c(L)

We are sorting the data by baby and, within each baby, by age. Then we are connecting the points from left to right. This will work if the youngest age of each baby is younger than the oldest age of the baby that precedes it because

  1. Stata will start with the first baby, plot its points, and connect them (the points will be connected because they proceed from left to right as we sorted the points in order of age within baby).
  2. Stata will then proceed to the second baby, repeating the process. Stata will not connect the last point of the first baby to the first point of the second baby as long as the first age of the second baby is less than the last age of the first because that would violate the left-to-right rule.

In real data, however, there might be some problems if babies drop out of or enter a study in the middle. Suppose that, after sorting our data, the last observation on one baby (baby 888) and the first observation on the next (baby 889) are

       age (weeks)   baby_id     weight (kg)
          21           888           45
          24           889           34

Baby 889 is older (24 weeks) at its youngest than the preceding baby at its oldest (21 weeks). Stata will draw a line connecting these two babies because variable age is increasing.
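
One way to find out whether your data have this problem is to count such boundaries after sorting. This is only a sketch, not part of the original answer: it counts the observations that start a new baby at an age no younger than the age in the previous observation, which are exactly the places where c(L) would join two babies.

. sort baby_id age
. count if baby_id != baby_id[_n-1] & age >= age[_n-1] & _n > 1

If the count is 0, the simple sort is enough. Otherwise, a more careful sort is needed.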

The way around this is to order the babies so that this does not happen. Let’s call age0 the youngest age at which each baby is observed. Then we want to order the babies so that the babies with the largest values of age0 occur first in the data. Doing that will ensure that, when we proceed from one baby to the next, age decreases, which in turn will prevent c(L) from connecting the points between them.

Obtaining the earliest age (minimum value of x) is easy:

. egen age0 = min(age), by(baby_id)

Putting the data in order so that the babies with the largest values of age0 occur first is also easy:

. gsort -age0 baby_id age

gsort is a variation on Stata’s sort command; it allows us to put the data in ascending or descending order. We specify -age0 to obtain descending order on age0.

Now we are ready to draw our graph. Putting this all together, we type

. egen age0 = min(age), by(baby_id)
. gsort -age0 baby_id age
. graph weight age, c(L)

In other cases, we might need to sort even more carefully using both the minimum and the maximum age recorded for each baby.

. egen age0 = min(age), by(baby_id)
. egen agex = max(age), by(baby_id)
. gsort -age0 -agex baby_id age
. graph weight age, c(L)

For the general problem, if we want to graph y versus x, connecting the points within the group, the solution is

. egen xmin = min(x), by(group)
. egen xmax = max(x), by(group)
. gsort -xmin -xmax group x
. graph y x, c(L)

Unfortunately, even this code is not bulletproof if we have the following situation, illustrated yet again by baby weights.

  1. Some babies appear in the study just once, and all of them at the same age.
  2. For those babies, the minimum and maximum values of the x variable are therefore all the same. Sorting cannot help: however we shuffle those babies, their data points still have the same x value, so c(L) joins them.

The way to fix this is to add some random noise to the x variable so that the ages differ. First find the minimum gap between x values; in our example with time in weeks, suppose it is one week. We want to add random noise that is small relative to that. uniform() generates uniform random numbers between 0 and 1, so we try

. gen x2 = x + 0.01 * (uniform() - 0.5)
. egen xmin = min(x2), by(group)
. egen xmax = max(x2), by(group)
. gsort -xmin -xmax group x2
. graph y x2, c(L)

The random noise will be at most 0.005 and at least -0.005, which is tiny relative to a gap of one week. For presentation purposes, we also need to work on the axis titles, because the graph now plots x2 rather than x.
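
One way to tidy the x-axis title, sketched here on the assumption that graph takes its axis title from the variable label, is to label x2 before drawing the graph. The label text is illustrative and assumes x is age in weeks.

. label variable x2 "age (weeks)"
. graph y x2, c(L)

Typing set seed with a number of your choosing, say set seed 12345, before the gen command makes the random noise, and hence the graph, reproducible.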

Skipping gaps

Question:

I have time data with gaps. Data should have been measured regularly, but there are some observations with missing values (somebody was sick, we lost the record, whatever).

       time   var1    var2
        1       3       4
        2       4       5
        3       5       6
        4       .       .
        5       6       7
        6       6       6
        7       .       .
        8       5       5
        9       4.5     5.2

graph var1 var2 time, c(l) draws lines boldly jumping across the gaps. Instead, I want an honest graph showing breaks.

Answer:

If we could define a group variable that tied together contiguous observations, this would be the same problem as the one we just handled.

Here is how we make that variable:

Define the groups. We can set up a counter

. gen block = sum(var1 == .)

var1 == . is 1 if var1 is missing and 0 if var1 is present. sum() returns the running sum of those values, so we get

       time   var1    var2    block
        1       3       4       0
        2       4       5       0
        3       5       6       0
        4       .       .       1
        5       6       7       1
        6       6       6       1
        7       .       .       2
        8       5       5       2
        9       4.5     5.2     2

Every time we find a new missing value, the counter jumps by 1. For this to work properly, it is essential that the data be sorted by time. Our data were already sorted, but to be safe, we would type

. sort time
. gen block = sum(var1 == .)

Now that we have a group variable, we follow our generic solution, which is

. egen xmin = min(x), by(group)
. gsort -xmin group x
. graph y x, c(L)

In this case, group=block, x=time, and we have two y variables, var1 and var2. Substituting, we would type

. egen tmin = min(time), by(block)
. gsort -tmin block time
. graph var1 var2 time, c(L)

So, the complete solution is

. sort time
. gen block = sum(var1 == .)
. egen tmin = min(time), by(block)
. gsort -tmin block time
. graph var1 var2 time, c(L)