Title | Stata 6: Generating variables that contain repeating sequences of numbers |

Author | David Reichel, StataCorp |

Sometimes, it is valuable to generate a variable that contains a sequence of numbers in a particular pattern. Such a variable could be used as part of a match-merge procedure to give a certain shape or structure to the resulting dataset. For example, it may be useful to create a variable that contains observation identifiers or an automatic numbering of levels of factors or categorical variables.

The **fill()** function of the
**egen** command is
remarkably useful for this purpose. To create a variable that repeats the
pattern

10 10 12 12 20

you could write the following commands:

set obs 1000

egen seq = fill(10 10 12 12 20 10 10 12 12 20)

This would create a variable **seq** with 1000 observations, which would
repeat the sequence 200 times. The command looks somewhat complicated because
the pattern must be typed twice inside the parentheses to inform Stata of the
exact pattern desired.

Please note:

- You will need to use a **set obs** command or have a dataset already open so that Stata will know how many observations to generate.
- The **list** command will display each sequence vertically. In this article, however, sequences will be listed horizontally.

Two commands developed by N. J. Cox are also useful. The first is the
**seq** command (Stata Technical Bulletin 37, dm44), which can be
downloaded for free (type **help net** for details). **seq** creates
a new variable that contains a sequence of integers such as

1 2 3 1 2 3 1 2 3

or

1 1 1 2 2 2 3 3 3

The command can specify the beginning number (f), the ending number (t), and how many times each number is repeated (b). For example, the two sequences above can be generated by the commands

seq a, f(1) t(3)

and

seq b, f(1) t(3) b(3)

This command can use initial integers other than **1** and can produce
decreasing sequences. It also supports **by**, **if**, and **in**.

A similar function can be found in Stata Technical Bulletin 50, dm70,
“Extensions to generate, extended”, by N. J. Cox. The syntax is
different: it is called through the **egen** command as the **seq()**
function, that is, **seq** with parentheses added. For example,

egen d = seq(), f(10) t(12)

generates the sequence:

10 11 12 10 11 12 10 11 12

Although slightly more complex to use, this command is designed to give more consistent results with datasets that require sorting.
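As a minimal sketch, the block option can be combined with **egen**'s **seq()** in the same way as with the **seq** command. The example assumes a Stata installation in which **seq()** is available to **egen** (installed from dm70 or, in more recent releases, included with official **egen**) and spells the options out as from(), to(), and block(), of which f(), t(), and b() are abbreviations. The variable names d and e are only illustrative.

```stata
* Sketch only: assumes egen's seq() function is installed or built in
clear
set obs 12
* Cycle through 10, 11, 12
egen d = seq(), from(10) to(12)
* Repeat each value twice before moving on: 1 1 2 2 3 3 1 1 2 2 3 3
egen e = seq(), from(1) to(3) block(2)
list d e, noobs
```
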

There are two additional functions to consider. They are both associated with the generate command.

To generate a sequence of numbers like

1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6

the **mod(x,y)** function can be used. This function returns the
remainder when **x** is divided by **y**. Different sequences of
consecutive numbers can be generated by using an expression that includes
**_n** for **x** and by setting **y** equal to the total number of
observations within each repeated pattern. **_n** is called an
“underscore variable”. It is a built-in system variable that
contains the number of the current observation.

For example, to generate the repeated sequence above, type

gen seq2 = mod(_n-1,6) + 1

It is valuable to experiment with the **mod()** function to see what
results can be obtained. For example, try using **_n** instead of
**_n-1** in the formula, and try removing the **+ 1** at the end of
the expression. To increment by two instead of by one, multiply the
**mod()** term by 2 and add 2:

gen seq3 = 2*mod(_n-1,6) + 2

This will generate the following sequence:

2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
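The experiments suggested above can be run side by side. The following is a small sketch on 12 observations; the variable names m0 through m3 are only illustrative, and the comment above each command gives the values it produces for observations 1 through 12.

```stata
* Sketch: compare variants of the mod() expression on 12 observations
clear
set obs 12
* _n instead of _n-1, no "+ 1":  1 2 3 4 5 0 1 2 3 4 5 0
gen m0 = mod(_n, 6)
* _n-1, no "+ 1":                0 1 2 3 4 5 0 1 2 3 4 5
gen m1 = mod(_n-1, 6)
* the original expression:       1 2 3 4 5 6 1 2 3 4 5 6
gen m2 = mod(_n-1, 6) + 1
* increments of two:             2 4 6 8 10 12 2 4 6 8 10 12
gen m3 = 2*mod(_n-1, 6) + 2
list, noobs
```

Subtracting 1 from **_n** makes the remainder start at 0 in the first observation, and the **+ 1** then shifts the sequence so that it runs from 1 to 6 rather than from 0 to 5.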

The **fill()** function might be easier to use for simple sequences such
as these. If the sequence involves consecutive integers, the **seq()**
function can handle long repeating patterns, which would be tedious to type
out using the **fill()** function. However, if you wanted to generate
non-consecutive numbers (like the above example) over a long range, say from
one to one thousand, and repeat them many times, using the **mod(x,y)**
function would save typing.

To repeat each number a specific number of times, specify the block option
(b) of **seq** (as discussed above), or use the **int(x)**
function. This function returns the integer obtained by truncating
**x**. Thus, **int(5.2)** is **5**. If you want the following
repeated pattern

1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9

the command is

gen seq = int((_n-1)/2) + 1

Again, it is valuable to experiment with using **_n** instead of
**_n-1** and also eliminating the **+ 1** at the end. You can also
multiply the right side of the equation by any constant to make the sequence
increment by larger or smaller steps between groups of numbers. Dividing by
a number other than 2 can change the length of each repeated group.
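These variations on the **int()** expression can also be compared directly. The sketch below uses 12 observations; the variable names i0 through i4 are only illustrative, and the comment above each command gives the values it produces.

```stata
* Sketch: compare variants of the int() expression on 12 observations
clear
set obs 12
* _n instead of _n-1, no "+ 1":  0 1 1 2 2 3 3 4 4 5 5 6
gen i0 = int(_n/2)
* _n-1, no "+ 1":                0 0 1 1 2 2 3 3 4 4 5 5
gen i1 = int((_n-1)/2)
* the original expression:       1 1 2 2 3 3 4 4 5 5 6 6
gen i2 = int((_n-1)/2) + 1
* divide by 3 for groups of 3:   1 1 1 2 2 2 3 3 3 4 4 4
gen i3 = int((_n-1)/3) + 1
* multiply by 10 for larger steps: 10 10 20 20 30 30 ...
gen i4 = 10*(int((_n-1)/2) + 1)
list, noobs
```

With **_n** in place of **_n-1**, the first group is one observation short, which is why the expression subtracts 1 before dividing.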

What if you need a variable that repeats values within a sequence and repeats the sequence itself? For example,

1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3

For this sequence, the **fill()** or **seq()** function would
still be the easiest to use, but I will demonstrate an alternative
procedure.

The **mod(x,y)** function and the **int(x)** function can be used
together. The **mod(x,y)** function helped to create a sequence that
incremented by a given amount and was repeated. The **int()** function
allowed us to repeat values within that sequence. To create the above
sequence, type

gen seq = int((mod(_n-1,6))/2) + 1

Notice you can change the length of the sequence by changing the number
**6**, and you can change the number of times that each value
repeats by changing the number **2**.

To generate a sequence like

1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3

you can change the **6** to a **9** and the **2** to a **3**.

gen seq = int((mod(_n-1, 9))/3) + 1
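The two examples follow one rule: the number inside **mod()** is the number of distinct values multiplied by the number of times each value repeats. One way to sketch this is with local macros, here hypothetically named k (distinct values) and b (repeats per value), run together from a do-file. The variable name seqkb is likewise only illustrative.

```stata
* Sketch: k and b are hypothetical local macro names, not part of any command
clear
set obs 18
* number of distinct values in the pattern
local k 3
* number of times each value repeats
local b 2
* pattern length is k*b; this yields 1 1 2 2 3 3 1 1 2 2 3 3 ...
gen seqkb = int(mod(_n-1, `k'*`b')/`b') + 1
list seqkb, noobs
```

Setting k to 3 and b to 3 gives a pattern length of 9 and reproduces the second sequence above.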

There are other useful commands for special circumstances. The
**group()** function of the **egen** command is described in a FAQ
written by N. J. Cox and W. Gould entitled
"How do I create individual
identifiers numbered from 1 upwards?"

- Cox, N. J. 1997. dm44: Sequences of integers. *Stata Technical Bulletin* 37: 2–4.
- Cox, N. J. 1999. dm70: Extensions to generate, extended. *Stata Technical Bulletin* 50: 9–17.