Note: This FAQ is for users of Stata 6. It is not relevant for more
recent versions.
Stata 6: How do I create a variable that contains a repeating sequence of numbers?
Title: Stata 6: Generating variables that contain repeating sequences of numbers
Author: David Reichel, StataCorp
Date: December 1999
Sometimes, it is valuable to generate a variable that contains a sequence of
numbers in a particular pattern. Such a variable could be used as part of a
match-merge procedure to give a certain shape or structure to the resulting
dataset. For example, it may be useful to create a variable that contains
observation identifiers or an automatic numbering of levels of factors or
categorical variables.
The fill() function of the egen command is remarkably useful for this
purpose. To create a variable that repeats the pattern
10 10 12 12 20
you could write the following commands:
set obs 1000
egen seq = fill(10 10 12 12 20 10 10 12 12 20)
This would create a variable seq with 1000 observations, which would
repeat the sequence 200 times. Because this pattern is somewhat complicated,
it must be typed twice inside the parentheses so that Stata can work out the
exact pattern desired.
Please note:
- You will need to use a set obs command or have a dataset already open so
  that Stata will know how many observations to generate.
- The list command will display each sequence vertically. In this article,
  however, sequences will be listed horizontally.
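The two notes above can be tried out with a small sketch like the following. It assumes an empty scratch session (clear drops any data in memory), and the observation count of 15 is chosen only to keep the listing short.

```stata
* Scratch session: clear drops any data currently in memory
clear
* fill() needs to know how many observations to fill
set obs 15
* The five-number pattern is typed twice so that fill() can detect the repeat
egen seq = fill(10 10 12 12 20 10 10 12 12 20)
* list prints one observation per row, that is, vertically
list seq
```

With 15 observations, the five-number pattern should appear three times in the listing.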
Two commands developed by N. J. Cox are also useful. The first is the
seq command (Stata Technical Bulletin 37, dm44), which can be
downloaded for free (type help net for details). seq creates
a new variable that contains a sequence of integers such as
1 2 3 1 2 3 1 2 3
or
1 1 1 2 2 2 3 3 3
The command can specify the beginning number (f), the ending number (t), and
how many times each number is repeated (b). For example, the two sequences
above can be generated by the commands
seq a, f(1) t(3)
and
seq b, f(1) t(3) b(3)
This command can use initial integers other than 1 and can produce
decreasing sequences. It also supports by, if, and in.
A similar function can be found in Stata Technical Bulletin 50, dm70,
“Extensions to generate, extended”, by N. J. Cox. The syntax is
different: it is called through the egen command as the function seq(),
written with parentheses. For example,
egen d = seq(), f(10) t(12)
generates the sequence:
10 11 12 10 11 12 10 11 12
Although slightly more complex to use, this function is designed to give more
consistent results with datasets that require sorting.
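As a rough sketch, the egen version can also repeat each value within the sequence. This assumes that the seq() function is installed and that it accepts a b() block option in the same way as the seq command described earlier; the variable name blocks is only illustrative.

```stata
* Scratch session: clear drops any data currently in memory
clear
set obs 12
* Assumed: b(2) repeats each value twice, as b() does for the seq command
* Intended pattern: 1 1 2 2 3 3, repeated
egen blocks = seq(), f(1) t(3) b(2)
list blocks
```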
There are two additional functions to consider. They are both associated
with the generate command.
To generate a sequence of numbers like
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
the mod(x,y) function can be used. This function returns the
remainder when x is divided by y. Different sequences of
consecutive numbers can be generated by using an expression that includes
_n for x and by setting y equal to the total number of
observations within each repeated pattern. _n is called an
“underscore variable”. It is a built-in system variable that
contains the number of the current observation.
For example, to generate the repeated sequence above, type
gen seq2 = mod(_n-1,6) + 1
It is valuable to experiment with the mod() function to see what
results can be obtained. For example, try using _n instead of
_n-1 in the formula, and try removing the + 1 at the end of
the expression. To increment by two instead of by one, multiply the entire
right side of the equation by 2, which turns the + 1 into + 2:
gen seq3 = 2*mod(_n-1,6) + 2
This will generate the following sequence:
2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
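The experiments suggested above might look like the following sketch. The variable names and the starting value, step, and cycle length in the last line are illustrative assumptions, not part of the original examples.

```stata
* Scratch session: clear drops any data currently in memory
clear
set obs 12
* Without the + 1, the remainders run 0 1 2 3 4 5
generate m_zero = mod(_n-1, 6)
* With _n instead of _n-1, the cycle shifts to 1 2 3 4 5 0
generate m_shift = mod(_n, 6)
* One possible general form: start + step*mod(_n-1, length)
* Here start = 5, step = 3, length = 4, giving 5 8 11 14 repeated
generate m_step = 5 + 3*mod(_n-1, 4)
list
```

Writing the expression as start + step*mod(_n-1, length) keeps the three choices (first value, increment, and cycle length) separate, which may be easier to adjust than a rearranged formula.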
The fill() function might be easier to use for simple sequences such
as these. If the sequence involves consecutive integers, the seq()
function can handle long repeating patterns, which would be tedious to type
out using the fill() function. However, if you wanted to generate
non-consecutive numbers (as in the example above) ranging from one to one
thousand and to repeat them many times, the mod(x,y) function would save
typing.
To repeat each number a specific number of times, specify the block size
with the b() option of seq (as discussed above), or use the int(x)
function. This function returns the integer obtained by truncating
x. Thus, int(5.2) is 5. If you want the following
repeated pattern
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9
the command is
gen seq4 = int((_n-1)/2) + 1
Again, it is valuable to experiment with using _n instead of
_n-1 and also eliminating the + 1 at the end. You can also
multiply the right side of the equation by any constant to make the sequence
increment by larger or smaller steps between groups of numbers. Dividing by
a number other than 2 can change the length of each repeated group.
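A minimal sketch of these variations follows. The variable names, the block length of 3, and the multiplier of 10 are arbitrary choices made for illustration.

```stata
* Scratch session: clear drops any data currently in memory
clear
set obs 12
* Dividing by 3 gives blocks of three: 1 1 1 2 2 2 3 3 3 4 4 4
generate b3 = int((_n-1)/3) + 1
* With _n and no + 1, the first block is cut short: 0 1 1 2 2 3 3 ...
generate b_raw = int(_n/2)
* Multiplying by 10 widens the steps: 10 10 20 20 30 30 ...
generate b10 = 10*(int((_n-1)/2) + 1)
list
```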
What if you need a variable that repeats values within a sequence and
repeats the sequence itself? For example,
1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3
For this sequence, the fill() or the seq() functions would
still be the easiest to use, but I will demonstrate an alternative
procedure.
The mod(x,y) function and the int(x) function can be used
together. The mod(x,y) function helped to create a sequence that
incremented by a given amount and was repeated. The int() function
allowed us to repeat values within that sequence. To create the above
sequence, type
gen seq5 = int((mod(_n-1,6))/2) + 1
Notice you can change the length of the sequence by changing the number
6, and you can change the number of times that each value
repeats by changing the number 2.
To generate a sequence like
1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
you can change the 6 to a 9 and the 2 to a 3:
gen seq6 = int((mod(_n-1,9))/3) + 1
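The two numbers in this formula are linked: the cycle length equals the number of distinct values times the number of repeats. One way to make that explicit is sketched below, meant to be run as a do-file. The local macro names nvals and reps and the variable name combo are hypothetical.

```stata
* Scratch session: clear drops any data currently in memory
clear
set obs 24
* Hypothetical settings: 4 distinct values, each repeated 3 times
local nvals 4
local reps 3
* The cycle length is nvals*reps (12 here), so only two numbers need changing
generate combo = int(mod(_n-1, `nvals'*`reps')/`reps') + 1
* Pattern: 1 1 1 2 2 2 3 3 3 4 4 4, repeated
list combo
```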
There are other useful commands for special circumstances. The
group() function of the egen command is described in a FAQ
written by N. J. Cox and W. Gould entitled "How do I create individual
identifiers numbered from 1 upwards?"
References
- Cox, N. J. 1997. dm44: Sequences of integers. Stata Technical Bulletin
  37: 2–4.
- Cox, N. J. 1999. dm70: Extensions to generate, extended. Stata Technical
  Bulletin 50: 9–17.