The following is a 3-page extract from the 32 pages of Lecture 4, NetCourse® 101, An Introduction to Stata.

## Sample lecture from NetCourse 101

### Wide versus long data

The two representations of the husbands-and-wives data above are called the wide and the long forms, and the choice between forms arises in a variety of contexts. Consider these two datasets:

Wide form
id   sex   inc80   inc81
 1     0    5000    5500
 2     1    2000    2200
 3     0    3000    2000

Long form
id   year   sex    inc
 1     80     0   5000
 1     81     0   5500
 2     80     1   2000
 2     81     1   2200
 3     80     0   3000
 3     81     0   2000

Both datasets record the same data, but they organize the data differently. Whenever you deal with repeated observations, this organizational issue arises, and there is no single right answer; which form is better depends on what you want to do with the data.

• Suppose that I want to look at income growth by sex. The wide form makes that easy to see. I might, for instance, type
. generate gro = (inc81-inc80)/inc80

. regress gro sex

• Suppose that I want to study income level by sex. Now it is the long form that more easily shows the answer to the question. I might type
. xtset id

. xtreg inc sex, be


The statistical commands in my examples are used for illustration—what is important is realizing that sometimes I want one form and sometimes the other.
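To make the contrast concrete, here is a minimal do-file sketch that enters each form of the example data and computes income growth in both. The variable names come from the tables above; the by-group lag used for the long form is my own illustration of why the wide form is more convenient for this particular question, not the only way to do it.

```stata
* Wide form: growth compares two variables in the same observation.
clear
input id sex inc80 inc81
1 0 5000 5500
2 1 2000 2200
3 0 3000 2000
end
generate gro = (inc81 - inc80)/inc80
list id sex gro

* Long form: the same growth rate needs the previous observation
* within each id, so we sort by year within id and use a lag.
clear
input id year sex inc
1 80 0 5000
1 81 0 5500
2 80 1 2000
2 81 1 2200
3 80 0 3000
3 81 0 2000
end
bysort id (year): generate gro = (inc - inc[_n-1])/inc[_n-1]
list id year sex inc gro
```

In the long form, gro is missing in the first observation of each id because there is no earlier year to compare against.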

Given either of the above datasets, Stata can make the other, and Stata can convert it back again. You already know one way to do this because, conceptually, there is no difference between the 2-years-of-income example and the previous husbands-and-wives data example.

In fact, Stata has a command to make switching between forms easier, and this command will deal not just with two members within a group (such as husbands and wives or one year and another), but with many. First, we need some jargon:

Wide form
               constants        variables
               id     sex       inc80   inc81
group 1-->     1      0         5000    5500
group 2-->     2      1         2000    2200
group 3-->     3      0         3000    2000

Long form
               constant   "the" grouping variable   constant   variable
               id         year                      sex        inc
group 1-->     1          80                        0          5000
               1          81                        0          5500
group 2-->     2          80                        1          2000
               2          81                        1          2200
group 3-->     3          80                        0          3000
               3          81                        0          2000

A variable is called a "within-group constant", or just a "constant", if its value does not vary within a group. Variables id and sex are constants.

A variable is called a "within-group variable", or just a "variable", if its value varies within a group. In the wide form, each within-group variable is stored as several dataset variables, one per group member; we have variables inc80 and inc81. In the long form, each within-group variable is a single dataset variable; we have variable inc, and another variable, the "grouping variable" (year), identifies which member of the group each observation records.

We can get from one form to the other easily with the reshape command. Let me first redraw the wide and long data. Think of the data as a collection of observations x_ij.

Wide form
 i             x_ij
id   sex   inc80   inc81
 1     0    5000    5500
 2     1    2000    2200
 3     0    3000    2000

Long form
 i      j         x_ij
id   year   sex    inc
 1     80     0   5000
 1     81     0   5500
 2     80     1   2000
 2     81     1   2200
 3     80     0   3000
 3     81     0   2000

The information reshape needs is the identity of the "i" variable(s), the identity of the "j" variable(s), and the identity of the "x_ij" variable(s). The syntax is easy.

. reshape long inc, i(id) j(year)   (goes from wide to long)

. reshape wide inc, i(id) j(year)   (goes from long to wide)


After the reshape long or the reshape wide, we specify the "x_ij" variable name (when in long form) or variable stub name (when in wide form): inc.

The i() option identifies each logical observation—the i subscript, id. (Think in terms of the data in wide form.)

The j() option identifies the name of the grouping variable—the j subscript, year. Stata figures out the values contained in the grouping variable (80 and 81) from either the variable names when in wide form or the variable values when in long form.

Notice that we do not specify the sex variable. With reshape, the unspecified variables should be constant within each level of the i() variables. If this is not true, reshape will give you an informative error message. I introduced an error into the long form of the data, and then I tried to type

. reshape wide inc, i(id) j(year)
(note:  j = 80 81)
sex not constant within id
Type "reshape error" for a listing of the problem observations.
r(9);


Stata saw that there was an error—sex was not constant within id. We can find out more with the reshape error command:

. reshape error
(note:  j = 80 81)

i (id) indicates the top-level grouping such as subject id.
j (year) indicates the subgrouping such as time.
xij variable is inc.
Thus, the following variable(s) should be constant within i:
sex

sex not constant within i (id) for 1 value of i:

            id       year        sex
  5.         3         80          0
  6.         3         81          1

(data now sorted by id year)


reshape found the observation where I introduced the error.
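If you want to reproduce this situation yourself, the following do-file is a minimal sketch. It assumes the error was a changed value of sex for id 3 in year 81, which is consistent with the listing above, and it adds a check of my own (the hypothetical variable sex_varies) that finds the inconsistent group before reshape is run.

```stata
* Recreate the long-form example data.
clear
input id year sex inc
1 80 0 5000
1 81 0 5500
2 80 1 2000
2 81 1 2200
3 80 0 3000
3 81 0 2000
end

* Assumed error: change sex for id 3 in year 81.
replace sex = 1 if id == 3 & year == 81

* After sorting on sex within id, the first and last values
* differ only when sex is not constant within that id.
bysort id (sex): generate byte sex_varies = sex[1] != sex[_N]
list id year sex if sex_varies

* capture noisily displays the error message but lets the
* do-file continue; _rc then holds the return code.
capture noisily reshape wide inc, i(id) j(year)
display "reshape return code: " _rc
```

The check is optional, because reshape reports the same problem on its own, but it can be handy when you want to fix the data before attempting the conversion.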

After we have used reshape once, Stata remembers the structure, and if the data are currently wide, we can simply type

. reshape long


and the data are switched to the long form. On the other hand, if the data are currently long, we can type

. reshape wide


and the data are switched to the wide form.

In fact, once we have given the definitions, for the remainder of our session we can switch back and forth by typing reshape wide and reshape long without redefining the groups, variables, and constants.

For instance, I loaded the data in wide form:

. list

     +--------------------------+
     | id   sex   inc80   inc81 |
     |--------------------------|
  1. |  1     0    5000    5500 |
  2. |  2     1    2000    2200 |
  3. |  3     0    3000    2000 |
     +--------------------------+

. reshape long inc, i(id) j(year)
(note: j = 80 81)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       6
Number of variables                   4   ->       4
j variable (2 values)                     ->   year
xij variables:
                            inc80 inc81   ->   inc
-----------------------------------------------------------------------------

. list

     +------------------------+
     | id   year   sex    inc |
     |------------------------|
  1. |  1     80     0   5000 |
  2. |  1     81     0   5500 |
  3. |  2     80     1   2000 |
  4. |  2     81     1   2200 |
  5. |  3     80     0   3000 |
     |------------------------|
  6. |  3     81     0   2000 |
     +------------------------+


I can now convert back to the wide form by simply typing

. reshape wide
(note: j = 80 81)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                        6   ->       3
Number of variables                   4   ->       4
j variable (2 values)              year   ->   (dropped)
xij variables:
                                    inc   ->   inc80 inc81
-----------------------------------------------------------------------------

. list

     +--------------------------+
     | id   inc80   inc81   sex |
     |--------------------------|
  1. |  1    5000    5500     0 |
  2. |  2    2000    2200     1 |
  3. |  3    3000    2000     0 |
     +--------------------------+


If I typed reshape long, the data would be long again.

There is nothing magical about having data for 2 years within a group; both the manual (see [D] reshape) and the online help (see help reshape) show this example with 3 years of data, and there is no reason you cannot use reshape with 4, 5, or more years of data. Nor are you limited to one within-group variable. In some other example, we might have typed

. reshape long inc hours wksue, i(id) j(year)


to go from wide to long form. Say that year takes on values from 80 to 88 and there are some additional unspecified variables (sex, age, and ownshome).

Thus, reshape long inc hours wksue, i(id) j(year) says that the variables are

Wide form                          Long form
id                                 id
sex, age, ownshome                 sex, age, ownshome
                                   year
inc80, inc81, ..., inc88           inc
hours80, hours81, ..., hours88     hours
wksue80, wksue81, ..., wksue88     wksue
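A scaled-down version of this mapping can be tried directly. The sketch below uses made-up values, only two years (80 and 81), and only two of the stubs (inc and hours), purely to keep the input short; sex plays the role of the unspecified constants.

```stata
* Hypothetical wide data: one constant and two within-group variables.
clear
input id sex inc80 inc81 hours80 hours81
1 0 5000 5500 40 42
2 1 2000 2200 35 35
end

* Each stub listed after "reshape long" becomes one long-form variable.
reshape long inc hours, i(id) j(year)
list, sepby(id)

* The stored definition lets us return to the wide form without
* retyping the stubs or options.
reshape wide
describe
```

The same pattern should extend to more years and more stubs, as long as every wide variable is named with a stub followed by the value of the grouping variable.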

reshape has other, more advanced features that you can learn about in the manual (see [D] reshape). Most cases can be handled with the simple syntax I have illustrated.