Manipulating large datasets in Stata: how do we manage?

Speakers  Michael Rosato, Office of National Statistics
Seeromanie Harding, Office of National Statistics
E. McVey, Office of National Statistics
Date  5 June 1997

This paper discusses the manipulation of large datasets in Stata using the Office of National Statistics Longitudinal Study. The study is based on a one per cent sample of the population of England and Wales (about 650,000 persons) with 22 years of follow-up and is the largest cohort study in this country.

We present findings on socio-economic variations in health using two examples of current analyses, both of which used Cox regression models in Stata. The first shows the impact of social class mobility on mortality among middle-aged men, and the second examines the incidence of cancers among second-generation Irish people living in England and Wales.

Previously, analysis of Longitudinal Study data was mainly limited to descriptive statistics, because use of individual-level data was restricted to mainframe computing. This made it difficult to take advantage of advances in software for statistical modelling. Recent changes in protocols have enabled analysis of individual-level data in a PC environment using Stata. This has brought new problems associated with the large size of the datasets and the capability of the machines. We discuss the problems encountered and the methods used to overcome the difficulties involved in analysing such large national datasets in Stata.

