
Re: st: Problem with stset and sttocc

From   Mike Lacy <>
Subject   Re: st: Problem with stset and sttocc
Date   Wed, 24 Oct 2007 09:41:44 -0600

>Date: Tue, 23 Oct 2007 11:45:13 +0100
>From: Andrew Hall <>
>Subject: st: Problem with stset and sttocc
>I am having problems using stset and sttocc to create a case-control
>data set from a simple cross-sectional data set. Sttocc is selecting
>cases as controls. I am using Intercooled STATA 9.2 on Windows
>XP. This is what is happening.
>I have cross-sectional survey data on 7,572 Ethiopian schoolchildren
>of whom 1,283 are orphans and 6,289 are non-orphans. The variable
>orphan is coded as 1 = orphan, 0 = non orphan.
>The data were collected between 28/11/2006 and 8/2/2007.
>I want to randomly select a non-orphan for each orphan matched on sex
>and age at least to create a case-control data set.
>I have stset the dataset using either date of visit (dov1) as the
>time variable or created a new time variable fixed on one arbitrary
>date e.g. 01/01/2007 (dovfixed)
> stset dov1, failure(orphan=1)
> or stset dovfixed, failure(orphan=1)
>This creates new temporary variables including _d which cross-tabs
>perfectly with orphan (_d = 1 and orphan = 1 n=1283, _d =0 and orphan
>= 0, n=6289). It seems that the dataset have been properly stset (or has it?).
>I then use sttocc to match each case to one control on the variables
>sex (1=male; 2=female) and ageyrs (in years) as follows:
> sttocc, match (sex ageyrs) number(1)
>This works and cannot find controls for 2 cases only.
>But when I do a cross-tab of orphan by _case I find that 278
>controls who should be non-orphans have been selected from the cases
>(orphans, failure=1). All controls should be selected from the non-orphans.
>Snapspan has no effect on the dataset as all id numbers in the data
>set are unique anyway.
>Why are orphans (failure) being selected as controls for orphans
>(failure) when I have specified non-orphans?

I'm not certain about all the details of how you are using -sttocc-, but I can think of one possible misunderstanding that might be confusing you. Since -sttocc- purports to do genuine incidence-density sampling, i.e., sampling controls from the risk set at time = t for a case that occurs at time = t, it might well, and correctly, select a control that later becomes a case.

For example, suppose you have a child ABC who becomes orphaned at time3. Suppose that, at time3, there are 200 subjects who have not yet become orphaned, i.e., are in the risk set at time3. Suppose that one of them, child XYZ, is chosen from this risk set as one of the controls for child ABC, and further suppose that child XYZ becomes a case at time5. If you are thinking of controls as "children who never experience the event," this could produce confusing results, since you would be defining controls in a way contradictory to the idea of sampling from the risk set.
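
To make the risk-set idea concrete, here is a minimal sketch with made-up data. The variable names id, time, and fail and the four records are hypothetical, not taken from your survey, and I am assuming the usual _case, _set, and _time variables that -sttocc- creates. Subject 1 fails at time 3, when subjects 2, 3, and 4 are all still at risk, so subject 2 is eligible to be drawn as a control for subject 1 even though subject 2 itself fails later, at time 5.

```stata
* Hypothetical toy cohort: two failures (ids 1 and 2), two censored (ids 3 and 4)
clear
input id time fail
1 3 1
2 5 1
3 7 0
4 7 0
end

* Declare survival-time data; fail==1 marks the event
stset time, failure(fail==1) id(id)

* Draw one control per case from the risk set at each failure time
set seed 12345
sttocc, number(1) nodots

* Inspect the sampled sets: _case is 1 for the case and 0 for its control
sort _set _case
list _set id _case _time fail, sepby(_set)
```

Because the control is drawn at random, whether subject 2 actually turns up as a control in the first set depends on the seed; the point is only that it is a legitimate member of that risk set. On your own data, the cross-tab of orphan by _case that you already ran is the matching check.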

A more general question would be whether to do a case-control study with these data at all, rather than a survival analysis. Perhaps there are some cost/effort savings (e.g., collecting additional explanatory variables) here that were not relevant to mention (quite possible), but otherwise it sounds like you have the whole data set in hand, which would make me think "why sample?".


Mike Lacy
Fort Collins CO USA
(970) 491-6721 office
