# Re: st: RE: Compute a summary variable based on a predefined algorithm

| From | n j cox |
|---|---|
| To | statalist@hsphsun2.harvard.edu |
| Subject | Re: st: RE: Compute a summary variable based on a predefined algorithm |
| Date | Wed, 21 Mar 2007 18:48:38 +0000 |

```
Shuaib Kauchali sent an attachment. It got through, but
Statalist etiquette is clear: you should not send attachments.
To most list members, they will be unreadable garbage in any
case. I have edited the remainder, which is rendered difficult
to read by the encoding produced by the attachment.

Such etiquette exists to make the list work well for all. Badly prepared postings
typically just get ignored. Everyone's time is wasted.

Let's start again. The problem as I understand it is as follows.

1. Panel data, identifier -childid-, time -day-.

2. Diarrhoea is present (1) or not present (0) or missing (.).
To begin, we take no news to mean good news, so that missing
is taken as equivalent to 0. So work with a copy:

clonevar diar2 = diar
replace diar2 = 0 if missing(diar)

3. A "spell" of diarrhoea is defined like this:
(a) Child suffers (1).
(b) Gaps within spell of one or two 0s are allowed. (Gaps of one or
two missings are also allowed, but these have been recoded as 0s
by 2. above.)

So, gaps of three or more 0s mean that any following 1s
start a new spell. The word "run" is also used in some literature,
but is better avoided in this context.

My previous post referred you to -tsspell- on SSC. Allowing gaps of two
observations within spells is a problem discussed and solved in the help
file of -tsspell-!  Evidently, you did not read it.

Let us do it directly, anyway. First, "fill in" gaps of one or two.
We do this on -diar2-, so the original data remain unchanged.

bysort childid (day) : replace diar2 = 1 ///
if diar2 == 0 & diar2[_n-1] == 1 & diar2[_n+1] == 1
by childid: replace diar2 = 1 ///
if diar2 == 0 & diar2[_n-1] == 1 & diar2[_n+2] == 1
by childid: replace diar2 = 1 ///
if diar2 == 0 & diar2[_n-2] == 1 & diar2[_n+1] == 1
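
* Optional check (a sketch added for illustration, not part of the
* original post): list the days that the three statements above filled
* in, i.e. days recorded as 0 or missing in -diar- that are now 1 in
* -diar2-. Each listed day should sit in a gap of one or two days
* between two days with diarrhoea.
list childid day diar diar2 if diar2 == 1 & diar != 1, sepby(childid)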

Now identify spells of diarrhoea with an indicator for
the first day:

by childid: gen byte spell_first = diar2 == 1 & diar2[_n-1] != 1

and an identifier for each spell:

by childid: gen spell_id = cond(diar2 == 1, sum(spell_first), 0)

Their lengths are then

bysort childid spell_id: gen spell_length = cond(spell_id == 0, 0, _N)

and the numbers of spells of >= 14 days are

by childid: egen no_gt14 = total(spell_first * (spell_length >= 14))

This variable is the same for every observation of a given
child. To get summary statistics using information just once
from any child,

egen tag = tag(childid)
tab no_gt14 if tag
su no_gt14 if tag
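
* Optional extra (a sketch, not part of the original post): record the
* first and last day of each spell. -spell_start- and -spell_end- are
* illustrative names. With one observation per child per day,
* spell_end - spell_start + 1 should equal spell_length.
bysort childid spell_id (day): gen spell_start = day[1] if spell_id > 0
by childid spell_id: gen spell_end = day[_N] if spell_id > 0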

Here is the code in one:

------------------------------
clonevar diar2 = diar
replace diar2 = 0 if missing(diar)
bysort childid (day) : replace diar2 = 1 ///
if diar2 == 0 & diar2[_n-1] == 1 & diar2[_n+1] == 1
by childid: replace diar2 = 1 ///
if diar2 == 0 & diar2[_n-1] == 1 & diar2[_n+2] == 1
by childid: replace diar2 = 1 ///
if diar2 == 0 & diar2[_n-2] == 1 & diar2[_n+1] == 1
by childid: gen byte spell_first = diar2 == 1 & diar2[_n-1] != 1
by childid: gen spell_id = cond(diar2 == 1, sum(spell_first), 0)
bysort childid spell_id: gen spell_length = cond(spell_id == 0, 0, _N)
by childid: egen no_gt14 = total(spell_first * (spell_length >= 14))
egen tag = tag(childid)
tab no_gt14 if tag
su no_gt14 if tag
-----------------------------

Nick
n.j.cox@durham.ac.uk

----------------------------- Shuaib #3
I tried this <...>

bysort childid (day): gen first3 = diar==1 & diar[_n-1]!=1 ///
& diar[_n-2]!=1 & diar[_n-3]!=1

Here I got the desired results: i.e. correct definition of an episode of
-diar-.

<reference to attachment deleted>

Now the next challenge is to define persistent diarrhoea
(a string of >= 14 consecutive -diar- days).

e.g. diar 00011111111111111000110011101111111111000000  would be 2
episodes of persistent diarrhoea (note the second episode has some
diarrhoea-free days, but these never amount to >= 3 consecutive days, so
it is still the same episode). This will require marking of the beginning of an episode
(in my case I have done this with -first3-) and last day of the episode
(I am not sure how to derive this). Once this is done, then we can
compute the duration between -first3- and last day (lastday).

Again, I have not thought about missing value for -diar- in any of these
definitions. For now I am assuming they are diarrhoea-free days.
------------------------------ end Shuaib #3

------------------------------ Nick Cox #1
This problem is easier than you think in that no use of looping
(-foreach- etc.) is needed. It is difficult in that there are
different possible reactions to missings on -v1-. This
post indicates one kind of solution.

You have panel data. You could -tsset- it without loss:

tsset childid day

That means that you could then use -tsspell- from SSC.
Alternatively, you can work from first principles.
I show the latter, but you might want to look at -tsspell- too.

On one definition, each episode of diarrhea (in English,
diarrhoea) starts when v1 is 1 and the preceding value is not 1:

bysort childid (day): gen first = v1 == 1 & v1[_n-1] != 1

-first- is an indicator variable. You can use it to define
episodes:

by childid : gen episodes = sum(first)

_or_

by childid : gen episodes = cond(v1 == 0, 0, sum(first))

You can record the start dates of each episode:

by childid : gen start = day if first
by childid : replace start = start[_n-1] if !first

The time since the previous start is then

by childid : gen time_since = start - start[_n-1] if first

and you are then interested in counting how many episodes
are not within three days of the previous:

by childid : egen n_episodes = total(first * (time_since > 3))

The first episode is always included on this definition.
-----------------------------

----------------------------- Shuaib #1
I have a data set of birth cohort data with longitudinal follow-up of
these children until they were 9 months old (270 days), unless they
were lost to follow-up or died before then.

The data structure looks like this:
childid (repeated group variable, daily visit to the clinic)
day (day of visit)
v1 (diarrhea on that day of visit)
v2 <-- this is the variable I would like to get (defined as diarrhea
episodes: a string of 1's separated by at least 3 consecutive
0's is an episode)

childid day v1  v2
1   1   .   .
1   2   .   .
1   3   .   .
1   4   .   .
2   1   0   1
2   2   1   1
2   3   1   1
2   4   0   1
3   1   1   2
3   2   1   2
3   3   0   2
3   4   0   2
3   5   0   2
3   6   1   2
3   7   1   2
4   1   1   1
4   2   .   1
4   3   1   1
4   4   0   1
4   5   .   1
4   6   0   1
4   7   0   1
5   1   0   1
5   2   1   1
5   3   0   1
5   4   1   1
6   1   0   0
6   2   0   0
6   3   0   0
6   4   0   0
6   5   0   0
6   6   0   0
6   7   0   0

Note:
1. childid= 4 is a bit tricky because of missing values; we
assume the episode to be one as there were not more than 3 days
separating 2 events.

2. childid= 1 has not had any visits recorded, so he gets
missing values for v2.

3. not everyone is followed up for the same period: loss to
follow-up, death, or completed the study (in my data set this should
happen when the child reaches 270 days from birth. This is a birth
cohort of 2500 children)

My problem is I am unable to manipulate the data in Stata to get me
the summary -v2- of the number of episodes of diarrhea per child by
total number of days observed.
<snip>
----------------------------------------

```
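
The steps in the post can be tried on toy data before running them on the real cohort. The sketch below is an illustration rather than part of the thread: it builds a single hypothetical child (`childid = 1`, 44 days) from the example string in Shuaib's message, treats the first day of a spell as a 1 whose previous value is not 1, and then lists one row per spell. The data, and the choice of a single child, are assumptions made purely for checking.

```stata
* Toy data: one hypothetical child, one row per day, built from the
* 44-character example string quoted in the thread.
clear
set obs 44
local s "00011111111111111000110011101111111111000000"
generate childid = 1
generate day = _n
generate diar = real(substr("`s'", _n, 1))

* Work on a copy, treating missing as 0.
clonevar diar2 = diar
replace diar2 = 0 if missing(diar)

* Fill gaps of one or two 0s that lie between 1s.
bysort childid (day): replace diar2 = 1 ///
    if diar2 == 0 & diar2[_n-1] == 1 & diar2[_n+1] == 1
by childid: replace diar2 = 1 ///
    if diar2 == 0 & diar2[_n-1] == 1 & diar2[_n+2] == 1
by childid: replace diar2 = 1 ///
    if diar2 == 0 & diar2[_n-2] == 1 & diar2[_n+1] == 1

* Mark spell starts, number the spells, and measure their lengths.
by childid: gen byte spell_first = diar2 == 1 & diar2[_n-1] != 1
by childid: gen spell_id = cond(diar2 == 1, sum(spell_first), 0)
bysort childid spell_id: gen spell_length = cond(spell_id == 0, 0, _N)
by childid: egen no_gt14 = total(spell_first * (spell_length >= 14))

* Inspect one row per spell, then the per-child count.
sort childid day
list childid day spell_id spell_length if spell_first
egen tag = tag(childid)
list childid no_gt14 if tag
```

If the steps behave as described, the count in `no_gt14` should agree with the two episodes of persistent diarrhoea that the quoted message counts for this string. Changing a few characters in the string (for example, widening a gap to three 0s) is a quick way to see how the spell boundaries respond.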