# Re: st: Gene-incidence question/simulation

- From: Austin Nichols
- To: statalist@hsphsun2.harvard.edu
- Subject: Re: st: Gene-incidence question/simulation
- Date: Sun, 22 Mar 2009 12:02:10 -0400

```
moleps islon <moleps2@gmail.com> :
Just to be clear: B causes Z and B causes A, but you don't observe B,
right? Let's ignore the survival model you are no doubt estimating,
and suppose you have gotten an estimate of P(Z|A)=.05 with a SE near
zero (a confidence interval of width zero).  Now you want to estimate
P(Z|B) and P(A|B), and you think P(Z|B) is near .65 and
P(Z|~B)=6/100000 (I assume "background incidence" is the probability
of Z given not B here; that may reflect my "background ignorance").

Let p=P(B) in the population, y=P(Z|B), x=P(A|B), and w=P(A|~B). Note
that ~B means "not B" or B==0. Then

P(Z|A)=P(Z|B)P(B|A)+P(Z|~B)P(~B|A)=[ypx+.00006(1-p)w]/[(1-p)w+px]

so even if you assume P(Z|A)=.05 and y=.65, you have 3 unknowns (p, x,
and w) and 1 equation; even if you know p, you have two unknowns w and
x, so the best you can hope for is to express P(A|B) as a linear
function of P(A|~B).  Solving the equation above for the ratio gives

w/x = p[y-P(Z|A)] / {(1-p)[P(Z|A)-.00006]}

For example, if p=.5 and y=.65 and P(Z|A)=.05 then w is 12
times as big as x (i.e. if Z is so rare in a sample of A, when B so
likely causes Z, it must be because A is much more likely when not B
than when B).  If p is 8% then w and x are roughly the same.  I
suggest you draw out a couple of trees with probabilities and check my
math.

If you want to estimate y and x, you are out of luck.  If you know w
and p with certainty, you can express y as a function of x and the
estimate of P(Z|A), so if you have estimates of P(Z|A) in memory, you
can use -lincom- to get estimates of y conditional on x, but how
plausible is it you would know w with certainty when you are trying to
estimate x and y?

I suppose you could use known p, estimates of P(Z|A) in memory, and
-lincom-, to get estimates of y conditional on x and w, then present a
table of point estimates and confidence intervals for various values
of x and w.  Or get estimates of x conditional on y and w, or what
have you.  But you still have to assume you know p with certainty, or
the dimension of that table gets out of control...

I have been assuming that P(Z|A) is what you are estimating, but you
really have a competing risk model, I am guessing, modeling the hazard
of getting Z before death or censoring by some other process. So you
need to redefine Z to be not "gets condition Z" but  "gets condition Z
in my observation period" to use any of the above, which is probably
unpalatable.  Plus, I don't know if I've translated your description
into probabilities correctly--the jargon of genetics is unfamiliar to
me (and many other list members--you should translate to the common
language of statistics).

On Sun, Mar 22, 2009 at 10:37 AM, moleps islon <moleps2@gmail.com> wrote:
> Dear statalisters,
> I'm studying a tumor A that has a probability (x) of being linked to
> a genetic mutation (B) that also predisposes (penetrance approx 65% (y)
> by 70 years) to condition Z. Now I've got 217 cases of A that resulted
> in 11 cases of Z over 8534 years of follow-up (among the 217
> cases). I need to determine the number of patients with B given that
> there is also a background incidence of 6/100000 for Z. We know that
> x<<y. Besides running a simulation, is there a more analytical way of
> estimating x and y given my data?
>
> Best wishes,
> Moleps
```
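The reply invites readers to check the math, so here is a minimal Python sketch (not part of the original thread) that does so numerically. It assumes the mixture formula exactly as written in the reply, with the background incidence fixed at 6/100000. The function names `p_z_given_a`, `ratio_w_over_x`, and `y_given_x_w` are hypothetical, chosen only for illustration.

```python
# Numerical check of the algebra in the reply (illustrative sketch only).
# Notation follows the reply: p=P(B), y=P(Z|B), x=P(A|B), w=P(A|~B),
# q=P(Z|A), and b=P(Z|~B) is the assumed background incidence.

B = 6 / 100000  # background incidence P(Z|~B), as assumed in the reply


def p_z_given_a(p, y, x, w, b=B):
    """P(Z|A) = [y*p*x + b*(1-p)*w] / [(1-p)*w + p*x]."""
    return (y * p * x + b * (1 - p) * w) / ((1 - p) * w + p * x)


def ratio_w_over_x(p, y, q, b=B):
    """Solve P(Z|A) = q for the ratio w/x, given p and y."""
    return p * (y - q) / ((1 - p) * (q - b))


def y_given_x_w(p, x, w, q, b=B):
    """Solve P(Z|A) = q for y, given p, x and w (linear in q)."""
    return (q * ((1 - p) * w + p * x) - b * (1 - p) * w) / (p * x)


if __name__ == "__main__":
    # The two cases mentioned in the reply: p = .5 and p = 8%.
    for p in (0.5, 0.08):
        r = ratio_w_over_x(p, y=0.65, q=0.05)
        print(f"p={p}: w/x = {r:.2f}")
```

Under these assumptions the ratio w/x comes out close to 12 for p=.5 and close to 1 for p=.08, in line with the figures quoted in the reply. Looping `y_given_x_w` over a grid of x and w values would give the point estimates for the kind of table the reply describes, though not the confidence intervals.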