Statalist The Stata Listserver

Re: st: logistic regression

From   "Neil Shephard" <>
Subject   Re: st: logistic regression
Date   Tue, 13 Mar 2007 10:57:37 +0000

Hi Meena,

It's not surprising that you get different results when you use the
cluster option (see the Clayton & Hills reference for details of why
you get this difference), but in my opinion performing an allele-wise
analysis is completely meaningless.

Humans are diploid organisms, which means that they carry two copies
of each autosomal gene (one allele on each chromosome of a pair).  To
know the OR of an individual allele is meaningless since an individual
will carry either zero, one or two copies of a given allele.

A more appropriate way of thinking about this is in terms of the
genetic model that confers risk (i.e. in Mendel's terms, dominant,
recessive, additive or multiplicative).
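
As a rough illustration of how these models translate into a covariate,
here is a minimal sketch in Python (rather than Stata, purely for
illustration).  The names `CODINGS` and `code_genotype` are hypothetical,
and the sketch assumes genotypes are already coded as the number of
mutant alleles (0, 1 or 2).  Note that the 0/1/2 per-allele coding,
entered as a single term in a logistic regression, is additive on the
log-odds scale and therefore multiplicative on the OR scale.

```python
# Hypothetical helper: map a genotype (number of mutant alleles: 0, 1 or 2)
# to the covariate value implied by a given genetic model.
CODINGS = {
    "dominant":   (0, 1, 1),  # one mutant allele is enough to confer risk
    "recessive":  (0, 0, 1),  # two mutant alleles are needed
    "per_allele": (0, 1, 2),  # risk changes with each extra mutant allele
}

def code_genotype(n_mutant_alleles, model):
    """Return the covariate value for one genotype under the named model."""
    if n_mutant_alleles not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 mutant alleles")
    return CODINGS[model][n_mutant_alleles]

# Show the coding of genotypes 11, 12 and 22 under each model.
for model in CODINGS:
    print(model, [code_genotype(g, model) for g in (0, 1, 2)])
```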

When I analyse data I start by using a "robust" test such as the
Cochran-Armitage trend test (Sasieni 1997) to see if there is any
evidence of association at a locus (as often I am testing many
thousands of loci, and this is faster than performing a lot of
logistic regressions).  If there is evidence of association then I
go on to perform a logistic regression analysis of the SNP to obtain
ORs for each of the genotypes compared to the baseline (in this
instance I would take the common homozygote (11 coded as 0) as the
baseline and obtain ORs for the heterozygote (12 coded as 1) and
mutant homozygote (22 coded as 2)).  From this it is possible to
determine the mode of inheritance.  Under a dominant model the
heterozygote and mutant homozygote would have very similar ORs.
Under a recessive model the OR for the heterozygote would be 1 (or
thereabouts, with a CI that includes 1), whilst the mutant homozygote
would have a significant OR (with a CI that excludes 1).  Additive and
multiplicative models would see an increase in OR with the number of
mutant alleles carried.  After that I'd incorporate additional factors
into the model, such as environmental covariates that are known to
affect disease, to see if the effect of the locus is still present
after accounting for such factors.
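
The first two steps can be sketched as follows.  This is a minimal
Python illustration (not Stata) under stated assumptions: the genotype
counts are made up, the function names `trend_test` and `genotype_or`
are hypothetical, the trend test uses scores 0, 1, 2, and the CIs use
the simple Woolf (log OR) approximation.  With no other covariates in
the model, the genotype ORs from a logistic regression on genotype
indicators are the same as these cross-product ORs.

```python
import math

# Made-up genotype counts, ordered 11, 12, 22 (coded 0, 1, 2).
cases = [100, 80, 20]
controls = [120, 70, 10]

def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test; returns (chi-square on 1 df, p-value)."""
    totals = [r + s for r, s in zip(cases, controls)]
    N, R = sum(totals), sum(cases)
    S = N - R
    sum_tr = sum(t * r for t, r in zip(scores, cases))
    sum_tn = sum(t * n for t, n in zip(scores, totals))
    sum_t2n = sum(t * t * n for t, n in zip(scores, totals))
    numerator = N * (N * sum_tr - R * sum_tn) ** 2
    denominator = R * S * (N * sum_t2n - sum_tn ** 2)
    chi2 = numerator / denominator
    # Upper tail of a chi-square with 1 df, via the normal distribution.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

def genotype_or(cases, controls, g, z=1.96):
    """OR and Woolf CI for genotype g versus the baseline genotype 0.

    Assumes all four cells are non-zero.
    """
    a, b = cases[g], controls[g]
    c, d = cases[0], controls[0]
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(odds_ratio) - z * se)
    hi = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lo, hi

chi2, p = trend_test(cases, controls)
print(f"trend chi2 = {chi2:.3f}, p = {p:.4f}")
for g, label in ((1, "12 vs 11"), (2, "22 vs 11")):
    odds_ratio, lo, hi = genotype_or(cases, controls, g)
    print(f"{label}: OR = {odds_ratio:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Comparing the two ORs against the patterns described above (similar
ORs, a heterozygote OR near 1, or ORs rising with allele count) is what
suggests the mode of inheritance.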

If you really want to look at allelic association then I would simply
tabulate the number of alleles in cases and controls in a 2x2
contingency table and calculate an OR.  In this instance you don't
need to worry about matching as, providing your locus is in
Hardy-Weinberg equilibrium, it is safe to assume that the alleles that
each diploid individual is carrying are the result of random mating.
Thus when using the -logistic- command (if you want to account for
additional covariates when calculating your OR) you don't need to
worry about including the -cluster()- option because of the assumption
of random mating.  However, as I've hopefully demonstrated above, the
notion of risk being associated with an allele as opposed to a
genotype does not make biological sense and I think you'd be better
off investigating the risk associated with genotypes and determining
the mode of inheritance.
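
For completeness, the allelic 2x2 table described above can be sketched
like this.  Again this is a minimal Python illustration with made-up
genotype counts and a hypothetical helper name (`allele_counts`).  The
Woolf CI treats the two alleles of each person as independent
observations, which is exactly the Hardy-Weinberg / random mating
assumption.

```python
import math

# Made-up genotype counts, ordered 11, 12, 22.
cases = [100, 80, 20]
controls = [120, 70, 10]

def allele_counts(genotype_counts):
    """Return (allele 1 count, allele 2 count) from counts of 11, 12, 22."""
    n11, n12, n22 = genotype_counts
    return 2 * n11 + n12, n12 + 2 * n22

case_a1, case_a2 = allele_counts(cases)
ctrl_a1, ctrl_a2 = allele_counts(controls)

# 2x2 table: rows are cases/controls, columns are allele 2/allele 1.
# The OR is the odds of allele 2 in cases over the odds in controls.
allelic_or = (case_a2 * ctrl_a1) / (case_a1 * ctrl_a2)

# Woolf 95% CI; valid only if the alleles can be treated as independent.
se = math.sqrt(1 / case_a1 + 1 / case_a2 + 1 / ctrl_a1 + 1 / ctrl_a2)
lo = math.exp(math.log(allelic_or) - 1.96 * se)
hi = math.exp(math.log(allelic_or) + 1.96 * se)

print(f"cases:    allele 2 = {case_a2}, allele 1 = {case_a1}")
print(f"controls: allele 2 = {ctrl_a2}, allele 1 = {ctrl_a1}")
print(f"allelic OR = {allelic_or:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```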

Here are a few references that will hopefully be useful...

Author: Cordell, H. J.; Clayton, D. G.
Title: Genetic association studies.
Journal: Lancet
Date: 2005
Volume: 366
Number: 9491
Pages: 1121-31
[This paper is actually the third in an excellent Genetic Epidemiology
series in the Lancet, and I'd highly recommend the whole series if
you're embarking on work in the area]

Author: Lewis, C. M.
Title: Genetic association studies: design, analysis and interpretation.
Journal: Brief Bioinform
Date: 2002
Volume: 3
Number: 2
Pages: 146-53

Author: Cordell, H. J.; Clayton, D. G.
Title: A unified stepwise regression procedure for evaluating the
relative effects of polymorphisms within a gene using case/control or
family data: application to HLA in type 1 diabetes.
Journal: Am J Hum Genet
Date: 2002
Volume: 70
Number: 1
Pages: 124-41

Author: Sasieni, P. D.
Title: From genotypes to genes: doubling the sample size
Journal: Biometrics
Date: 1997
Volume: 53
Number: 4
Pages: 1253-1261

(There is of course a whole lot more out there to be found :-).


On 3/13/07, meena khan <> wrote:
Hi Neil,

I'm trying to do an allele-wise analysis; the genotype analysis was as you
said, where the genotype is recoded. It is not quite a matched case control:
there is not a case for each control where the outcome alternates like this:

               x              outcome
1             1                  1
1             2                  0
2             1                  1
2             1                  0
3             1                  1
3             2                  0

In my dataset, the outcome remains the same for both rows of an ID (as it is
the same person); each person just has 2 alleles, which vary from person to
person, and that is what I am testing, so:

ID          SNP              outcome
1             1                  1
1             2                  1
2             1                  0
2             1                  0
3             1                  0
3             2                  0

I get slightly different 95% confidence intervals when I do normal logistic
regression and when I introduce the cluster(ID) option, so there is a
difference between the 2 methods. Do you know why this is and which is the
better model to use? Thanks


"Every great advance in natural knowledge has involved the absolute
rejection of authority."  - Thomas H. Huxley
