{smcl}
{* *! version 1.0.0  30apr2008}{...}
{cmd:help haplologit} 
{hline}

{title:Title}

{p2colset 5 19 18 0}{...}
{p2col :{hi:haplologit} {hline 2}}Haplotype-effects logistic regression for case-control data{p_end}
{p2colreset}{...}


{title:Syntax}

{p 8 43 2}
{opt haplologit} {depvar} [{indepvars}] {ifin} {cmd:,} 
{bind:{cmdab:snp:vars:(}{varlist}{cmd:)} [{it:options}]}

{synoptset 32 tabbed}{...}
{marker options}{...}
{synopthdr}
{synoptline}
{syntab:Model}
{p2coldent :* {opth snp:vars(varlist)}}specify SNP variables{p_end}
{synopt :{opth inher:itance(haplologit##inheritance:inhmode)}}specify mode of
inheritance; default is {cmd:inheritance(additive)}{p_end}
{synopt :{opth riskhap:(haplologit##riskhap_spec:riskhap_spec)}}specify a 
single risk haplotype{p_end}
{synopt :{cmd:riskhap}{it:#}{cmd:(}{help haplologit##riskhap_spec:{it:riskhap_spec}}{cmd:)}}specify {it:#}th risk haplotype{p_end}
{synopt :{opt hft:hreshold(#)}}retain observations with initial haplotype
frequencies exceeding {it:#}; default is
{bind:max(2/N,0.001)}{p_end}
{synopt :{cmdab:const:raints(}{it:{help estimation options##constraints():numlist}}{cmd:)}}apply specified linear constraints on environmental factors {it:indepvars}{p_end}
{synopt:{opt col:linear}}keep collinear variables{p_end}
{synopt :{opt nocon:stant}}suppress constant term{p_end}


{syntab:Reporting}
{synopt :{opt l:evel(#)}}set confidence level; default is
{cmd:level(95)}{p_end}
{synopt :{opt or}}report odds ratios{p_end}
{synopt :{opt happrefix(string)}}use {it:string} as a prefix when labeling haplotypes in the output; default is {cmd:happrefix(hap_)}{p_end}
{synopt :{opt alldot:s}}show all iterations (except ml) as dots{p_end}
{synopt :{opt nocoe:f}}suppress coefficients table{p_end}
{synopt :{opt nofre:q}}suppress haplotype-frequency table{p_end}
{synopt :{opt nohead:er}}suppress output header{p_end}

{syntab:EM options}
{synopt :{opt emsamp:le}{cmd:(}{opt co:ntrols}|{opt ca:ses}|{opt al:l}{cmd:)}}obtain initial haplotype frequencies from the specified sample; default is {cmd:emsample(controls)}{p_end}
{synopt :{opt emiter:ate(#)}}number of EM iterations; default is 500{p_end}
{synopt :{opt emtol:erance(#)}}EM convergence tolerance; default is 1e-6{p_end}
{synopt :{opt eminit(matname)}}specify matrix containing starting values of haplotype frequencies for EM estimation{p_end}
{synopt :{opt sort}}sort haplotypes by frequencies in the EM haplotype-frequency table; default is to sort by haplotypes (in a binary order){p_end}
{synopt :{opt emlog}}show EM iteration log{p_end}
{synopt :{opt emdot:s}}show EM iterations as dots{p_end}
{synopt :{opt noemshow}}suppress output from EM estimation{p_end}
{synopt :{opt noemt:able}}suppress EM haplotype-frequency table{p_end}

{syntab:Max options}
{synopt :{it:{help haplologit##maximize_options:maximize_options}}}control the maximization process; seldom 
used{p_end}
{synoptline}
{p 4 6 2}
* {opt snpvars(varlist)} is required.{p_end}

{synoptset 23}{...}
{marker inheritance}{...}
{synopthdr :inhmode}
{synoptline}
{synopt :{opt a:dditive}}additive mode of inheritance; the default{p_end}
{synopt :{opt d:ominant}}dominant mode of inheritance{p_end}
{synopt :{opt r:ecessive}}recessive mode of inheritance{p_end}
{synoptline}

{marker riskhap_spec}{...}
{phang}
where {it:riskhap_spec} is 

{pmore2}
{it:riskhap_str}|{it:#} [, {it:riskhap_suboptions}]

{pmore}
{it:riskhap_str} specifies the binary representation of a risk haplotype
enclosed in quotes, and {it:#} specifies a risk haplotype index (the position
of the risk haplotype in the ordered sequence of 2^M possible haplotypes at
M SNP sites).

{synoptset 23}{...}
{synopthdr :riskhap_suboptions}
{synoptline}
{synopt :{opth inter:action(varlist)}}specify interaction variables{p_end}
{synopt :{opt nocon:stant}}suppress constant term; seldom used{p_end}
{synoptline}

{p2colreset}{...}


{title:Description}

{pstd}
{cmd:haplologit} estimates haplotype effects and haplotype-environment
interactions from case-control genetic (SNP-based) data for one of three types
of genetic (haplotype risk) models: additive, dominant, or recessive.  It fits
haplotype-effects logistic regression using the retrospective
profile-likelihood method in the special case of a rare disease and a single
candidate gene in Hardy-Weinberg equilibrium, under the assumption of
gene-environment independence.  {cmd:haplologit} handles phased, unphased, and
missing genotypes and allows specifying multiple risk haplotypes.


{title:Options}

{dlgtab:Model}

{phang}
{opt snpvars(varlist)} is required; it specifies SNP variables (variables
recording subjects' SNP genotypes). The SNP variables must contain values of
0, 1, 2, or missing (.).  A missing value (.) indicates missing information at
a SNP site; other values represent the number of copies of a mutant (minor)
allele at a SNP site in a subject's pair of homologous chromosomes.

{phang}
{opt inheritance(inhmode)} specifies a mode of inheritance (a genetic model).
The default is the {cmd:additive} risk model in which having two copies of a
risk haplotype in a pair of homologous chromosomes results in a two-fold
effect of the risk haplotype on a disease.  The {cmd:dominant} risk model
assumes that having one or two copies of a risk haplotype has the same effect
on a disease.  The {cmd:recessive} model assumes that only having two copies
of a risk haplotype has an effect on a disease.

{phang}
{opt riskhap(riskhap_spec)} requests that the effects of the specified risk
haplotype be included in the regression model.  {cmd:riskhap()} is a synonym
for {cmd:riskhap1()}.

{phang}
{cmd:riskhap}{it:#}{cmd:(}{it:riskhap_spec}{cmd:)} requests that the effects
of the {it:#}th risk haplotype be included in the regression model.  If
{opt interaction(varlist)} is specified in {cmd:riskhap}{it:#}{cmd:()}, the
interaction effects of the risk haplotype with the covariates specified in
{it:varlist} are included in the regression model in addition to the main
effect.  If {cmd:noconstant} is specified, the main haplotype effect is
omitted and only haplotype-environment interaction effects are included in the
model (seldom used).

{phang2}
{opt interaction(varlist)} specifies variables to be interacted with the
specified risk haplotype.

{phang2}
{cmd:noconstant} requests that the constant term (the main effect of a risk
haplotype) be omitted from the model (seldom used).  This option requires
{cmd:interaction()}.

{phang}
{opt hfthreshold(#)} specifies that only subjects' diplotypes whose
constituent haplotypes have initial frequencies exceeding {it:#} be retained
in the computation.  The default is {bind:max(2/N,0.001)}, where N is the
total number of cases and controls.
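
{pmore}
As a quick check of the default, with N=90 subjects (the size of the dataset
used in the Examples) the threshold evaluates to 2/90, roughly 0.022, because
2/N exceeds 0.001.  The second command is only a sketch of overriding the
default; it assumes the example dataset {cmd:cc_snp.dta} is in memory, and the
value 0.05 is an arbitrary illustration, not a recommendation.

	. {stata display max(2/90, 0.001)}
	. {stata haplologit status age, snpvars(snp1 snp2) hfthreshold(0.05)}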

{phang}
{opth constraints(numlist)},
{opt collinear}; see {helpb estimation options:[R] estimation options}.
{cmd:constraints()} may only be used to define linear constraints on
environmental covariates {it:indepvars}.

{phang}
{opt noconstant} suppresses the constant (intercept) term; seldom used.

{dlgtab:Reporting}

{phang}
{opt level(#)}; see {helpb estimation options##level():[R] estimation options}.

{phang}
{opt or} reports the estimated coefficients transformed to odds ratios, i.e.,
exp(b) rather than b.  Standard errors and confidence intervals are similarly
transformed.  This option affects how results are displayed, not how they are
estimated.  {cmd:or} may be specified at estimation or when replaying
previously estimated results.

{phang}
{opt happrefix(string)} uses the specified {it:string} as a prefix when
labeling haplotypes in the output (except for the EM output).  The default
prefix is {cmd:hap_}.

{phang}
{opt alldots} specifies that iterations from all (possibly time-consuming)
computations be shown as dots except for the ml iterations.  {cmd:alldots}
implies {cmd:emdots}.

{phang}
{opt nocoef} specifies that the coefficient table not be displayed.

{phang}
{opt nofreq} specifies that the haplotype-frequency table not be displayed.

{phang}
{opt noheader} suppresses the output header, either at estimation or upon
replay.

{dlgtab:EM options}

{phang}
{cmd:emsample(controls|cases|all)} requests that the initial haplotype
frequencies be estimated from the control sample, the case sample, or the
combined case-control sample.  The default is to use the control sample.

{phang}
{opt emiterate(#)} specifies the maximum number of EM iterations to perform.
The default is 500.

{phang}
{opt emtolerance(#)} specifies the convergence tolerance for the EM algorithm.
The default is 1e-6.  The EM algorithm terminates when the maximum relative
change in estimated haplotype frequencies between successive iterations is
less than {it:#}.

{phang}
{opt eminit(matname)} specifies the 1xL matrix {it:matname} containing
starting values of haplotype frequencies for EM estimation.  If M is the
number of SNP loci (SNP variables), then {bind:L=2^M-1}.  By default, all
haplotypes are assumed to be equally likely; that is, all haplotype frequencies
are set to 1/2^M.

{phang}
{opt sort} requests that haplotypes be displayed in descending order of
frequency in the EM haplotype-frequency table.  By default, haplotypes are
displayed according to their binary ordering.

{phang}
{opt emlog} specifies that the EM iteration log be shown.  The EM iteration
log is, by default, not displayed.

{phang}
{opt emdots} specifies that the EM iterations be shown as dots.  This option
can be convenient when the EM algorithm requires many iterations to converge.

{phang}
{opt noemshow} suppresses the output from the EM estimation.

{phang}
{opt noemtable} suppresses the EM haplotype-frequency table.
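
{pstd}
As an illustration only, several of the EM options above can be combined in
one call.  The sketch below assumes the example dataset {cmd:cc_snp.dta} is in
memory; the values 1000 and 1e-8 are arbitrary choices, not recommended
settings.

	. {stata haplologit status age, snpvars(snp1 snp2) emsample(all) emiterate(1000) emtolerance(1e-8) emlog}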

{marker maximize_options}{...}
{dlgtab:Max options}

{phang}
{it:maximize_options}:
{opt dif:ficult},
{opt iter:ate(#)},
[{cmdab:no:}]{opt lo:g},
{opt tr:ace},
{opt hess:ian},
{opt grad:ient},
{opt showstep},
{opt tol:erance(#)},
{opt ltol:erance(#)},
{opt gtol:erance(#)},
{opt nrtol:erance(#)},
{opt nonrtol:erance}, 
{opt shownr:tolerance};
see {help maximize}.

{pstd}
By default, convergence is declared when the {opt nrtolerance()} criterion and
either the {opt tolerance()} or the {opt ltolerance()} criterion have been met.
If {opt nonrtolerance} is specified, convergence is declared when either the
{opt tolerance()} or the {opt ltolerance()} criterion has been met.

{pstd}
If {opt gtolerance()} is specified, then the {opt gtolerance()} criterion must
be met in addition to any other required criteria for convergence to be
declared.  See {help maximize##tolerance:Tolerance options} for more
information.{p_end}

{hline}


{title:Remarks}

{pstd}
{cmd:haplologit} is designed for the analysis of haplotype-disease
associations in case-control data.  It utilizes the retrospective profile
likelihood approach of Spinka et al. (2005) and Lin and Zeng (2006) to
estimate haplotype and haplotype-environment effects.  This approach is more
efficient for the analysis of case-control data than the conventional
prospective methods.  The gain in efficiency is obtained from utilizing the
available information about genetic distribution (Hardy-Weinberg equilibrium,
independence of environmental factors) in the construction of the
retrospective likelihood of the data.  If only haplotype main effects are
specified (no environmental effects), no profiling of the likelihood is needed
and the retrospective likelihood approach of Epstein and Satten (2003) is
used.

{pstd}
Subjects' genetic information is described by a sequence of single nucleotide
polymorphisms (SNPs), known to be genetic markers, along the gene of interest.
Specifically, this information consists of pairs of subjects' {it:SNP}
{it:haplotype}s where a haplotype is a set of nearby SNPs on the same
chromosome.  In practice, the true haplotype pairs (diplotypes) are not
directly observed.  Instead, a {it:SNP} {it:genotype} (a combination of two
homologous SNP haplotypes) is observed.  The observed SNP genotype data are
supplied to {cmd:haplologit} as subjects' genetic information via the required
option {cmd:snpvars()}.  The genotype data must be recorded in the so-called
SNP variables (one variable for each SNP locus), containing values of 0, 1, 2,
and missing (.) only.  A missing value (.) indicates missing information at a
SNP site; other values record the number of copies of a mutant (minor) allele
at a SNP site in a subject's pair of homologous chromosomes.  Data must be in
the wide form, that is, a single observation per subject.
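
{pstd}
Before fitting, it may be worth confirming that the SNP variables follow this
coding.  The sketch below assumes the example dataset {cmd:cc_snp.dta} with
SNP variables {cmd:snp1} and {cmd:snp2}; each {cmd:assert} stops with an error
if a value other than 0, 1, 2, or missing (.) is found.

	. {stata tabulate snp1 snp2, missing}
	. {stata assert inlist(snp1, 0, 1, 2) | snp1 == .}
	. {stata assert inlist(snp2, 0, 1, 2) | snp2 == .}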

{pstd}
The effect of a single gene in Hardy-Weinberg equilibrium (HWE) on the disease is considered.  A risk
haplotype (or causal haplotype) is a target haplotype whose effect on a
disease is of interest.  The effects of risk haplotypes can be modeled
according to one of three genetic models, specified in option
{cmd:inheritance()} as the mode of inheritance: additive (the default),
dominant, or recessive.  Genetic covariates are viewed as functions of
subjects' SNP genotype data and risk haplotypes.  Specifically, they depend on
the number of copies of a risk haplotype present in the subject's diplotype.
Their functional forms are determined by the selected genetic model.  For
example, under the additive risk model, having two copies of a risk haplotype
in a subject's diplotype doubles the effect of this haplotype on a disease
compared to having only one copy.  In contrast, under the dominant risk model
having one or two copies has the same effect on the disease.  Under the
recessive model, only having two copies of a risk haplotype has an effect on
the disease.  {cmd:haplologit} uses genetic covariates indirectly in the
computation via the supplied information about SNP genotypes (option
{cmd:snpvars()}), the genetic model (option {cmd:inheritance()}), and the risk
haplotypes.
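
{pstd}
The three codings can be written out explicitly.  The lines below are an
illustration only: {cmd:ncopies} is a hypothetical variable holding the number
of copies (0, 1, or 2) of the risk haplotype in each subject's diplotype,
which {cmd:haplologit} never requires the user to create, and the variable
names {cmd:g_add}, {cmd:g_dom}, and {cmd:g_rec} are likewise made up.  The
command's internal parameterization may differ.

	. {cmd:generate byte g_add = ncopies}
	. {cmd:generate byte g_dom = (ncopies >= 1) if !missing(ncopies)}
	. {cmd:generate byte g_rec = (ncopies == 2) if !missing(ncopies)}

{pstd}
Under this sketch, 0, 1, and 2 copies map to 0, 1, 2 for the additive model,
to 0, 1, 1 for the dominant model, and to 0, 0, 1 for the recessive model.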

{pstd}
A risk haplotype may be specified as a string of zeros and ones (binary
representation) or as a haplotype index (the position of the risk haplotype
in the ordered sequence of 2^M possible haplotypes at M SNP sites).  Risk
haplotypes are specified in options {cmd:riskhap1()}, {cmd:riskhap2()}, and so
on.  By default, if no risk haplotypes are specified, {cmd:haplologit} uses
the most frequent haplotype estimated from the control sample (or the sample
specified in {cmd:emsample()}) as the risk haplotype.  Environmental
covariates may be specified as {it:indepvars} following the dependent variable
{it:depvar} in the above syntax.  The interaction effects of environmental
factors and risk haplotypes may be included by using {cmd:riskhap{it:#}()}'s
suboption {cmd:interaction()}.
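
{pstd}
Assuming the haplotypes are ordered by increasing binary value, as the
ordering "00", "01", "10", "11" in the Examples suggests, the index of a
haplotype is one plus the decimal value of its binary representation.  For
M=2, haplotype "01" then has index 2 and haplotype "10" has index 3:

	. {stata display 1 + 0*2^1 + 1*2^0}
	. {stata display 1 + 1*2^1 + 0*2^0}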

{pstd}
The distributional assumptions on genetic covariates are Hardy-Weinberg
equilibrium and independence from environmental covariates.  Environmental
covariates may be continuous or discrete, and their distribution is left
unspecified.

{pstd}
{cmd:haplologit}'s estimation process consists of three stages: (1) data
management, (2) initial estimation of haplotype frequencies, and (3)
estimation of haplotype and, optionally, environmental effects.  During the data
management stage {cmd:haplologit} performs data manipulations necessary for
handling unphased and missing SNP genotypes in the computation.  At the second
stage the initial haplotype frequencies are estimated from the sample
specified in {cmd:emsample()} by using the EM algorithm.  Only the haplotypes
with the estimated initial frequencies exceeding a default threshold (or an
alternate threshold specified in {cmd:hfthreshold()}) are retained for further
estimation.  This is necessary for numerical stability of the algorithm.  At
the third stage, the coefficients for environmental covariates, risk
haplotypes, and their interactions are estimated simultaneously with the
haplotype frequencies by Newton-Raphson.  The command displays results from
each of the three stages and, optionally, the progress of each stage.

{pstd}
The execution time of {cmd:haplologit} increases significantly with an
increased number of SNP loci and haplotype and environmental effects.  The
presence of many subjects with missing and/or unphased genotypes increases the
execution time as well.

{pstd}
For more details and methodology, see Marchenko et al. (2008).


{title:Examples}

{pstd}
The dataset {cmd:cc_snp.dta} contains fictional data on 90 individuals (45
cases and 45 controls) genotyped at 2 SNP sites.  The environmental covariates
are {cmd:age} and {cmd:gender}.  The disease indicator variable is
{cmd:status}.  Subjects' SNP genotypes are recorded in variables {cmd:snp1}
and {cmd:snp2}.

{p2colset 5 9 9 0}{...}

{p2col: 1.}{it:Additive main effects of the most frequent haplotype.} Use
{cmd:age} as the environmental covariate, {cmd:status} as the dependent
variable, and 2 SNP variables {cmd:snp1} and {cmd:snp2}.{p_end}

	. {stata "use cc_snp.dta"}
	. {stata haplologit status age, snpvars(snp1 snp2)}

{pmore}
From the output, the estimated frequencies of the four haplotypes "00", "01",
"10", and "11" from the control sample are 0.025, 0.386, 0.497, and 0.092,
respectively.  All haplotype frequencies exceed the default threshold of 0.022
and thus all haplotypes are used in the estimation.  The regression includes
main effects of {cmd:age} and the default risk haplotype "10" (the most
frequent haplotype in the control sample).  Although the obtained results are
not statistically significant, the estimated effect (log odds ratio) of
{cmd:age}, 0.019, suggests an increase in the risk of the disease with age,
whereas the presence of a risk haplotype "10" (coefficient for {cmd:hap_10} is
-0.552) decreases this risk.
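
{pmore}
The quoted coefficients can be converted to odds ratios by exponentiation,
which is what option {cmd:or} reports.  As a rough hand check using the
rounded values above, exp(0.019) is about 1.02 for {cmd:age} and exp(-0.552)
is about 0.58 for {cmd:hap_10}:

	. {stata display exp(0.019)}
	. {stata display exp(-0.552)}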

{pmore}
The left part of the header of the coefficient table reports general
information about the fitted model.  This includes the type of the genetic
model used in the computation (the default is additive), the assumed genotype
distribution of the data (Hardy-Weinberg equilibrium), and variable names
containing subjects' SNP genotypes ({cmd:snp1} and {cmd:snp2}).

{pmore}
The right part of the header provides general information about the data.
{cmd:Number of obs} displays the number of observations (subjects) used in the
computation.  Each row of the dataset corresponds to one subject.  In this
example all observations in the dataset (90) are used in the computation.
{cmd:haplologit} also reports the number of subjects with phased, unphased,
and incomplete (missing in at least one SNP variable) genotypes.  Among the 90
subjects used in the computation, 49 have phased genotypes, 34 have unphased
genotypes, and 7 have incomplete genotypes.

{pmore}
Under the rare-disease assumption the intercept parameter of the logistic
model (b0) is unidentifiable (see, for example, Spinka et al. (2005)); it is
confounded with the unknown marginal probability of a disease for a population
Pr(D=1)=p1.  {cmd:haplologit} reports the estimate of the "retrospective"
constant term in the output, {cmd:_cons}.  If p1 is known, the estimate of the
intercept b0 may be retrieved from the formula
{bind:{cmd:_cons}=b0+ln(N1/N0)-ln(p1/p0)} where N1 is the number of cases, N0
is the number of controls, and {bind:p0=1-p1}.
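
{pmore}
Rearranging the formula gives {bind:b0={cmd:_cons}-ln(N1/N0)+ln(p1/p0)}.  The
line below is a sketch with made-up inputs: a retrospective constant of -0.5
and a disease prevalence p1 of 0.01 are hypothetical values chosen for
illustration, while N1=N0=45 matches the example dataset.  With these inputs
b0 is approximately -5.10.

	. {stata display -0.5 - ln(45/45) + ln(0.01/(1-0.01))}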

{p2col:2a.}{it:Additive main effects of 01.} Here we investigate the main
effect of haplotype 01.  We specify the risk haplotype "01" in option
{cmd:riskhap()}.{p_end}

        . {stata haplologit status age, snpvars(snp1 snp2) riskhap("01")}

{pmore}
Alternatively, we can specify the haplotype index 2 instead of "01" in
{cmd:riskhap()} and obtain the same results.{p_end}
	
	. {stata haplologit status age, snpvars(snp1 snp2) riskhap(2)}

{p2col:2b.}{it:Reporting odds ratios.} We can obtain the above results
displayed as odds ratios by using option {cmd:or} on replay.{p_end}

        . {stata haplologit, or}

{p2col: 3.}{it:Additive main and interaction effects of 01 with {cmd:age}.} In
this example, following the syntax of 
{help haplologit##riskhap_spec:{it:riskhap_spec}}, we specify {cmd:inter(age)}
in option {cmd:riskhap()} to include the interaction effect of risk haplotype
01 with age.{p_end}

	. {stata haplologit status age, snpvars(snp1 snp2) riskhap("01", inter(age))}

{p2col: 4.}{it:Dominant main and interaction effects of 01 with {cmd:age}.}
Here we change the mode of inheritance from the default additive to dominant
by using option {cmd:inheritance(dominant)}.  In fact, we use its abbreviated
version {cmd:inher(d)}.{p_end}

	. {stata haplologit status age, snpvars(snp1 snp2) riskhap("01", inter(age)) inher(d)}

{p2col: 5. }{it:Joint additive main effects of 01 and 10.} We specify risk
haplotypes 01 and 10 in options {cmd:riskhap1()} and {cmd:riskhap2()},
respectively.{p_end}

	. {stata haplologit status age, snpvars(snp1 snp2) riskhap1("01") riskhap2("10")}

{p2col: 6. }{it:Joint additive main and interaction effects of 01 and 10 with}
{it:covariates} {cmd:age} {it:and} {cmd:gender}.  We add variable {cmd:gender}
to the list of independent variables and specify {cmd:inter(age gender)} in
options {cmd:riskhap1()} and {cmd:riskhap2()} to include their interaction
effects with the risk haplotypes.{p_end}

	{p 8 38 2}. {stata haplologit status age gender, snp(snp1 snp2) riskhap1("01", inter(age gender)) riskhap2("10", inter(age gender))}{p_end}
{p2colreset}{...}


{title:Saved results}

{pstd}
{cmd:haplologit} saves the following in {cmd:e()}:

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Scalars}{p_end}
{synopt:{cmd:e(N)}}number of observations (subjects used in the computation)
{p_end}
{synopt:{cmd:e(N_phased)}}number of subjects with phased genotypes{p_end}
{synopt:{cmd:e(N_unphased)}}number of subjects with unphased genotypes{p_end}
{synopt:{cmd:e(N_miss)}}number of subjects with incomplete genotypes (missing in at least one SNP variable){p_end}
{synopt:{cmd:e(ll)}}retrospective (profile) log-likelihood{p_end}
{synopt:{cmd:e(converged)}}{cmd:1} if converged, {cmd:0} otherwise{p_end}
{synopt:{cmd:e(df_m)}}model degrees of freedom{p_end}
{synopt:{cmd:e(chi2)}}chi-squared{p_end}
{synopt:{cmd:e(p)}}significance of model test{p_end}
{synopt:{cmd:e(em_N)}}number of observations at EM stage{p_end}
{synopt:{cmd:e(em_ll)}}EM log-likelihood{p_end}
{synopt:{cmd:e(cutoff)}}haplotype-frequency threshold{p_end}
{synopt:{cmd:e(rc)}}return code{p_end}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Macros}{p_end}
{synopt:{cmd:e(cmd)}}{cmd:haplologit}{p_end}
{synopt:{cmd:e(cmdline)}}command as typed{p_end}
{synopt:{cmd:e(depvar)}}name of dependent variable{p_end}
{synopt:{cmd:e(snpvars)}}names of SNP variables{p_end}
{synopt:{cmd:e(inheritance)}}mode of inheritance{p_end}
{synopt:{cmd:e(genepop)}}genetic distribution{p_end}
{synopt:{cmd:e(emsample)}}a sample used to obtain initial haplotype frequencies{p_end}
{synopt:{cmd:e(happrefix)}}prefix used to label haplotypes in the output{p_end}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Matrices}{p_end}
{synopt:{cmd:e(b)}}coefficient vector{p_end}
{synopt:{cmd:e(V)}}variance-covariance matrix of the estimators{p_end}
{synopt:{cmd:e(em_freq)}}initial haplotype frequency vector{p_end}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Functions}{p_end}
{synopt:{cmd:e(sample)}}marks estimation sample{p_end}
{p2colreset}{...}

{pstd}
For other results see {manhelp maximize R}.
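
{pstd}
After estimation, the saved results can be inspected with standard Stata
commands.  A minimal sketch, assuming {cmd:haplologit} has just been run; in
the first example above, the three counts in the second command sum to
{cmd:e(N)}, which is 90.

	. {stata ereturn list}
	. {stata display e(N_phased) + e(N_unphased) + e(N_miss)}
	. {stata matrix list e(em_freq)}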


{title:References}

{pstd}
Epstein, M. P., and G. A. Satten. 2003. Inference on haplotype effects in
case-control studies using unphased genotype data. {it:American} {it:Journal}
{it:of} {it:Human} {it:Genetics} 73: 1316-1329.

{pstd}
Lin, D. Y., and D. Zeng. 2006. Likelihood-based inference on haplotype effects
in genetic association studies (with discussion). {it:Journal} {it:of}
{it:the} {it:American} {it:Statistical} {it:Association} 101: 89-118.

{pstd}
Marchenko, Y. V., R. J. Carroll, D. Y. Lin, C. I. Amos, and R. G. Gutierrez.
2008. Semiparametric analysis of case-control genetic data in the presence of
environmental factors. {it:Stata Journal} ??: ??-??.

{pstd}
Spinka, C., R. J. Carroll, and N. Chatterjee. 2005. Analysis of case-control
studies of genetic and environmental factors with missing genetic information
and haplotype-phase ambiguity. {it:Genetic} {it:Epidemiology} 29: 108-127.