{smcl}
{* *! version 1.0.0 30apr2008}{...}
{cmd:help haplologit}
{hline}

{title:Title}

{p2colset 5 19 18 0}{...}
{p2col :{hi:haplologit} {hline 2}}Haplotype-effects logistic regression for case-control data{p_end}
{p2colreset}{...}


{title:Syntax}

{p 8 43 2}
{opt haplologit} {depvar} [{indepvars}] {ifin} {cmd:,}
{bind:{cmdab:snp:vars:(}{varlist}{cmd:)} [{it:options}]}

{synoptset 32 tabbed}{...}
{marker options}{...}
{synopthdr}
{synoptline}
{syntab:Model}
{p2coldent :* {opth snp:vars(varlist)}}specify SNP variables{p_end}
{synopt :{opth inher:itance(haplologit##inheritance:inhmode)}}specify mode of inheritance; default is {cmd:inheritance(additive)}{p_end}
{synopt :{opth riskhap:(haplologit##riskhap_spec:riskhap_spec)}}specify a single risk haplotype{p_end}
{synopt :{cmd:riskhap}{it:#}{cmd:(}{help haplologit##riskhap_spec:{it:riskhap_spec}}{cmd:)}}specify {it:#}th risk haplotype{p_end}
{synopt :{opt hft:hreshold(#)}}retain observations with initial haplotype frequencies exceeding {cmd:hfthreshold()}; default is {bind:max(2/N,0.001)}{p_end}
{synopt :{cmdab:const:raints(}{it:{help estimation options##constraints():numlist}}{cmd:)}}apply specified linear constraints on environmental factors {it:indepvars}{p_end}
{synopt:{opt col:linear}}keep collinear variables{p_end}
{synopt :{opt nocon:stant}}suppress constant term{p_end}

{syntab:Reporting}
{synopt :{opt l:evel(#)}}set confidence level; default is {cmd:level(95)}{p_end}
{synopt :{opt or}}report odds ratios{p_end}
{synopt :{opt happrefix(string)}}use {it:string} as a prefix when labeling haplotypes in the output; default is {cmd:happrefix(hap_)}{p_end}
{synopt :{opt alldot:s}}show all iterations (except ml) as dots{p_end}
{synopt :{opt nocoe:f}}suppress coefficients table{p_end}
{synopt :{opt nofre:q}}suppress haplotype-frequency table{p_end}
{synopt :{opt nohead:er}}suppress output header{p_end}

{syntab:EM options}
{synopt :{opt emsamp:le}{cmd:(}{opt co:ntrols|}{opt ca:ses}|{opt al:l}{cmd:)}}obtain initial haplotype
frequencies from the specified sample; default is {cmd:emsample(controls)}{p_end}
{synopt :{opt emiter:ate(#)}}number of EM iterations; default is 500{p_end}
{synopt :{opt emtol:erance(#)}}EM convergence tolerance; default is 1e-6{p_end}
{synopt :{opt eminit(matname)}}specify matrix containing starting values of haplotype frequencies for EM estimation{p_end}
{synopt :{opt sort}}sort haplotypes by frequencies in the EM haplotype-frequency table; default is to sort by haplotypes (in a binary order){p_end}
{synopt :{opt emlog}}show EM iteration log{p_end}
{synopt :{opt emdot:s}}show EM iterations as dots{p_end}
{synopt :{opt noemshow}}suppress output from EM estimation{p_end}
{synopt :{opt noemt:able}}suppress EM haplotype-frequency table{p_end}

{syntab:Max options}
{synopt :{it:{help haplologit##maximize_options:maximize_options}}}control the maximization process; seldom used{p_end}
{synoptline}
{p 4 6 2}
* {opt snpvars(varlist)} is required.{p_end}

{synoptset 23}{...}
{marker inheritance}{...}
{synopthdr :inhmode}
{synoptline}
{synopt :{opt a:dditive}}additive mode of inheritance; the default{p_end}
{synopt :{opt d:ominant}}dominant mode of inheritance{p_end}
{synopt :{opt r:ecessive}}recessive mode of inheritance{p_end}
{synoptline}

{marker riskhap_spec}{...}
{phang}
where {it:riskhap_spec} is

{pmore2}
{it:riskhap_str}|{it:#} [, {it:riskhap_suboptions}]

{pmore}
{it:riskhap_str} specifies the binary representation of a risk haplotype
enclosed in quotes or {it:#} specifies a risk haplotype index (position of a
risk haplotype in the ordered sequence of 2^M possible haplotypes at M SNP
sites).{p_end}
{synoptset 23}{...}
{synopthdr :riskhap_suboptions}
{synoptline}
{synopt :{opth inter:action(varlist)}}specify interaction variables{p_end}
{synopt :{opt nocon:stant}}suppress constant term; seldom used{p_end}
{synoptline}
{p2colreset}{...}


{title:Description}

{pstd}
{cmd:haplologit} estimates haplotype effects and haplotype-environment
interactions from case-control genetic (SNP-based) data for one of three types
of genetic (haplotype risk) models: additive, dominant, or recessive. It fits
haplotype-effects logistic regression using the retrospective
profile-likelihood method in the special case of a rare disease and a single
candidate gene in Hardy-Weinberg equilibrium, under the assumption of
gene-environment independence. {cmd:haplologit} handles phased, unphased, and
missing genotypes and allows specifying multiple risk haplotypes.


{title:Options}

{dlgtab:Model}

{phang}
{opt snpvars(varlist)} is required; it specifies SNP variables (variables
recording subjects' SNP genotypes). The SNP variables must contain values of
0, 1, 2, or missing (.). A missing value (.) indicates missing information at a
SNP site; other values represent the number of copies of a mutant (minor)
allele at a SNP site in a subject's pair of homologous chromosomes.

{phang}
{opt inheritance(inhmode)} specifies a mode of inheritance (a genetic model).
The default is the {cmd:additive} risk model in which having two copies of a
risk haplotype in a pair of homologous chromosomes results in a two-fold effect
of the risk haplotype on a disease. The {cmd:dominant} risk model assumes that
having one or two copies of a risk haplotype has the same effect on a disease.
The {cmd:recessive} model assumes that only having two copies of a risk
haplotype has an effect on a disease.

{phang}
{opt riskhap(riskhap_spec)} requests that the effects of the specified risk
haplotype be included in the regression model. {cmd:riskhap()} is a synonym
for {cmd:riskhap1()}.{p_end}
{phang}
{cmd:riskhap}{it:#}{cmd:(}{it:riskhap_spec}{cmd:)} requests that the effects of
the {it:#}th risk haplotype be included in the regression model. If
{opt interaction(varlist)} is specified in {cmd:riskhap}{it:#}{cmd:()}, the
respective interaction effects of the risk haplotype with the covariates
specified in {it:varlist} are included (in addition to the main effect) in the
regression model. If {cmd:noconstant} is used, the main haplotype effect is
omitted and only haplotype-environment interaction effects are included in the
model (seldom used).

{phang2}
{opt interaction(varlist)} specifies variables to be interacted with the
specified risk haplotype.

{phang2}
{cmd:noconstant} requests that the constant term (the main effect of a risk
haplotype) not be included in the model (seldom used). This option requires
{cmd:interaction()}.

{phang}
{opt hfthreshold(#)} specifies that only subjects' diplotypes whose constituent
haplotypes have initial frequencies exceeding {it:#} be retained in the
computation. The default is {bind:max(2/N,0.001)} where N is the total number
of cases and controls.

{phang}
{opth constraints(numlist)}, {opt collinear}; see
{helpb estimation options:[R] estimation options}. {cmd:constraints()} may
only be used to define linear constraints on environmental covariates
{it:indepvars}.

{phang}
{opt noconstant} suppresses the constant (intercept) term; seldom used.

{dlgtab:Reporting}

{phang}
{opt level(#)}; see {helpb estimation options##level():[R] estimation options}.

{phang}
{opt or} reports the estimated coefficients transformed to odds ratios, i.e.,
exp(b) rather than b. Standard errors and confidence intervals are similarly
transformed. This option affects how results are displayed, not how they are
estimated. {cmd:or} may be specified at estimation or when replaying
previously estimated results.

{phang}
{opt happrefix(string)} uses the specified {it:string} as a prefix when
labeling haplotypes in the output (except for the EM output).
The default prefix is {cmd:hap_}.

{phang}
{opt alldots} specifies that iterations from all (possibly time-consuming)
computations be shown as dots except for the ml iterations. {cmd:alldots}
implies {cmd:emdots}.

{phang}
{opt nocoef} specifies that the coefficient table not be displayed.

{phang}
{opt nofreq} specifies that the haplotype-frequency table not be displayed.

{phang}
{opt noheader} suppresses the output header, either at estimation or upon
replay.

{dlgtab:EM options}

{phang}
{cmd:emsample(controls|cases|all)} requests that the initial haplotype
frequencies be estimated from the control sample, case sample, or combined
case-control sample. The default is to use the control sample.

{phang}
{opt emiterate(#)} specifies the number of EM iterations to perform. The
default is 500.

{phang}
{opt emtolerance(#)} specifies the convergence tolerance for the EM algorithm.
The default is 1e-6. The EM algorithm terminates when the maximum relative
change in estimated haplotype frequencies between successive iterations is
less than {it:#}.

{phang}
{opt eminit(matname)} specifies the 1xL matrix {it:matname} containing starting
values of haplotype frequencies for EM estimation. If M is the number of SNP
loci (SNP variables), then {bind:L=2^M-1}. By default, all haplotypes are
assumed to be equally likely; that is, all haplotype frequencies are set to
1/2^M.

{phang}
{opt sort} requests that haplotypes be displayed in descending order of
frequencies in the EM haplotype-frequency table. By default, haplotypes are
displayed according to their binary ordering.

{phang}
{opt emlog} specifies that the EM iteration log be shown. The EM iteration log
is, by default, not displayed.

{phang}
{opt emdots} specifies that the EM iterations be shown as dots. This option
can be convenient when the EM algorithm requires many iterations to converge.

{phang}
{opt noemshow} suppresses the output from the EM estimation.

{phang}
{opt noemtable} suppresses the EM haplotype-frequency table.{p_end}
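
{pstd}
As an illustration of how the EM options above combine (a sketch only,
assuming that the example dataset {cmd:cc_snp.dta} described under Examples is
in memory; the values 1000 and 1e-8 are arbitrary choices for illustration, not
recommendations), the following call estimates the initial haplotype
frequencies from the combined case-control sample, allows more EM iterations
with a tighter tolerance, shows the EM iteration log, and sorts the EM
haplotype-frequency table by frequency:{p_end}

{phang2}{cmd:. haplologit status age, snpvars(snp1 snp2) emsample(all) emiterate(1000) emtolerance(1e-8) emlog sort}{p_end}
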
{marker maximize_options}{...}
{dlgtab:Max options}

{phang}
{it:maximize_options}: {opt dif:ficult}, {opt iter:ate(#)},
[{cmdab:no:}]{opt lo:g}, {opt tr:ace}, {opt hess:ian}, {opt grad:ient},
{opt showstep}, {opt tol:erance(#)}, {opt ltol:erance(#)},
{opt gtol:erance(#)}, {opt nrtol:erance(#)}, {opt nonrtol:erance},
{opt shownr:tolerance}; see {help maximize}.

{pstd}
By default, convergence is declared when the {opt nrtolerance()} criterion and
either the {opt tolerance()} or the {opt ltolerance()} criterion have been met.
If {opt nonrtolerance} is specified, then convergence is declared when either
the {opt tolerance()} or the {opt ltolerance()} criterion has been met.

{pstd}
If {opt gtolerance()} is specified, then the {opt gtolerance()} criterion must
be met in addition to any other required criteria for convergence to be
declared. See {help maximize##tolerance:Tolerance options} for more
information.{p_end}

{hline}

{title:Remarks}

{pstd}
{cmd:haplologit} is designed for the analysis of haplotype-disease associations
in case-control data. It uses the retrospective profile-likelihood approach of
Spinka et al. (2005) and Lin and Zeng (2006) to estimate haplotype and
haplotype-environment effects. This approach is more efficient for the
analysis of case-control data than the conventional prospective methods. The
gain in efficiency comes from using the available information about the genetic
distribution (Hardy-Weinberg equilibrium, independence of environmental
factors) in the construction of the retrospective likelihood of the data. If
only haplotype main effects are specified (no environmental effects), no
profiling of the likelihood is needed and the retrospective likelihood approach
of Epstein and Satten (2003) is used.

{pstd}
Subjects' genetic information is described by a sequence of single nucleotide
polymorphisms (SNPs), known to be genetic markers, along the gene of interest.
Specifically, this information consists of pairs of subjects' {it:SNP}
{it:haplotype}s where a haplotype is a set of nearby SNPs on the same
chromosome. In practice, the true haplotype pairs (diplotypes) are not
directly observed. Instead, a {it:SNP} {it:genotype} (a combination of two
homologous SNP haplotypes) is observed. The observed SNP genotype data are
supplied to {cmd:haplologit} as subjects' genetic information via the required
option {cmd:snpvars()}. The genotype data must be recorded in the so-called
SNP variables (one variable for each SNP locus), containing values of 0, 1, 2,
and missing (.) only. A missing value (.) indicates missing information at a
SNP site; other values record the number of copies of a mutant (minor) allele
at a SNP site in a subject's pair of homologous chromosomes. Data must be in
the wide form, that is, a single observation per subject.

{pstd}
The effect of a single gene in Hardy-Weinberg equilibrium (HWE) on the disease
is considered. A risk haplotype (or causal haplotype) is a target haplotype
whose effect on a disease is of interest. The effects of risk haplotypes can
be modeled according to one of three genetic models, specified in option
{cmd:inheritance()} as the mode of inheritance: additive (the default),
dominant, or recessive. Genetic covariates are viewed as functions of
subjects' SNP genotype data and risk haplotypes. Specifically, they depend on
the number of copies of a risk haplotype present in the subject's diplotype.
Their functional forms are determined by the selected genetic model. For
example, under the additive risk model, having two copies of a risk haplotype
in a subject's diplotype doubles the effect of this haplotype on a disease
compared to having only one copy. In contrast, under the dominant risk model,
having one or two copies has the same effect on the disease. Under the
recessive model, only having two copies of a risk haplotype has an effect on
the disease.
{cmd:haplologit} uses genetic covariates indirectly in the computation via the
supplied information about SNP genotypes (option {cmd:snpvars()}), the genetic
model (option {cmd:inheritance()}), and the risk haplotypes.

{pstd}
A risk haplotype may be specified as a string of zeros and ones (binary
representation) or as a haplotype index (position of a risk haplotype in the
ordered sequence of 2^M possible haplotypes at M SNP sites). Risk haplotypes
are specified in options {cmd:riskhap1()}, {cmd:riskhap2()}, and so on. By
default, if no risk haplotypes are specified, {cmd:haplologit} uses the most
frequent haplotype estimated from the control sample (or the sample specified
in {cmd:emsample()}) as the risk haplotype. Environmental covariates may be
specified as {it:indepvars} following the dependent variable {it:depvar} in the
above syntax. The interaction effects of environmental factors and risk
haplotypes may be included by using {cmd:riskhap{it:#}()}'s suboption
{cmd:interaction()}.

{pstd}
The distributional assumptions on genetic covariates are Hardy-Weinberg
equilibrium and independence from environmental covariates. Environmental
covariates can be both continuous and discrete and their distribution is left
unspecified.

{pstd}
{cmd:haplologit}'s estimation process consists of three stages: (1) data
management, (2) initial estimation of haplotype frequencies, and (3) the
estimation of haplotype and optionally environmental effects. During the data
management stage, {cmd:haplologit} performs data manipulations necessary for
handling unphased and missing SNP genotypes in the computation. At the second
stage, the initial haplotype frequencies are estimated from the sample
specified in {cmd:emsample()} by using the EM algorithm. Only the haplotypes
with the estimated initial frequencies exceeding a default threshold (or an
alternate threshold specified in {cmd:hfthreshold()}) are retained for further
estimation.
This is necessary for numerical stability of the algorithm. At the third
stage, the coefficients for environmental covariates, risk haplotypes, and
their interactions are estimated simultaneously with the haplotype frequencies
by Newton-Raphson. The command displays information from, and optionally the
progress of, each of the three stages.

{pstd}
The execution time of {cmd:haplologit} increases substantially with the number
of SNP loci and the number of haplotype and environmental effects. The
presence of many subjects with missing and/or unphased genotypes increases the
execution time as well.

{pstd}
For more details and methodology see Marchenko et al. (2008).


{title:Examples}

{pstd}
The dataset {cmd:cc_snp.dta} contains fictional data on 90 individuals (45
cases and 45 controls) genotyped at 2 SNP sites. The environmental covariates
are {cmd:age} and {cmd:gender}. The disease indicator variable is
{cmd:status}. Subjects' SNP genotypes are recorded in variables {cmd:snp1} and
{cmd:snp2}.

{p2colset 5 9 9 0}{...}
{p2col: 1.}{it:Additive main effects of the most frequent haplotype.} Use {cmd:age} as the environmental covariate, {cmd:status} as the dependent variable, and 2 SNP variables {cmd:snp1} and {cmd:snp2}.{p_end}

        . {stata "use cc_snp.dta"}
        . {stata haplologit status age, snpvars(snp1 snp2)}

{pmore}
From the output, the estimated frequencies of the four haplotypes "00", "01",
"10", and "11" from the control sample are 0.025, 0.386, 0.497, and 0.092,
respectively. All haplotype frequencies exceed the default threshold of 0.022
and thus all haplotypes are used in the estimation. The regression includes
main effects of {cmd:age} and the default risk haplotype "10" (the most
frequent haplotype in the control sample).
Although the obtained results are not statistically significant, the estimated
effect (log odds ratio) of {cmd:age}, 0.019, suggests an increase in the risk
of the disease with age, whereas the presence of the risk haplotype "10"
(coefficient for {cmd:hap_10} is -0.552) decreases this risk.

{pmore}
The left part of the header of the coefficient table reports general
information about the fitted model. This includes the type of the genetic
model used in the computation (the default is additive), the assumed genotype
distribution of the data (Hardy-Weinberg equilibrium), and variable names
containing subjects' SNP genotypes ({cmd:snp1} and {cmd:snp2}).

{pmore}
The right part of the header provides general information about the data.
{cmd:Number of obs} displays the number of observations (subjects) used in the
computation. Subjects are determined by the rows of the dataset. In this
example all observations in the dataset (90) are used in the computation.
{cmd:haplologit} also reports the number of subjects with phased, unphased, and
incomplete (missing in at least one SNP variable) genotypes. Among 90 subjects
used in the computation, 49 have phased genotypes, 34 have unphased genotypes,
and 7 have incomplete genotypes.

{pmore}
Under the rare-disease assumption the intercept parameter of the logistic model
(b0) is unidentifiable (see, for example, Spinka et al. (2005)); it is
confounded with the unknown marginal probability of a disease for a population
Pr(D=1)=p1. {cmd:haplologit} reports the estimate of the "retrospective"
constant term in the output, {cmd:_cons}. If p1 is known, the estimate of the
intercept b0 may be retrieved from the formula
{bind:{cmd:_cons}=b0+ln(N1/N0)-ln(p1/p0)} where N1 is the number of cases, N0 is
the number of controls, and {bind:p0=1-p1}.

{p2col:2a.}{it:Additive main effects of 01.} Here we investigate the main effect of haplotype 01. We specify the risk haplotype "01" in option {cmd:riskhap()}.{p_end}

        . {...}
{stata haplologit status age, snpvars(snp1 snp2) riskhap("01")}

{pmore}
Alternatively, we can specify the haplotype index 2 instead of "01" in
{cmd:riskhap()} and obtain the same results.{p_end}

        . {stata haplologit status age, snpvars(snp1 snp2) riskhap(2)}

{p2col:2b.}{it:Reporting odds ratios.} We can obtain the above results displayed as odds ratios by using option {cmd:or} on replay.{p_end}

        . {stata haplologit, or}

{p2col: 3.}{it:Additive main and interaction effects of 01 with {cmd:age}.} In this example, following the syntax of {help haplologit##riskhap_spec:{it:riskhap_spec}}, we specify {cmd:inter(age)} in option {cmd:riskhap()} to include the interaction effect of risk haplotype 01 with age.{p_end}

        . {stata haplologit status age, snpvars(snp1 snp2) riskhap("01", inter(age))}

{p2col: 4.}{it:Dominant main and interaction effects of 01 with {cmd:age}.} Here we change the mode of inheritance from the default additive to dominant by using option {cmd:inheritance(dominant)}. In fact, we use its abbreviated version {cmd:inher(d)}.{p_end}

        . {stata haplologit status age, snpvars(snp1 snp2) riskhap("01", inter(age)) inher(d)}

{p2col: 5. }{it:Joint additive main effects of 01 and 10.} We specify risk haplotypes 01 and 10 in options {cmd:riskhap1()} and {cmd:riskhap2()}, respectively.{p_end}

        . {stata haplologit status age, snpvars(snp1 snp2) riskhap1("01") riskhap2("10")}

{p2col: 6. }{it:Joint additive main and interaction effects of 01 and 10 with} {it:covariates} {cmd:age} {it:and} {cmd:gender}. We add variable {cmd:gender} to the list of independent variables and specify {cmd:inter(age gender)} in options {cmd:riskhap1()} and {cmd:riskhap2()} to include their interaction effects with the risk haplotypes.{p_end}

{p 8 38 2}.
{stata haplologit status age gender, snp(snp1 snp2) riskhap1("01", inter(age gender)) riskhap2("10", inter(age gender))}{p_end}
{p2colreset}{...}


{title:Saved results}

{pstd}
{cmd:haplologit} saves the following in {cmd:e()}:

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Scalars}{p_end}
{synopt:{cmd:e(N)}}number of observations (subjects used in the computation){p_end}
{synopt:{cmd:e(N_phased)}}number of subjects with phased genotypes{p_end}
{synopt:{cmd:e(N_unphased)}}number of subjects with unphased genotypes{p_end}
{synopt:{cmd:e(N_miss)}}number of subjects with incomplete genotypes (missing in at least one SNP variable){p_end}
{synopt:{cmd:e(ll)}}retrospective (profile) log-likelihood{p_end}
{synopt:{cmd:e(converged)}}{cmd:1} if converged, {cmd:0} otherwise{p_end}
{synopt:{cmd:e(df_m)}}model degrees of freedom{p_end}
{synopt:{cmd:e(chi2)}}chi-squared{p_end}
{synopt:{cmd:e(p)}}significance of model test{p_end}
{synopt:{cmd:e(em_N)}}number of observations at EM stage{p_end}
{synopt:{cmd:e(em_ll)}}EM log-likelihood{p_end}
{synopt:{cmd:e(cutoff)}}haplotype-frequency threshold{p_end}
{synopt:{cmd:e(rc)}}return code{p_end}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Macros}{p_end}
{synopt:{cmd:e(cmd)}}{cmd:haplologit}{p_end}
{synopt:{cmd:e(cmdline)}}command as typed{p_end}
{synopt:{cmd:e(depvar)}}name of dependent variable{p_end}
{synopt:{cmd:e(snpvars)}}names of SNP variables{p_end}
{synopt:{cmd:e(inheritance)}}mode of inheritance{p_end}
{synopt:{cmd:e(genepop)}}genetic distribution{p_end}
{synopt:{cmd:e(emsample)}}a sample used to obtain initial haplotype frequencies{p_end}
{synopt:{cmd:e(happrefix)}}prefix used to label haplotypes in the output{p_end}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Matrices}{p_end}
{synopt:{cmd:e(b)}}coefficient vector{p_end}
{synopt:{cmd:e(V)}}variance-covariance matrix of the estimators{p_end}
{synopt:{cmd:e(em_freq)}}initial haplotype frequency vector{p_end}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Functions}{p_end}
{synopt:{cmd:e(sample)}}marks estimation sample{p_end}
{p2colreset}{...}

{pstd}
For other results see {manhelp maximize R}.


{title:References}

{pstd}
Epstein, M. P., and G. A. Satten. 2003. Inference on haplotype effects in
case-control studies using unphased genotype data.
{it:American} {it:Journal} {it:of} {it:Human} {it:Genetics} 73: 1316-1329.

{pstd}
Lin, D. Y., and D. Zeng. 2006. Likelihood-based inference on haplotype effects
in genetic association studies (with discussion).
{it:Journal} {it:of} {it:the} {it:American} {it:Statistical} {it:Association}
101: 89-118.

{pstd}
Marchenko, Y. V., R. J. Carroll, D. Y. Lin, C. I. Amos, and R. G. Gutierrez.
2008. Semiparametric analysis of case-control genetic data in the presence of
environmental factors. {it:Stata Journal} ??: ??-??.

{pstd}
Spinka, C., R. J. Carroll, and N. Chatterjee. 2005. Analysis of case-control
studies of genetic and environmental factors with missing genetic information
and haplotype-phase ambiguity. {it:Genetic} {it:Epidemiology} 29: 108-127.{p_end}
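
{pstd}
As a supplementary note to example 1, the formula
{bind:{cmd:_cons}=b0+ln(N1/N0)-ln(p1/p0)} can be rearranged as
{bind:b0={cmd:_cons}-ln(N1/N0)+ln(p1/p0)}. The commands below are a minimal
sketch of that calculation, not part of {cmd:haplologit} itself. They assume
that {cmd:status} is coded 1 for cases and 0 for controls, that the
retrospective constant can be accessed as {cmd:_b[_cons]} after estimation
(check the coefficient names with {cmd:matrix list e(b)} if this fails), and
that the disease prevalence p1 is known; the value 0.01 is hypothetical and
used only for illustration.{p_end}

{phang2}{cmd:. haplologit status age, snpvars(snp1 snp2)}{p_end}
{phang2}{cmd:. * p1 is a hypothetical prevalence (an assumption, not an estimate)}{p_end}
{phang2}{cmd:. scalar p1 = 0.01}{p_end}
{phang2}{cmd:. quietly count if status == 1 & e(sample)}{p_end}
{phang2}{cmd:. scalar N1 = r(N)}{p_end}
{phang2}{cmd:. quietly count if status == 0 & e(sample)}{p_end}
{phang2}{cmd:. scalar N0 = r(N)}{p_end}
{phang2}{cmd:. display _b[_cons] - ln(N1/N0) + ln(p1/(1 - p1))}{p_end}

{pstd}
With equal numbers of cases and controls, as in {cmd:cc_snp.dta}, the term
ln(N1/N0) is zero and the adjustment reduces to adding ln(p1/p0).{p_end}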