.-
help for ^hapipf^                                            (STB-57: sbe38)
.-

Haplotype frequency using an EM algorithm and log-linear modelling
------------------------------------------------------------------

    ^hapipf^ varlist [^using^ exp] [^if^ exp] [^,^ ^ldim(^varlist^)^ ^ipf(^str^)^
           ^start^ ^dis^play ^known^ ^phase(^varname^)^ ^acc(^#^)^ ^ipfacc(^#^)^ ^nolog^
           ^model(^#^)^ ^lrtest(^#,#^)^ ^convars(^str^)^ ^confile(^str^)^ ^mv^ ]


Description
-----------

This function calculates allele/haplotype frequencies using log-linear
modelling embedded within an EM algorithm. The EM algorithm handles the
phase uncertainty, and the log-linear modelling allows testing for linkage
disequilibrium and disease association. These tests can be controlled for
confounders using a stratified analysis specified by the log-linear model.
The log-linear model can also model the relationship between loci and hence
can group similar haplotypes.

The log-linear model is fitted using iterative proportional fitting, which
is implemented in the STB command ^ipf^ (the user will have to install this
function first). This algorithm can handle very large contingency tables
and converges to maximum likelihood estimates even when the likelihood is
badly behaved.

The varlist consists of paired variables representing the alleles at each
locus. If phase is known, then the pairs are the genotypes. When phase is
unknown the algorithm assumes Hardy-Weinberg equilibrium, so that models
are based on chromosomal data and not genotypic data.

This algorithm can handle missing alleles at the loci by using the ^mv^
option.


Options
-------

^mv^ specifies that the algorithm should replace missing data (".") with a
    copy of each of the possible alleles at this locus. This is performed
    at the same stage as the handling of the missing phase when the dataset
    is expanded into all possible observations. If this option is not
    specified but some of the alleles do contain missing data, the
    algorithm sees the symbol "." as another allele.

^ldim(^varlist^)^ specifies the variables that determine the dimension of the
    contingency table. By default the variables contained in the ^ipf^
    option define the dimension.

^ipf(^str^)^ specifies the log-linear model. It requires special syntax of
    the form ^l1*l2+l3^. Here ^l1*l2^ allows all the interactions between the
    first two loci, and locus 3 is independent of them. This syntax is
    used in most books on log-linear modelling.

^start^ specifies that the starting posterior weights of the EM algorithm
    are chosen at random.

^dis^play specifies that the expected and imputed haplotype frequencies are
    shown on the screen.

^known^ specifies that phase is known.

^phase(^varname^)^ specifies a variable that contains 1's where phase is
    known and 0's where phase is unknown.

^acc(^#^)^ specifies the convergence criterion based on the log-likelihood.

^ipfacc(^#^)^ specifies the convergence criterion for the ipf algorithm.

^nolog^ suppresses the display of the log-likelihood at each iteration.

^model(^#^)^ specifies a label for the log-linear model being fitted. This
    label is used in the ^lrtest()^ option.

^lrtest(^#,#^)^ performs a likelihood ratio test using two models that have
    been labelled in the ^model()^ option.

^convars(^str^)^ specifies a list of variables in the constraints file.

^confile(^str^)^ specifies the name of the constraints file.


Examples
--------

Take a dataset with 3 loci. The pairs of alleles at locus 1 are the
variables ass1 and ass2, the pairs of alleles at locus 2 are the variables
bss1 and bss2, and the pairs of alleles at locus 3 are the variables drss1
and drss2.

Note that the ^ipf()^ option requires a log-linear model written in terms of
lj, meaning locus j. The indicator variable for whether a person is a case
or a control is caco.

The test of whether the haplotypes are associated with disease is the
likelihood ratio test comparing the models l1*l2*l3*caco and l1*l2*l3+caco.
The following Stata commands perform this test.

    ^. hapipf ass1 ass2 bss1 bss2 drss1 drss2, ipf(l1*l2*l3*caco)^
           ^model(0) display^

    ^. hapipf ass1 ass2 bss1 bss2 drss1 drss2, ipf(l1*l2*l3+caco)^
           ^model(1) lrtest(0,1) display^


Author
------

    Adrian Mander
    MRC-Biostatistics Unit,
    Institute of Public Health,
    Forvie Site,
    Cambridge, UK

    Phone: (0)1223 330393
    Fax:   (0)1223 330388
    Email: adrian.mander@@mrc-bsu.cam.ac.uk
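
Further example (sketch)
------------------------

The Description states that the log-linear model can also be used to test
for linkage disequilibrium. The commands below are a minimal sketch of such
a test for the first two loci, following the same pattern as the disease
association test in the Examples section. The sketch assumes the same
variable names (ass1, ass2, bss1, bss2) and that ^ipf^ is already installed.
The choice of l1*l2 versus l1+l2 as the pair of models to compare is an
illustration based on the ^ipf()^ syntax described under Options, not
something taken from the original examples.

```stata
* Sketch only: assumes the allele pairs ass1/ass2 (locus 1) and
* bss1/bss2 (locus 2) from the Examples section, with ipf installed.

* Model 0: loci 1 and 2 are allowed to interact
hapipf ass1 ass2 bss1 bss2, ipf(l1*l2) model(0)

* Model 1: loci 1 and 2 are independent; compare with model 0
hapipf ass1 ass2 bss1 bss2, ipf(l1+l2) model(1) lrtest(0,1)
```

If some alleles are coded as missing ("."), the ^mv^ option can be added to
both commands so that "." is not treated as an extra allele.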