*! Date : 17 July 2000 (STB-57: sbe38) *! Version : 1.23 *! Author : Adrian Mander *! Email : adrian.mander@mrc-bsu.cam.ac.uk ********************************************************************* * 8/11/99 Now I want to implement some log-linear modelling within the * hapfreq3.ado algorithm. Basically the expansion of the dataset * is still carried out but instead of a simple calculating * algorithm I shall use the log-linear approach hence any model * can be specified ************************************************************************************* * 15/11/99 About to add the LDIM as the varlist to calculate the likelihood over. * This is important otherwise the algorithm will use the minimal continguency * table and when comparing tables you need the same dimension tables. The varlist * defines the table. Not sure about the missing data and cells :( ************************************************************************************* program define hapipf, rclass version 6.0 syntax varlist(min=2) [using/] [if] ,[ LDIM(varlist) IPF(string) START DISplay EXPect KNOWN PHASE(varname) ACC(real 0.001) IPFACC(real 0.000001) DEBUG(integer 0) NOLOG MODEL(integer 0) LRTEST(numlist) RARE(real 0) CONVARS(string) CONFILE(string) QUIET NOISE MV MVDEL] tokenize `varlist' cap which ipf if _rc~=0 { di in red "YOU MUST INSTALL ipf.ado which was first introduced in STB55" di "This function performs the log-linear modelling" exit } tempfile origin qui save `origin',replace if "`if'"~="" { keep `if'} global all_name = "`varlist'" global nloc 0 global nsub= _N *************************************************** * Delete all lines with missing marker information * for each marker in the varlist *************************************************** if "`mvdel'"~="" { di "You have selected the deletion of missing lines" di "-----------------------------------------------" while "`1'"~="" { qui count if `1'==. di "`r(N)' lines deleted due to variable `1'" qui drop if `1'==. mac shift 1 } } tokenize "`varlist'" ********************************************* * Pair the data from the varlist * and check that they are numeric ********************************************* local i 1 while "`1'"~="" { if "`2'"=="" { di in red "There must be paired data" } global nloc =$nloc+1 local wc1=2*$nloc-1 local wc2=`wc1'+1 local root1: word `wc1' of $all_name local root2: word `wc2' of $all_name cap confirm string variable `root1' if _rc==0 { di "`root1' cannot be a string variable" exit(7) } cap confirm string variable `root2' if _rc==0 { di "`root2' cannot be a string variable" exit(7) } if "`quiet'"=="" { di "Alleles at locus l$nloc are contained in variables (`root1' `root2')" } return local loci`i' = "`root1' `root2'" local i = `i'+1 mac shift 2 } di local nloc = $nloc if "`nloc'"~="1" { local haptext "Haplotype" local haptvar "Haplo" } else { local haptext "Allele" local haptvar "Allele" } ********************************** * Check out the constraint file ********************************** tempfile temp qui save `temp',replace if "`confile'"~="" { di di in blue "NB the degrees of freedom are wrong when using the constraint files" if "`convars'"=="" { di "You must specify the convars() option when using constraint files" } use `confile',replace tokenize "`convars'" cap confirm variable Efreqold if _rc~=0 { cap confirm variable Ifreq if _rc~=0 { di in red "Constraints file doesnt contain Efreqold or Ifreq!" exit(198) } else { rename Ifreq Efreqold } } sort `convars' save `confile',replace while "`1'"~="" { mac shift 1 } } ********************************************* * Make sure about including the ipf() option ********************************************* if "`ipf'"=="" { di in red "You must specify the loglinear model in the ipf() option " exit(198) } use `temp',replace ********************************************************************** * haplotyp expands the dataset into the phases * Takes the varlist and constructs the haplotypes * locus1 and locus2 * For phase unknown locus1 and locus2 contain all possible haplotypes ********************************************************************** * Additionally the user can specify that there are missing values * in the DNA typings with the MV option if "`known'"~="" & "`phase'"=="" { haplotyp, known `mv'} if "`known'"=="" & "`phase'"==""{ haplotyp, `mv'} if "`phase'"~="" { haplotyp,known phase(`phase') `mv'} *What is the next line? Probably gets rid of the chance of duplicates qui { drop `varlist' compress } *********************************************************** * Start frequencies f1 - future thing to do is to * implement a weighting system on the start points * * Note that if a subject has been expanded by phase then * the start frequencies are spread evenly by default *********************************************************** tempvar f1 tef1 if "`start'"=="" { sort subject qui by subject: gen double `f1'=1/_N } else { sort subject qui gen `f1'=uniform() qui by subject: gen double `tef1'= sum(`f1') qui by subject: replace `f1'= `f1'/`tef1'[_N] drop `tef1' } ********************************************* * EM algorithm ********************************************* local cont 1 local it 0 local saveimp 0 cap confirm new variable locus if _rc~=0 { di in red "Rename the variable locus" exit(111) } while(`cont'==1) { /*Start of while loop */ cap drop `pr' `prs' tempvar pr prs stupid plocus cap drop llh local it = `it'+1 ************************************************************ * # New estimate of haplotype relative frequencies ************************************************************ cap drop freq gen long `stupid' = _n /* this will preserve line numbers */ qui _stack locus1 locus2 qui rename locus1 locus qui drop locus2 * generate individual loci variables local i 1 local vlist " " qui gen str40 `plocus' = locus while `i'<=$nloc { tempvar len len2 qui gen `len' = index(`plocus',".")-1 qui gen l`i'=real(substr(`plocus',1,`len')) if `i'==$nloc { qui replace l`i'=real(`plocus') } else { qui gen `len2' = index(`plocus',".")+1 qui replace `plocus'=substr(`plocus',`len2',.) } drop `len' cap drop `len2' local vlist "`vlist' l`i'" local i=`i'+1 } if `debug'>0 { tab locus list l* if locus=="" } drop `plocus' ************************************************************ * Use the iterative proportional fitting algorithm and save * the expected frequencies in fit.dta * The variables present in the model are output into the * r(vlist) as this may contain more variables than loci ************************************************************ tempfile fit local file = substr("`fit'",1,index("`fit'",".")-1) if "`ldim'"=="" { if "`convars'"=="" { if "`noise'"=="" { qui ipf [fw=`f1'],exp fit(`ipf') save(`file') acc(`ipfacc') } else{ ipf [fw=`f1'],exp fit(`ipf') save(`file') acc(`ipfacc') } } else { qui ipf [fw=`f1'],exp convars(`convars') confile(`confile') fit(`ipf') save(`file') acc(`ipfacc') } local df=r(df) local nparms=r(parms) local ncells=r(ncells) sort `r(vlist)' global ipflist = "`r(vlist)'" merge `r(vlist)' using `file' /* merge in the new frequencies */ } else { qui ipf `ldim' [fw=`f1'], fit(`ipf') save(`file') acc(`ipfacc') local df=r(df) sort `ldim' global ipflist = "`ldim'" merge `ldim' using `file' /* merge in the new frequencies */ } ****************************************************** * Sometimes fit.dta has an unobserved line of ****************************************************** local rvlist "`r(vlist)'" qui count if locus=="" if `debug'>1 { l locus caco Efreq prob if locus=="" di "`r(N)' loci missing" } if `r(N)' > 0 { tempvar temprep gen `temprep' = locus=="" local i 1 while `i'<=$nloc { qui replace locus = cond( locus=="", locus + string(l`i'), locus +"."+ string(l`i')) if `temprep'==1 local i = `i'+1 } } if `debug'>2 { l locus caco Efreq prob } **************************************** * Saving the expected Frequencies. **************************************** if `saveimp'==1 { tempfile imputef now qui save `now' sort `rvlist' qui by `rvlist': keep if _n==1 qui save `imputef' use `now',clear } drop `vlist' Efreq Ofreq _merge sort _stack `stupid' qui _ustack locus prob, by(_stack) val(1 2) drop _stack merg _merge drop `stupid' cap drop npar qui gen npar = 1 local npar = npar[_N] ************************************************************* * # New probabilities (per phase and per subject) ************************************************************ qui gen double `pr' = prob1*prob2 drop prob1 prob2 ****************************************************************************** * # Calculate log likelihood * 1/11/99 I think in the next part within subject `by' shouldnt change * hence could ignore `by' UNLESS in each by strata individuals ids are * determined ****************************************************************************** cap confirm new variable llh llhh if _rc~= 0 { di in red "rename the variable below:" confirm new variable llh llhh } sort subject qui by subject : gen double `prs'=sum(`pr') qui by subject : gen double llh = log(`prs'[_N]) qui by subject : replace `pr' = `pr'/`prs'[_N] qui by subject :gen double llhh= cond(_n==_N,llh[_n],0) qui replace llhh = sum(llhh) local llh = llhh[_N] drop llhh if "`nolog'"=="" { di "Iteration `it' loglhd = `llh'" } ************************************** * Use the new weights for the ipf ************************************** qui replace `f1' = `pr' ************************************** * # Convergence test ************************************** if (`it'>1) { local cont = (`llh' - `lastllh')>`acc' if (`llh' < `lastllh') { di in red "likelihood not increased in hapipf" } } if `saveimp'==0 { if `cont'==0 { local saveimp 1 local cont 1 } } local lastllh = `llh' } /*end of while loop */ _stack locus1 locus2 `pr' `pr' rename locus1 locus ******************************************************************************** * In order to display the results I need the variables included * in the ipf model so that the expected frequencies are split * in groups of this as well ******************************************************************************** local i 1 local vlist "$ipflist" while `i'<=$nloc { local length= length("l`i'") local ind = index("`vlist'","l`i'") if `ind'>0 { local vlist = substr("`vlist'",1,`ind'-1)+substr("`vlist'",`ind'+`length',.) } local i=`i'+1 } sort locus `vlist' qui by locus `vlist' : gen double freq = sum(`pr') qui _unique subject qui gen double eprob=freq/(2*r(unique)) qui by locus `vlist' : keep if _n==_N sort `vlist' locus **************************************************************** * A routine to remove rare haplotypes but not zero haplotypes * and also remove the degrees of freedom **************************************************************** if `rare'~=0 { di "Removing rare haplotypes...." /* gen rare= eprob<`rare' count if rare==1 local nrare=r(N) local i 1 local slist "" qui gen str40 plocus = locus while `i'<=$nloc { qui gen len = index(plocus,".")-1 qui gen l`i'=real(substr(plocus,1,len)) if `i'==$nloc { qui replace l`i'=real(plocus) } else { qui gen len2 = index(plocus,".")+1 qui replace plocus=substr(plocus,len2,.) } drop len cap drop len2 local slist "`slist' l`i'" local i=`i'+1 } drop plocus gen Efreqold = cond(rare==1, 0 ,.) sort `slist' save constrain,replace */ gen rare= eprob<`rare' & eprob>0 count if rare==1 local nrare=r(N) drop if rare==1 list locus `vlist' freq eprob if rare==1, noobs } ********************************** * Display the testing expressions ********************************** if "`quiet'"=="" { di "" di "`haptext' Frequency Estimation by EM algorithm" di "----------------------------------------------" di " No. loci ", _col(20) "= $nloc" di " Log-Likelihood ", _col(20) "= `llh'" *di " Log-Likelihood under null ", _col(20) "= `llh0'" *di " 2*LogLikelihoodRatio ", _col(20) "=", 2*(`llh'-`llh0') di " Df ", _col(20) "=" , `df' di " No. parameters ", _col(20) "=" , `nparms' di " No. cells ", _col(20) "=" , `ncells' *di " No of rare parameters dropped ", _col(20) "=", `nrare' *di " p-value ", _col(20) "=", chiprob(`npar'-`npar0',2*(`llh'-`llh0')) di } ************************************************ * Should display the expected frequencies and * the imputed frequencies. ************************************************ if "`display'"~="" { sort locus `vlist' di in gr "Imputed Frequencies" qui gen str80 `haptvar'=locus qui compress `haptvar' list `haptvar' `vlist' freq eprob, noobs qui drop `haptvar' di in gr "Expected Frequencies" } qui use `imputef',clear sort locus `vlist' rename Efreq freq tempvar total gen double `total'=sum(freq) gen double eprob=freq/`total'[_N] sort locus `vlist' if "`display'"~="" { qui gen str80 `haptvar'=locus qui compress `haptvar' list `haptvar' `vlist' freq eprob, noobs qui drop `haptvar' * LOOK at the total frequency qui gen double totf = sum(freq) di di "TOTAL FREQ is ", totf[_N] } global loglik=`llh' keep locus `vlist' freq eprob if "`using'"~="" { * Must create the l1 l2 l3... variables from locus for the profile likelihood local i 1 qui gen str40 plocus = locus while `i'<=$nloc { qui gen len = index(plocus,".")-1 qui gen l`i'=real(substr(plocus,1,len)) if `i'==$nloc { qui replace l`i'=real(plocus) } else { qui gen len2 = index(plocus,".")+1 qui replace plocus=substr(plocus,len2,.) } drop len cap drop len2 local i=`i'+1 } drop plocus cap save `using', replace } **************************************************************** * Saving the model ipf string and the loglikelihood and * degrees of freedom **************************************************************** global hapmod`model'="`ipf'" global hapdf`model'="`df'" global hapllhd`model'="`llh'" if "`lrtest'"~="" { tokenize "`lrtest'" local m1 "hapmod`1'" local m2 "hapmod`2'" local l1 "hapllhd`1'" local l2 "hapllhd`2'" local d1 "hapdf`1'" local d2 "hapdf`2'" ****************************************** * Check whether these MACROS exist or not ****************************************** local error 0 if "$`l2'" == "" { di in red "GLOBAL: `l2' does not exist is there a model `2'?" local error 1 } if "$`l1'" == "" { di in red "GLOBAL: `l1' does not exist is there a model `1'?" local error 1 } if "$`m2'" == "" { di in red "GLOBAL: `m2' does not exist is there a model `2'?" local error 1 } if "$`m1'" == "" { di in red "GLOBAL: `m1' does not exist is there a model `1'?" local error 1 } if "$`d2'" == "" { di in red "GLOBAL: `d2' does not exist is there a model `2'?" local error 1 } if "$`d1'" == "" { di in red "GLOBAL: `d1' does not exist is there a model `1'?" local error 1 } if `error'~=1 { di di "Likelihood Ratio Test Comparing Model $`m2' to $`m1'" di "----------------------------------------------------" di " llhd2 (df2) =", $`l2', $`d2' di " llhd1 (df1) =", $`l1', $`d1' di di "-2*(llhd2-llhd1) =",-2*($`l2'-$`l1') di "Change in df = ", $`d2'-$`d1' local lrt = -2*($`l2'-$`l1') local chdf = $`d2'-$`d1' local pv=chiprob(`chdf',`lrt') di "p-value = ", chiprob(`chdf',`lrt') if `pv'<0.05 { di "Accept Model $`m1' at 5% significance level" } else { di "Accept Model $`m2' at 5% significance level" } return scalar lrtpv = `pv' return scalar lrtdf = `chdf' return scalar lrtchi = `lrt' } } use `origin',clear end ********************************************************* * Tabulate for both locus the expected frequency * The 3rd and fourth variables contain the counting * vectors ********************************************************* program define tabduo syntax varlist(min=2 max=100), [BY(string)] tokenize `varlist' if "`by'"=="" { stack `1' `3' `2' `4', into(temp one) clear sort temp qui by temp: gen freq=sum(one) qui by temp: keep if _n==_N } if "`by'"~="" { stack `1' `3' `by' `2' `4' `by', into(temp one `by') clear sort temp `by' qui by temp `by': gen freq=sum(one) qui by temp `by': keep if _n==_N } end ********************************************************* * Take the 2 haplotypes from each subject * and do some basic tabulating saving the resulting * tabulate in merge and merge2, these two represent * the different sorted variables ********************************************************** program define tabhap syntax varlist(min=2 max=100) tokenize `varlist' cap confirm new file merge.dta if _rc~=0 { di in red "merge.dta will be deleted within the program" } cap confirm new file merge2.dta if _rc~=0 { di in red "merge2.dta will be deleted within the program" } stack `1' `2', into(haplo) clear qui egen hapgrp = group(haplo) sort haplo qui by haplo: keep if _n==_N keep haplo hapgrp cap save merge,replace sort hapgrp cap save merge2,replace end ************************************************************************* * Calculate all the 2*($nloc-1) possible haplotypes * given the * observed data for each phase and person. * have locus1 and locus2 contain the string versions of the haplotypes * and the global unknown having the number of subjects with phase unknown ************************************************************************** * 8/11/99 - The strings are out and separate variables are in ************************************************************************** program define haplotyp syntax [varlist] [,NOISE KNOWN PHASE(string) MV] tokenize `varlist' cap confirm new variable locus1 locus2 subject if _rc~=0 { di in red "You must rename the following variable" confirm new variable locus1 locus2 subject } tempvar subj expand * NOTE that `subj' contains the actual subject ids and at the end it gets renamed * as subject qui gen long subject = _n qui gen long `subj' = _n if "`mv'"~="" { di "EXPANDING MISSING DATA....." local li 1 while `li'<=$nloc { local wc1=2*`li'-1 local wc2=`wc1'+1 local root1: word `wc1' of $all_name local root2: word `wc2' of $all_name qui count if `root1'==. local c1 = `r(N)' qui count if `root2'==. local cnt = `c1'+`r(N)' if `cnt'>0 { di "There are `cnt' missing values at locus `li'" qui tab `root1' , matrow(row) qui tab `root2' , matrow(col) * NEED to make one matrix with all the values at locus i called unique mat values = row \ col local j 1 while `j'<=rowsof(values) { if `j'==1 { mat unique = values[1,1] local j =`j'+1 } local not 0 local jj=1 while `jj'<`j' { if values[`j',1]==values[`jj',1] { local not 1 local jj=`j' } local jj=`jj'+1 } if `not'==0 { mat unique = unique \ values[`j',1] } local j = `j'+1 } *The matrix unique now contains just one copy of the alleles *Replace missing with one from each of unique *FOR TWO loci this would be the number of unique*(unique+1)/2 tempvar temp sort subject gen long `temp'=_n local missex = rowsof(unique) local missex = `missex'*(`missex'+1)/2 qui expand `missex' if `root1'==. | `root2'==. sort `temp' *Create the new phenotypes local lineno 1 local i 1 while `i'<=rowsof(unique) { local j = `i' while `j'<=rowsof(unique) { qui by `temp': replace `root1' = cond(`root1'==. & _n==`lineno', unique[`i',1],`root1') qui by `temp': replace `root2' = cond(`root2'==. & _n==`lineno', unique[`j',1],`root2') local lineno = `lineno'+1 local j = `j'+1 } local i =`i'+1 } } /* end of if dealing with missing */ local li =`li'+1 /*loop of loci */ } } di if "`known'"~="" & "`phase'"=="" { qui gen str40 locus1="" qui gen str40 locus2="" ******************************************* * Construct locus strings from alleles ******************************************* sort subject global unknown = 0 local i 1 while `i'<=$nloc { local wc1=2*`i'-1 local wc2=`wc1'+1 local root1: word `wc1' of $all_name local root2: word `wc2' of $all_name if `i'==1 { qui replace locus1 =string(`root1') qui replace locus2 =string(`root2') } else { qui replace locus1 = locus1+"."+string(`root1') qui replace locus2 = locus2+"."+string(`root2') } local i=`i'+1 } } if "`known'"=="" & "`phase'"=="" { local root1: word 1 of $all_name local root2: word 2 of $all_name qui gen str40 locus1=string(`root1') qui gen str40 locus2=string(`root2') qui gen `expand'=. ******************************************* * Construct locus strings from alleles and * also expand to all possibilities ******************************************* local i 2 while `i'<=$nloc { sort `subj' local wc1=2*`i'-1 local wc2=`wc1'+1 local root1: word `wc1' of $all_name local root2: word `wc2' of $all_name /* THIS BIT IS FOR NOT EXPANDING FOR WHOLE DATASET*/ qui replace `expand'= 2*(`root1'~=`root2') qui replace subject=_n qui expand `expand' sort subject qui by subject: replace locus1= locus1+"."+string(cond(_n==2,`root2',`root1')) qui by subject: replace locus2= locus2+"."+string(cond(_n==2,`root1',`root2')) local i=`i'+1 } qui drop subject `expand' qui rename `subj' subject } if "`phase'"~="" { local root1: word 1 of $all_name local root2: word 2 of $all_name qui gen str40 locus1=string(`root1') qui gen str40 locus2=string(`root2') qui gen `expand'=. cap confirm variable `phase' if _rc~=0 { di in red "variable `phase' does not exist! exit 111 } ******************************************* * Construct locus strings from alleles and * also expand to all possibilities ******************************************* local i 2 while `i'<=$nloc { sort `subj' local wc1=2*`i'-1 local wc2=`wc1'+1 local root1: word `wc1' of $all_name local root2: word `wc2' of $all_name /* THIS BIT IS FOR NOT EXPANDING FOR WHOLE DATASET*/ qui replace `expand'= cond(`phase'==0,2*(`root1'~=`root2'),1) qui replace subject=_n qui expand `expand' sort subject qui by subject: replace locus1= locus1+"."+string(cond(_n==2,`root2',`root1')) qui by subject: replace locus2= locus2+"."+string(cond(_n==2,`root1',`root2')) local i=`i'+1 } qui drop subject `expand' qui rename `subj' subject } qui compress end program define _stack syntax varlist(min=1) [using/] [if],[NOISE START KNOWN PHASE(string) BY(string) ACC(real 0.001) DEBUG] tokenize `varlist' gen _stack=1 tempfile stack qui save `stack' local i 1 while "``i''"~="" { local p=`i' local i = `i'+1 if "``p''"~="``i''" { drop ``p'' rename ``i'' ``p'' } local i=`i'+1 } qui replace _stack=2 append using `stack' end program define _ustack syntax varlist(min=1) [using/] [if],BY(string) VALues(string) [NOISE START KNOWN PHASE(string) ] tokenize `varlist' tokenize "`values'", parse(" ") tempfile first merge qui by `by': gen merg=_n save `first' keep if `by'==`1' tokenize "`varlist'" keep `varlist' merg while "`1'"~="" { rename `1' `1'1 mac shift 1 } sort merg save `merge' use `first' tokenize `values', parse(" ") keep if `by'==`2' tokenize `varlist' while "`1'"~="" { rename `1' `1'2 mac shift 1 } sort merg merge merg using `merge' end ******************************************* * This programs unique function ******************************************* program define _unique, rclass version 6.0 syntax varlist(min=1 max=1) tokenize "`varlist'" local var `1' preserve sort `var' qui by `var': keep if _n==1 return scalar N=_N global S_1 =_N qui drop if `var'==. global S_2 =_N return scalar unique=_N restore end