.- help for ^mfracpol^ (STB-49: sg81.1) .- Multivariable fractional polynomial models ------------------------------------------ ^mfracpol^ regression_cmd yvar xvarlist [^in^ range] [^if^ exp] [^weight^] [^,^ ^adj^ust^(^adj_list^)^ ^al^pha^(^alpha_list^)^ ^cat^zero^(^varlist^)^ ^cyc^les^(^#^)^ ^df(^df_list^)^ ^dfd^efault^(^#^)^ ^po^wers^(^numlist^)^ ^sel^ect^(^select_list^)^ ^seq^uential ^xo^rder^(+^|^-^|^n)^ ^xp^owers^(^xp_list^)^ ^zer^o^(^varlist^)^ regression_cmd_options ] where regression_cmd may be @cox@, @glm@, @logistic@, @logit@, @poisson@, or @regress@ and where adj_list is a comma-separated list with elements varlist^:^{^mean^|#|^no^} except that the first element may optionally be of the form {^mean^|#|^no^} to specify the default for all variables. ^mfracpol^ shares the features of all estimation commands; see help @est@. @fracplot@ may be used following ^mfracpol^ to show plots of fitted values and partial residuals. @fracpred@ may be used for prediction. All weight types supported by regression_cmd are allowed; see help @weights@. Description ----------- ^mfracpol^ selects the fractional polynomial (FP) model which best predicts the outcome variable, yvar, from the RHS variables, xvarlist. After execution, ^mfracpol^ leaves variables in the data named ^I^xvar^__1^, ^I^xvar^__2^, ..., where xvar represents the first four letters of the name of xvar1, and so on for xvar2, xvar3, ... the other members of xvarlist (if any). The new variables contain the best-fitting fractional polynomial powers of xvar1, xvar2, ... Options ------- ^adjust(^adj_list^)^ defines the adjustment for the covariates xvar1, xvar2, ..., xvarlist. The default is ^adjust(mean)^, except for binary covariates where it is ^adjust(^#^)^, # being the lower of the two distinct values of the covariate. A typical item in adj_list is varlist^:^{^mean^|#|^no^}. Items are separated by commas. The first item is special in that varlist^:^ is optional, and if omitted, the default is (re)set to the specified value (^mean^ or # or ^no^). For example, ^adjust(no, age:mean)^ sets the default to ^no^ and adjustment for ^age^ to ^mean^. ^alpha(^alpha_list^)^ sets the significance levels for testing between FP models of different degree. The rules for alpha_list are the same as for df_list in the ^df()^ option (see below). The default selection level (significance level, nominal P-value) is 0.05 for all variables. Example: ^alpha(0.01)^ [All variables have FP selection level 1%] Example: ^alpha(0.05, weight:0.1)^ [All variables except ^weight^ have FP selection level 5%, ^weight^ has level 10%] ^catzero(^varlist^)^: for details see "Zeroes and zero categories" under Remarks below. varlist must be a subset of xvarlist. ^cycles(^#^)^ is the maximum number of iteration cycles permitted. Default: 5. ^df(^df_list^)^ sets up the degrees of freedom (df) for each predictor. The df (not counting the regression constant, ^_cons^) are twice the degree of the FP, so for example an xvar fitted as a second-degree FP (m = 2) has 4 df. The first item in df_list may be either # or ^:^#. Subsequent items must be ^:^#. Items are separated by commas and is specified in the usual way for variables. With the first type of item, the df for all predictors are taken to be #. With the second type of item, all members of (which must be a subset of xvarlist) have # df. The default df for a predictor (specified in xvarlist but not in df_list) are assigned according to the number of distinct (unique) values of the predictor as follows: # of distinct values default df ----------------------------------------- 1 (invalid) 2-3 1 4-5 min(2,^dfdefault()^) >=6 ^dfdefault()^ ----------------------------------------- Example: ^df(4)^ [All variables have 4 df.] Example: ^df(2, weight displ:4)^ [^weight^ and ^displ^ have 4 df, all other variables have 2 df.] Example: ^df(weight displ:4, mpg:2)^ [^weight^ and ^displ^ have 4 df, ^mpg^ has 2 df, all other variables have the default of 1 df.] Example: ^df(weight displ:4, 2)^ [All variables have 2 df since the final 2 overrides the earlier 4.] ^dfdefault(^#^)^ determines the default maximum degrees of freedom (df) for a predictor. Default # is 4 (second degree FP). ^powers(^numlist^)^ is the set of fractional polynomial powers to be used. The default set is -2,-1,-0.5,0,0.5,1,2,3 (0 means log). ^select(^select_list^)^ sets the significance levels for variable selection. A variable is dropped if its removal causes a non-significant increase in deviance at the relevant selection level. The rules for select_list are the same as for df_list in the ^df()^ option (see above). Using the default selection level of 1 for all variables forces them all into the model. Setting selection level 1 for a given variable forces it into the model. Example: ^select(0.05)^ [All variables have selection level 5%] Example: ^select(0.05, weight:1)^ [All variables except ^weight^ have selection level 5%, ^weight^ is forced into the model] ^sequential^ chooses the sequential FP selection algorithm (see Remarks). ^xorder(+^|^-^|^n)^ determines the order of entry of the covariates into the model selection algorithm. The default is ^xorder(+)^ which enters them in decreasing order of significance in a multiple linear regression (most significant first). ^xorder(-)^ places them in reverse significance order whereas ^xorder(n)^ respects the original order in xvarlist. ^xpowers(^xp_list^)^ sets fractional polynomial powers for covariates individually. The rules for xp_list are the same as for df_list in the ^df()^ option. The default selection is as for the ^powers()^ option. Example: ^xpowers(-1 0 1)^ [All variables have powers -1,0,1] Example: ^xpowers(x5:-1 0 1)^ [All variables except ^x5^ have default powers, x5 has powers -1,0,1] ^zero(^varlist^)^ treats negative and zero values of members of varlist as zero when FP transformations are applied. By default, such variables are subjected to a preliminary linear transformation to avoid negative and zero values (see @fracpoly@). varlist must be a subset of xvarlist. graph_options are any of the standard Stata ^graph, twoway^ options. regression_cmd_options may be any of the options appropriate to regression_cmd, such as ^dead(^deadvar^)^ for ^cox^. Remarks on ^mfracpol^ ------------------- Estimation algorithm ==================== The estimation algorithm in ^mfracpol^ processes the xvars in turn. Initially, ^mfracpol^ silently arranges xvarlist in order of increasing P-value (i.e. of decreasing statistical significance) for omitting each predictor from the model comprising xvarlist with each term linear. The aim is to model relatively important variables before unimportant ones. This may help to reduce potential model-fitting difficulties caused by collinearity or, more generally, `concurvity' among the predictors. (See the ^xorder()^ option for details of how to change the ordering.) At the initial cycle, the best-fitting FP function for xvar1 (the first of xvarlist) is determined, with all the other variables assumed linear. Either the default or the alternative procedure is used (see Methods of FP Model Selection, below). The functional form (but NOT the estimated regression coefficients) for xvar1 is kept, and the process is repeated for xvar2, xvar3, etc. The first iteration concludes when all the variables have been processed in this way. The next cycle is similar, except that the functional forms from the initial cycle are retained for all variables excepting the one currently being processed. A variable whose functional form is prespecified to be linear (i.e. to have 1 df) is tested only for exclusion within the above procedure when its nominal P-value (selection level) according to ^select()^ is less than 1. Updating of FP functions and candidate variables continues until the functions and variables included in the overall model do not change (convergence). Convergence is usually achieved within 1-4 cycles. Methods of FP Model Selection ============================= ^mfracpol^ includes two algorithms for FP model selection. Both of them are forms of backward elimination. They start from a most complex permitted FP model and attempt to simplify it by reducing the degree. The default algorithm is inspired by the so-called "closed test procedure", a sequence of tests in each of which the "familywise error rate" or P-value is maintained at a prespecified nominal value such as 0.05. The "closed test" algorithm for choosing an FP model with maximum permitted degree m=2 (i.e. 4 df) for a single continuous predictor, x, is as follows: 1. Inclusion: test the FP in x for possible omission of x (4 df test, significance level determined by ^select()^). If x is significant, continue, otherwise drop x from the model. 2. Non-linearity: test the FP in x against a straight line in x (3 df test, significance level determined by ^alpha()^). If significant, continue, otherwise the chosen model is a straight line. 3. Simplification: test the FP with m=2 against the best FP with m=1 (2 df test at ^alpha()^ level). If significant, choose m=2, otherwise choose m=1. The first step is omitted if x is to be retained in the model, i.e. if its selection level, according to the ^select()^ option, is 1. All significance tests are carried out using an approximate P-value calculation based on a difference in deviances (-2 x log likelihood) having a chi-squared or F distribution, depending on the regression in use. Therefore, each of the tests in the procedure maintains a significance level only approximately equal to ^select()^. The algorithm is thus not truly a closed procedure. However, for a given significance level it does provide some protection against over-fitting, that is against choosing over-complex FP models. The alternative algorithm which is used if the ^sequential^ option is specified, is as follows: 1. Test the FP with m=2 against the best FP with m=1 (2 df test, ^alpha()^ level). If significant, choose m=2, otherwise continue. 2. Test the FP with m=1 against a straight line (1 df test, ^alpha()^ level). If significant, choose m=1, otherwise continue. 3. Test a straight line in x against omitting x (1 df test, ^select()^ level). If significant, choose a straight line, otherwise drop x from the model. The final step is omitted if x is to be retained in the model, i.e. if its selection level, according to the ^select()^ option, is 1. If x is uninfluential, the overall Type I error rate of this procedure is much greater than that of the "closed" procedure, which is close to the nominal value. On the other hand, its power to detect an m=2 relationship is typically higher. Zeroes and zero categories ========================== The ^zero()^ option permits fitting an FP model to the positive values of a covariate, taking non-positive values as zero. An application is the assessment of the effect of cigarette smoking as a risk factor in an epidemiological study. Non-smokers may be qualitatively different from smokers, so the effect of smoking (regarded as a continuous variable) may not be continuous between one and zero cigarettes. To allow for this the risk may be modelled as constant for the non-smokers and an FP function of the number of cigarettes for the smokers: . ^gen byte nonsmokr = cond(n_cigs==0, 1, 0) if n_cigs != .^ . ^mfracpol logit case n_cigs nonsmokr age, zero(n_cigs) df(4,nonsmokr:1)^ Omission of ^zero(n_cigs)^ would cause ^n_cigs^ to be transformed before analysis by the addition of a suitable constant, probably 1. A closely related approach involves the ^catzero()^ option. The command . ^mfracpol logit case n_cigs age, catzero(n_cigs)^ would achieve a similar result to the previous command, but with important differences. First, ^mfracpol^ would create the equivalent of the binary variable ^nonsmokr^ (now called ^n_cigs_0^) automatically and include it in the model. Second, the two smoking variables (^n_cigs^ and ^n_cigs_0^) would be linked and treated as a single predictor in the model. With the ^select^ option for variable selection active, they would be tested simultaneously for inclusion in the model. A degree of freedom would be allocated to ^n_cigs_0^ in addition to those allowed for fractional polynomial transformation of ^n_cigs^. Examples -------- . ^mfracpol regress mpg weight displ foreign^ . ^mfracpol regress mpg weight displ foreign, df(1, weight displ:4)^ . ^mfracpol regress mpg weight displ foreign, df(2, foreign:1)^ ^select(0.05, foreign:1) alpha(0.1)^ . ^fracplot displ^ Authors ------- Patrick Royston Imperial College School of Medicine, UK p.royston@@ic.ac.uk Gareth Ambler Imperial College School of Medicine, UK g.ambler@@ic.ac.uk Also see -------- STB: STB-49 sg81.1, STB-43 sg81 Manual: [R] ^fracpoly^ On-line: help for @clogit@, @cox@, @fit@, @glm@, @logistic@, @logit@, @poisson@, @regress@, @xtgee@.