.-
help for ^gam^                                                  (STB-42: sg79)
.-

Generalized additive models
---------------------------

    ^gam^ yvar xvars [weight] [^if^ exp] [^in^ range] [^, f^amily^(^familyname^)^
        ^l^ink^(^linkname^)^ ^df(^dflist^)^ [^no^]^cons^tant ^mi^ssing^(^string^)^
        ^de^ad^(^deadvar^)^ ^big^ ]

where familyname is one of

    ^gau^ssian | ^b^inomial | ^p^oisson | ^gam^ma | ^c^ox

and linkname is one of

    ^ide^ntity | ^log^ | ^l^ogit | ^inv^erse | ^c^ox.

^gam^ without arguments or options redisplays results from the most recent
^gam^ command.

^aweights^ and ^fweights^ are allowed.


Description
-----------

^gam^ fits a generalized or proportional hazards additive model (GAM) for yvar
as a function of xvars by maximizing a penalized log likelihood function. The
smoothness of the estimated function of each of the xvars is determined by the
`equivalent degrees of freedom' specified in the ^df()^ option. See Hastie and
Tibshirani (1990) for full details and examples of GAMs.


Options
-------

^family(^familyname^)^ specifies the distribution of yvar; ^family(gaussian)^
    is the default.

^link(^linkname^)^ specifies the link function. The default for each family
    is its canonical link: ^ide^ntity for ^gau^ssian, ^l^ogit for ^b^inomial,
    ^log^ for ^p^oisson, ^inv^erse for ^gam^ma and (by convention) ^c^ox for
    ^c^ox.

^df(^dflist^)^ sets the df for each predictor. The df may be fractional. An
    item in dflist may be either # or varlist^:^#. Items are separated by
    commas. varlist is specified in the usual way for variables. With the
    first type of item, the df for all predictors are taken to be #. With the
    second type of item, all members of varlist (which must be a subset of
    xvars) have # df. If an item of the second type follows one of the first
    type, the later # overrides the earlier # for each variable in varlist.

    Example: ^df(3)^.
             [All variables have 3 df.]
    Example: ^df(weight displ:4, mpg:2)^.
             [^weight^ and ^displ^ have 4 df, ^mpg^ has 2 df, all other
             variables have the default of 1 df.]
    Example: ^df(3, weight displ:4)^.
             [^weight^ and ^displ^ have 4 df, all other variables have 3 df.]
    Example: ^df(weight displ:4, 3)^.
             [All variables have 3 df, since the final # overrides the
             earlier items.]
    Default: 1 df for all predictors.

^dead(^deadvar^)^ applies only to Cox regression. deadvar is the censoring
    variable (0 for censored, 1 for `dead').

^big^ requests the large-problem version of the Fortran program, ^gambig.exe^
    (see Problem size below).

^noconstant^ specifies that the model has no intercept.

^missing(^#^)^ defines the missing value code seen by GAMFIT to be #, which
    must be a number. Default: 9999.


Remarks
-------

^gam^ creates the necessary input files for use by a version of the Fortran
program GAMFIT, written by Trevor Hastie & Robert Tibshirani, and runs the
program. The files are stored in the current Stata data directory. A single
output file called ^$.sum^ (containing linear coefficients and standard errors
for the model) is left behind in the current directory. This enables ^gam^ to
redisplay the results.

Note that ^gam^ omits any records containing missing values of yvar, xvars
and (for the ^cox^ family) deadvar. Also, it sorts the data in the order yvar
xvars.

For each predictor with df > 1, ^gam^ reports a statistic called the `Gain',
which is the difference in normalized deviance between the GAM and a model
with a linear term for that predictor. A large gain indicates marked
nonlinearity, at least in terms of statistical significance. The associated
^p^ value is based on a chi-square approximation to the distribution of the
gain if the true marginal relationship between that term and yvar were linear.
It should be regarded only as impressionistic, since the statistical inference
is approximate.

Note that the software may not provide exactly the number of df that was asked
for. The achieved df is shown in the table of results.

For Gaussian models and gamma models, the deviance is unscaled (i.e. a
residual sum of squares for Gaussian models).

We also note that (non-binary) predictors are standardized before analysis.
As a result, the estimate and standard error of the intercept will differ from
those produced by Stata commands such as ^logit^, ^cox^ and ^regress^.


New variables created
---------------------

^gam^ creates a new variable ^GAM_mu^ containing the fitted values on the
scale of the response variable. For xvars with more than 2 distinct values,
^gam^ creates three other variables, as follows:

    ^s_^xvar    smooth for xvar
    ^e_^xvar    pointwise standard errors of smooth for xvar
    ^r_^xvar    partial residuals for xvar

where xvar denotes one of the xvars. Each smooth has mean zero.

A pointwise 95% confidence band for each smooth may be calculated by adding
+/- 1.96 times its pointwise standard error to the smooth.


FORTRAN program and path considerations
---------------------------------------

By default, ^gam^ tries to run the program ^c:\ado\gamfit.exe^. You can direct
it to a different place by storing the directory in the global macro
^$GAMDIR^.

Example: suppose ^gamfit.exe^ is in directory d:\myprogs. You would type

    . ^global GAMDIR d:\myprogs\^

just once in each Stata run, before using ^gam^. To restore the default, you
would type ^global GAMDIR^.


Problem size
------------

The largest problem (i.e. data+model) that can be fit is 70000
single-precision real numbers (floats) in the standard version of gam
(^gamfit.exe^) and one million floats in the big version (^gambig.exe^). These
quantities represent the amount of storage space needed by the Fortran
program, not the amount of data stored in Stata. The problem size is
approximated by the following formula:

    floats = 1000 * N * (#V^^0.2)/25

where N is the number of observations in the problem and #V is the total
number of variables, including the constant and deadvar if a Cox model is fit.
For example, for a model with a constant and a single predictor (i.e. #V = 2)
the biggest problems that can be fit are N = 1523 and N = 21764 for the
standard and big versions respectively.
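
As a rough check before fitting, the formula above can be evaluated directly
in Stata with ^display^. This is only a sketch: the scalar names gam_N and
gam_V are hypothetical and chosen for this illustration, and the limits 70000
and 1000000 are the ones quoted above. Since the formula is itself an
approximation, the results are a guide rather than an exact bound.

```stata
* Sketch only: gam_N and gam_V are hypothetical names, not created by gam.
scalar gam_N = 1523     // number of observations in the problem
scalar gam_V = 2        // variables, counting the constant (and deadvar for Cox)

* Approximate storage needed: floats = 1000 * N * (V^0.2)/25
display "approx. floats needed: " 1000 * gam_N * (gam_V^0.2) / 25

* Approximate largest N under each limit, from the same formula solved for N
display "standard version, N about: " 70000 * 25 / (1000 * gam_V^0.2)
display "big version, N about:      " 1000000 * 25 / (1000 * gam_V^0.2)
```

If the first figure exceeds 70000 the ^big^ option would be needed, and if it
exceeds one million the problem is too large for either version.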


Warning
-------

We cannot vouch for the results from the Fortran software GAMFIT and have
occasionally noticed anomalies. However, we believe it to be reliable in the
vast majority of instances.

GAMFIT can fail to converge with ^cox^ regression and can occasionally cause
Stata to shut down without warning. We find that this problem can usually be
cured by changing the values of ^df()^ slightly.


Examples
--------

    . ^use auto^
    . ^gam mpg weight displ, df(weight:3, displ:4)^
    . ^gam foreign mpg, family(binomial) df(3)^
    . ^xi: gam foreign mpg i.rep78, family(binomial) df(mpg:2)^


Stored quantities
-----------------

The linear coefficient for each xvar and its standard error are stored in
global macros of the form ^S_^xvar and ^E_^xvar respectively. For the
regression constant (^_cons^) these are called ^S__cons^ and ^E__cons^.


Reference
---------

Hastie, T. J. and R. Tibshirani. 1990. Generalized Additive Models.
    London: Chapman and Hall.


Authors
-------

    Patrick Royston
    Royal Postgraduate Medical School, UK
    email: proyston@@rpms.ac.uk

    Gareth Ambler
    Royal Postgraduate Medical School, UK
    email: gambler@@rpms.ac.uk


See also
--------

    STB:  STB-42 sg79
On-line:  help for @glm@.