.-
help for ^gam^                                               (STB-42: sg79)
.-

Generalized additive models
---------------------------

    ^gam^ yvar xvars [weight] [^if^ exp] [^in^ range] [^, f^amily^(^familyname^)^
          ^l^ink^(^linkname^)^ ^df(^dflist^)^ [^no^]^cons^tant ^mi^ssing^(^string^)^
          ^de^ad^(^deadvar^)^ ^big^ ]

where familyname is one of 

	^gau^ssian  |  ^b^inomial   |  ^p^oisson  |  ^gam^ma   |   ^c^ox

and linkname is one of

	^ide^ntity  |  ^log^  |  ^l^ogit    |  ^inv^erse |   ^c^ox.


^gam^ without arguments or options redisplays results from the most recent
command.

^aweights^ and ^fweights^ are allowed.


Description
-----------

^gam^ fits a generalized or proportional hazards additive model (GAM) for yvar
as a function of xvars by maximizing a penalized log likelihood function. The
smoothness of the resulting estimated function of xvars is determined by the
`equivalent degrees of freedom' specified in the ^df()^ option.


See Hastie and Tibshirani (1990) for full details and examples of GAMs.


Options
-------

^family(^familyname^)^ specifies the distribution of yvar; ^family(gaussian)^ is
    the default.

^link(^linkname^)^ specifies the link function.  The default for each family is
    its canonical link: ^ide^ntity for ^gau^ssian, ^l^ogit for ^b^inomial,
    ^log^ for ^p^oisson, ^inv^erse for ^gam^ma and (by convention) ^c^ox for ^c^ox.

^df(^dflist^)^ sets the equivalent degrees of freedom (df) for each predictor.
    The df may be fractional. An item in dflist may be either # or
    <varlist>^:^#. Items are separated by commas. <varlist> is specified in the
    usual way for variables. With the first type of item, the df for every
    predictor is taken to be #. With the second type of item, all members of
    <varlist> (which must be a subset of xvars) have # df. Whenever two items
    apply to the same variable, the later # overrides the earlier #.

    Example: ^df(3)^.                        [All variables have 3 df.]

    Example: ^df(weight displ:4, mpg:2)^.    [^weight^ and ^displ^ have 4 df,
                                            ^mpg^ has 2 df, all other variables
                                            have the default of 1 df.]

    Example: ^df(3, weight displ:4)^.        [^weight^ and ^displ^ have 4 df,
                                            all other variables have 3 df.]
    
    Example: ^df(weight displ:4, 3)^.        [All variables have 3 df, since
                                            the final # overrides the earlier.]
    
    Default: 1 df for all predictors.

^dead(^deadvar^)^ applies only to Cox regression.  deadvar is the censoring
    variable (0 for censored, 1 for `dead').

^big^ asks for the large-problem version of gam, ^gambig.exe^ (see Problem
    size below).

^noconstant^ specifies that the model shall not have an intercept.

^missing(^#^)^ defines the missing value code seen by GAMFIT to be #, which must
    be a number. Default: 9999.
 

Remarks
-------

^gam^ creates the necessary input files for use by a version of the Fortran
program GAMFIT, written by Trevor Hastie and Robert Tibshirani, and runs the
program. The files are stored in the current Stata data directory. A single
output file called ^$.sum^ (containing linear coefficients and standard errors
for the model) is left behind in the current directory. This enables ^gam^ to
redisplay the results.

Note that ^gam^ omits any records containing missing values of yvar, xvars
and (for familyname = ^cox^) deadvar. Also, it sorts the data in the order
yvar xvars.

For each predictor with df > 1, ^gam^ reports a statistic called the `gain',
which is the difference in normalized deviance between the GAM and a model with
a linear term for that predictor. A large gain indicates substantial
nonlinearity, at least in terms of statistical significance. The associated
p-value is based on a chi-square approximation to the distribution of the gain
if the true marginal relationship between that term and yvar were linear. It
should be regarded only as a rough guide, since the inference is approximate.

Note that the software may not provide exactly the number of df that was asked
for. The achieved df is shown in the table of results.

For Gaussian models and gamma models, the deviance is unscaled (i.e. a residual
sum of squares for Gaussian models).

Non-binary predictors are standardized before analysis. As a result, the
estimate and standard error of the intercept will differ from those produced
by Stata commands such as ^logit^, ^cox^ and ^regress^.


New variables created
---------------------

^gam^ creates a new variable ^GAM_mu^ containing the fitted values on the
scale of the response variable. For xvars with more than 2 distinct values,
^gam^ creates three other variables, as follows:

	^s_^<xvarname> 	smooth for <xvarname>
	^e_^<xvarname>	pointwise standard errors of smooth for <xvarname>
	^r_^<xvarname>	partial residuals for <xvarname>

where <xvarname> denotes an xvar.

Each smooth has mean zero. A pointwise 95% confidence band for each smooth may
be calculated as the smooth plus or minus 1.96 times its standard error.
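
As an illustrative sketch (the variable names ^lo_weight^ and ^hi_weight^ are
our own, and it assumes ^weight^ was one of the xvars in the preceding ^gam^
command), the band for the smooth of ^weight^ could be computed by typing

 . ^generate lo_weight = s_weight - 1.96*e_weight^
 . ^generate hi_weight = s_weight + 1.96*e_weight^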


Fortran program and path considerations
---------------------------------------

By default, ^gam^ tries to run the program ^c:\ado\gamfit.exe^. You can
direct it to a different directory by storing that directory's path in the
global macro ^$GAMDIR^. Example: suppose ^gamfit.exe^ is in directory
d:\myprogs. You would type

 . ^global GAMDIR d:\myprogs\^

just once in each Stata run, before using ^gam^. To restore the default,
you would type ^global GAMDIR^.


Problem size
------------

The largest problem (i.e. data+model) that can be fit is 70000 single-precision
real numbers (floats) in the standard version of gam (^gamfit.exe^) and one
million floats in the big version (^gambig.exe^). These quantities represent the
amount of storage space needed by the Fortran program, not the amount of data
stored in Stata. The problem size is approximated by the following formula:

	floats = 1000 * N * (#V^^0.2)/25

where N is the number of observations in the problem and #V is the total number
of variables, including the constant and deadvar if a Cox model is fit. For
example, for a model with a constant and a single predictor (i.e. #V = 2), the
biggest problems that can be fit are N = 1523 and N = 21764 for the standard
and big versions respectively.
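
As a rough check of these figures, the formula can be solved for N and
evaluated with ^display^. The sketch below assumes #V = 2; the first line uses
the standard limit of 70000 floats, the second the big limit of one million:

 . ^display 70000*25/(1000*2^^0.2)^
 . ^display 1000000*25/(1000*2^^0.2)^

Rounding the two results to the nearest integer gives the values of N quoted
above.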


Warning
-------

We cannot vouch for the results from the Fortran software GAMFIT and have
occasionally noticed anomalies. However, we believe it to be reliable in the
vast majority of instances. GAMFIT can fail to converge with ^cox^ regression
and can occasionally cause Stata to shut down without warning. We find that
this problem can usually be cured by changing the values of ^df()^ slightly.


Examples
--------

   . ^use auto^
   . ^gam mpg weight displ, df(weight:3, displ:4)^
   . ^gam foreign mpg, family(binomial) df(3)^
   . ^xi: gam foreign mpg i.rep78, family(binomial) df(mpg:2)^
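
The auto data contain no survival outcome, so the following sketch of a Cox
model uses hypothetical variables: ^time^ (survival time), ^died^ (0 for
censored, 1 for dead) and ^age^ (a predictor):

   . ^gam time age, family(cox) dead(died) df(age:3)^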


Stored quantities
-----------------

The linear coefficient for each xvar and its standard error are stored in
global macros of the form ^S_^<xvarname> and ^E_^<xvarname> respectively. For
the regression constant (^_cons^) these are called ^S__cons^ and ^E__cons^.
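
For example, assuming ^weight^ was one of the xvars in the most recent ^gam^
command, its linear coefficient, its standard error and their ratio could be
displayed by typing

 . ^display $S_weight^
 . ^display $E_weight^
 . ^display $S_weight/$E_weight^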


Reference
---------

Hastie, T. J. and R. Tibshirani. 1990. Generalized Additive Models. 
London: Chapman and Hall.


Authors
-------

        Patrick Royston
        Royal Postgraduate Medical School, UK
        email: proyston@@rpms.ac.uk

        Gareth Ambler
        Royal Postgraduate Medical School, UK
        email: gambler@@rpms.ac.uk

See also
--------

    STB:  STB-42 sg79
On-line:  help for @glm@.