# Re: st: RE: transformation of a continuous variable for a logistic regression model

From: Suzy
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: transformation of a continuous variable for a logistic regression model
Date: Wed, 19 Apr 2006 19:54:10 -0400

Nick - Not to beat a dead horse, but I just thought I'd share this with you, from:

Vincenzo Bagnardi, Antonella Zambon, Piero Quatto and Giovanni Corrao. Flexible Meta-Regression Functions for Modeling Aggregate Dose-Response Data, with an Application to Alcohol and Mortality. Am J Epidemiol 2004; 159:1077-1086.

"Although it is rather simple, the family of second-order fractional polynomial models offers considerable flexibility. In particular, by choosing $p_1$ and $p_2$ from a predefined set $P = \{-2, -1, -0.5, 0, 0.5, 1, 2, 3\}$, a very rich set of possible functions, including some so-called U-shaped and J-shaped relations, may be accommodated. The powers are expressed according to the Box-Tidwell transformation (12 <http://aje.oxfordjournals.org/cgi/content/full/159/11/1077#KWH142C12>), in which $x^{(p_i)}$ denotes $x^{p_i}$ if $p_i \neq 0$ and $\log x$ if $p_i = 0$. When $p_1 = p_2 = p$, the model becomes $\log(\mathrm{RR} \mid x) = \beta_1 x^p + \beta_2 (x^p \log x)$."

I thought that a second-order polynomial = "degree of 2" (M=2) = quadratic, as shown in my output from fracpoly below (M=2). I had also e-mailed the fracplot to show the quadratic curve, but for some reason it was deleted in transit. In any case, the age variable transformations (age_1 and age_2) from the fracgen command were calculated using the formulas above: $\beta_1\,\mathrm{age}^3 + \beta_2(\mathrm{age}^3 \log \mathrm{age})$.

Thus, I still respectfully do not understand why the fracpoly and boxtid results are not consistent for this variable. As for a theoretical justification of the functional form relating age to the response variable, it does make sense for these data.
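Part of the disagreement in this thread may be terminological. As a sketch in the usual fractional-polynomial notation (the symbols $\beta_0$, $\beta_1$, $\beta_2$ and $x$ are notation assumed here, not taken from the posts), a degree-2 (m = 2) fractional polynomial has two power terms, and only the powers (1, 2) give an ordinary quadratic:

```latex
\begin{align*}
\text{FP2, } p_1 \neq p_2:\quad & \beta_0 + \beta_1 x^{(p_1)} + \beta_2 x^{(p_2)} \\
\text{FP2, } p_1 = p_2 = p:\quad & \beta_0 + \beta_1 x^{(p)} + \beta_2 x^{(p)} \log x \\
\text{powers } (1, 2):\quad & \beta_0 + \beta_1 x + \beta_2 x^{2} \\
\text{powers } (3, 3):\quad & \beta_0 + \beta_1 x^{3} + \beta_2 x^{3} \log x
\end{align*}
```

Read this way, "m = 2 with powers 3 3" describes a cubic term plus a cubic-times-log term rather than a quadratic in age.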

Nick Cox wrote:

Sorry, but this to me is just a restatement of your previous posting, and addresses none of the points I raised.

That aside, I don't understand how a quadratic function can have powers 3 3. Cubics in my experience are never appropriate for global fits unless there are clear dimensional grounds for using them, which seems unlikely here.

Nick
n.j.cox@durham.ac.uk

Suzy wrote:

Thanks for your response, Nick. In a nutshell, age is not linear in the logit. I'm using the fracpoly command to identify the best functional form for age in the full model. The result returned from fracpoly was a quadratic function with powers 3 3 (which also looks good with fracplot). However, when I further assessed the model with the new age transformation using the boxtid command, the results were not favorable (the Ho was rejected). When I transformed another continuous variable in the same full logistic model (quadratic with powers 1 2 by fracpoly), the boxtid results were favorable, all graphs looked very good, and the diagnostics were good (linktest, etc.). I'm trying to understand why my results (fracpoly and boxtid) aren't consistent for the age variable, but are for all the other continuous variables.

Nick Cox wrote:

I am not clear what you think Statalist members know that can help you here. For example, the field in which you are working, what the response variable -dmcat- means, and what other predictors there may be are all hidden from view, so the chance of giving opinions drawing on substantive expertise is zero. Otherwise put, you appear to be assuming that the choices here can all be made on purely statistical criteria, an attitude which always worries me greatly.

What I have observed, as a kind of anthropologist of statistical science, is that age plays very different roles in different fields. Economists often seem to find that a quadratic in age does very nicely, whereas biostatisticians often seem to need more complicated representations, which seems perfectly plausible given the complexities of

Either way, -fracpoly- like other programs has no inbuilt sensor (or censor) selecting theoretically or scientifically sensible functional forms. So, I suggest that you plot the curve implied against age and think about it as something that needs justification or interpretation independently from the data.

Nick
n.j.cox@durham.ac.uk

Suzy wrote:

I am trying to transform one final continuous independent variable (age) in a logistic regression model. I've tried what I know of that is available in Stata. For example, I used the fracpoly command, and the best transformation was a second-order polynomial with powers 3 3.

Fractional polynomial model comparisons:
---------------------------------------------------------------
age            df    Deviance      Gain    P(term)   Powers
---------------------------------------------------------------
Not in model    0    2098.129       --       --
Linear          1    1834.224     0.000     0.000    1
m = 1           2    1805.957    28.267     0.000    -1
m = 2           4    1791.327    42.897     0.001    3 3
m = 3           6    1790.526    43.699     0.670    -2 3 3
m = 4           8    1788.431    45.793     0.351    -2 -2 3 3
---------------------------------------------------------------
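As a rough cross-check of the table above, the sketch below recomputes Gain and P(term) from the listed deviances. It assumes that Gain is the deviance reduction relative to the linear model and that P(term) is a chi-square test of each model against the one in the row above it; both are inferences from the numbers, not statements from the thread. Python is used only for illustration, and `chi2_sf` is a hypothetical helper.

```python
import math

# Deviances and model degrees of freedom copied from the fracpoly table above.
models = [
    ("Not in model", 0, 2098.129),
    ("Linear", 1, 1834.224),
    ("m = 1", 2, 1805.957),
    ("m = 2", 4, 1791.327),
    ("m = 3", 6, 1790.526),
    ("m = 4", 8, 1788.431),
]


def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


linear_deviance = models[1][2]
rows = []
# Pair each model with the one listed directly above it.
for (_, prev_df, prev_dev), (name, df, dev) in zip(models, models[1:]):
    gain = linear_deviance - dev                      # assumed definition of Gain
    p_term = chi2_sf(prev_dev - dev, df - prev_df)    # assumed definition of P(term)
    rows.append((name, round(gain, 3), round(p_term, 3)))

for name, gain, p_term in rows:
    print(f"{name:8s} gain={gain:7.3f}  P(term)={p_term:.3f}")
```

Under these assumptions the m = 2 row is the last step that improves the fit at conventional levels, which matches the choice of powers 3 3 reported in the post.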

I then used fracgen to generate the new age variables - age_1 and age_2.

fracgen age 3 3
-> gen double age_1 = X^3
-> gen double age_2 = X^3*ln(X)
(where: X = (age+1)/10)
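The two generated variables can be reproduced outside Stata. The following is a minimal sketch, assuming only the scaling X = (age+1)/10 echoed by fracgen above; the function name and its parameters are hypothetical and are not part of Stata.

```python
import math


def fp2_repeated_power(age, power=3, shift=1.0, scale=10.0):
    """Return (term_1, term_2) of an FP2 model with a repeated power.

    X = (age + shift) / scale mirrors the scaling echoed by fracgen.
    By the fractional-polynomial convention, power 0 means log(X).
    """
    x = (age + shift) / scale
    if x <= 0:
        raise ValueError("fractional polynomials need a positive X")
    base = math.log(x) if power == 0 else x ** power
    # A repeated power contributes the base term and the base term times log(X).
    return base, base * math.log(x)


for age in (9, 29, 49, 69):
    age_1, age_2 = fp2_repeated_power(age)
    print(age, age_1, age_2)
```

At age 9 the scaled value is X = 1, so age_1 = 1 and age_2 = 0, which is a quick check that the scaling has been applied as intended.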

The coefficients for age_1 and age_2 from the full logistic regression model:

------------------------------------------------------------------------------
Y var        | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
age_1        |   1.087994   .0093302     9.83   0.000      1.06986    1.106436
age_2        |   .9644247   .0037538    -9.31   0.000     .9570955    .9718101
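Following the suggestion earlier in the thread to look at the curve implied against age, here is a minimal sketch that converts the two odds ratios above to log-odds coefficients and traces the age part of the linear predictor. It assumes the scaling X = (age+1)/10 from the fracgen output and ignores every other predictor in the model; the variable and function names are illustrative only.

```python
import math

# Odds ratios copied from the logistic output above.
or_age_1 = 1.087994
or_age_2 = 0.9644247

# Coefficients on the log-odds scale.
b1 = math.log(or_age_1)
b2 = math.log(or_age_2)


def age_contribution(age):
    """Age part of the linear predictor: b1*X^3 + b2*X^3*ln(X), X = (age+1)/10."""
    x = (age + 1) / 10.0
    return b1 * x ** 3 + b2 * x ** 3 * math.log(x)


# The derivative with respect to X is X^2 * (3*b1 + b2 + 3*b2*ln X);
# setting the bracket to zero gives the turning point.
x_turn = math.exp(-(3 * b1 + b2) / (3 * b2))
age_turn = 10 * x_turn - 1

for age in range(20, 91, 10):
    print(age, round(age_contribution(age), 3))
print("turning point at age", round(age_turn, 1))
```

With these rounded odds ratios the implied log odds rise with age, peak a little above age 72, and then fall. Whether that shape is plausible for the outcome is a substantive question that the fit statistics alone do not answer.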

However the boxtid command rejected the null for both age_1 and age_2....

age_1 |  .0100805   .0007172   14.055   Nonlin. dev. 24.646 (P = 0.000)
p1    |  .0535714   .2122906    0.252
------------------------------------------------------------------------------
age_2 | -.0021756   .0004885   -4.453   Nonlin. dev.  7.894 (P = 0.005)
p1    |  3.864227   2.133377    1.811
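For reference, a sketch of what this output appears to report. It assumes that boxtid fits a Box-Tidwell power model in which each listed predictor $z$ enters as $\beta z^{p_1}$, and that the "Nonlin. dev." test compares that fit against $p_1 = 1$; this is an assumption about the command, not something stated in the thread.

```latex
\begin{align*}
\text{model for each listed predictor } z:\quad & \beta\, z^{p_1}, \qquad H_0: p_1 = 1 \\
z = \text{age\_1} = X^{3},\ \hat{p}_1 \approx 0.054:\quad & z^{\hat{p}_1} = X^{3 \times 0.054} \approx X^{0.16}
\end{align*}
```

Under that reading, the test asks whether each already-transformed variable enters linearly when its power is allowed to vary on its own. That is a different question from whether the pair (age_1, age_2) jointly describes the effect of age, so a rejection here need not contradict the fracpoly result.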

In all other respects, the preliminary diagnostics look good...

------------------------------------------------------------------------------
dmcat        |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_hat         |   .8900851   .1153855     7.71   0.000     .6639337    1.116236
_hatsq       |  -.0319886   .0307101    -1.04   0.298    -.0921793    .0282022
_cons        |  -.0450195   .1069617    -0.42   0.674    -.2546606    .1646215
------------------------------------------------------------------------------
lroc

Logistic model for dmcat

number of observations = 3354
area under ROC curve = 0.8647

etc...etc...etc...

My question is: should I be concerned with the results of the boxtid command? Is there something I've done incorrectly, or something else I can or should do?

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
