


Re: st: variable selection

From   Nick Cox
Subject   Re: st: variable selection
Date   Wed, 26 Feb 2014 09:47:41 +0000

This is a re-post. Do look at advice in the FAQ on re-posting. One
re-post is fine.

It's hard to advise here. We can't see your data and we can't even
grasp your problem fully. Perhaps this is why no-one replied to the
first posting. The Delphic oracle was notorious for cryptic
predictions given unanswerable questions, and not much progress has
been made on that in three thousand years or so.

I don't get from this what your outcome variable is, but

1. Is the relationship between income and your outcome really expected
to be quite different in different regions?

2. I'd be wary of reading too much into small quirks in plots against
income. For example, a linear spline could make sense if you think
there is some threshold to behaviour at a particular income, but if
the evidence for kinks is not consistent, it is probably illusory.
Using the data to guide the analysis is almost as dangerous as
ignoring them (Frank Harrell says something similar in _Regression
modeling strategies_, Springer, New York, 2001).

3. Categorising income would be throwing away information. It's
perhaps more likely that something like log of income will be easier
to work with.
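
To make points 2 and 3 concrete, here is a minimal sketch of the two
candidate representations of income: a one-knot linear spline and a log
transform. It is an illustration only, not a recommended analysis. The
knot at 45 is taken from the original post; the function names and the
sample incomes are hypothetical. In Stata, -mkspline- with the marginal
option builds comparable spline terms.

```python
import math


def linear_spline_terms(income, knot):
    """Two terms of a one-knot linear spline basis.

    The first term is income itself; the second is zero up to the knot
    and (income - knot) beyond it, so its coefficient in a model would
    be the change in slope at the knot.
    """
    return income, max(0.0, income - knot)


def log_income(income):
    """Natural log of income (assumes income > 0)."""
    return math.log(income)


# Hypothetical incomes, in the same units as the knot from the post.
incomes = [10, 30, 45, 60, 90]

for inc in incomes:
    base, beyond_knot = linear_spline_terms(inc, knot=45)
    print(inc, base, beyond_knot, round(log_income(inc), 3))
```

Either representation keeps every distinct income value distinct,
which is exactly what cutting income into categories would not do.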


On 26 February 2014 01:02, Maggie Skiles wrote:
> Dear all,
> I am performing a logistic regression with variables sex (binary), age
> (binary), income (maybe linear?) and region (categorical - 3 dummies,
> 4 categories).
> I am wanting to look at how these variables play a role on my outcome.
> But, I am also trying to see how these variables - sex, age, and
> income - play a role WITHIN each region. Just based on the EDA, these
> do change largely within the regions.
> I am wondering how to assess my continuous variable. When looking at
> 'income' independently of region by use of lowess plots, it appears
> there is a knot at 45. However, when looking at the lowess plots of
> income for each region, the patterns differ largely (knot at 15, knot
> at 60, one that is linear, and one that is more of a M-shape).
> Is there a way to address this situation? It does not seem like one
> linear spline is appropriate, especially to assess it within regions.
> But it is clearly not a parametric linear line either.
