Doug Mounce <[email protected]> asked about linear regression
with binary independent variables:
> Any recommendations for a good online description and explanation
> for applying the multiple partial F-test? Also, what's the general
> way to think about using regress when the Y variable is continuous,
> but the X is binary?
>
> We've been learning in class how to describe the regress of one
> continuous variable on another, and I understand how to do a log
> transform and look for curvature or heteroscedasticity. We do
> jackknife residuals and some other diagnostics, and I'm stumped on
> how to describe the regress when the independent variable is binary.
> I can talk about the regression coefficient, I guess, but the R^2
> doesn't account for much variation in this data.
Doug also said, "Sorry if asking for help with homework is bad form, ...".
It is never bad form to ask for understanding, so I'm going to rattle
on a bit.
One important thing to learn is that any table of means can be
reproduced as a linear regression. For instance, consider the
following (one-way) table:

                     avg.
    Sex    |  blood pressure
    -------+----------------
    Male   |       140
    Female |       138
Are male and female avg. blood pressure different? Linear regression can
answer that question, and I will show how.
Or consider the following (two-way) table:

                 untreated        treated
                   avg.             avg.
    Sex    |  blood pressure   blood pressure
    -------+---------------------------------
    Male   |       170              160
    Female |       165              158
Are the effects of treatment different for males and females? Linear
regression can answer that question.
Or consider the following (many-way) table:

    presence of    | untreated  treatment 1  treatment 2
    complications  |   avg bp      avg bp       avg bp
    ---------------+-------------------------------------
    Males 40-50:   |
      Without      |    152        etc.
      With         |    167
    Males 51-60:   |
      Without      |    etc.
      With         |
    Males 60-70:   |
      Without      |
      With         |
    Females 40-50: |
      Without      |
      With         |
    Females 51-60: |
      Without      |
      With         |
    Females 60-70: |
      Without      |
      With         |
There are lots of questions we could ask (and answer) with linear
regression. Do complications matter? If they do, do they matter
for females? Does age matter? Maybe it only matters when there are
complications? Are treatment 1 and treatment 2 equivalent among males?
How about all of the above?
With linear regression, we can
1. Fill in the tables.
2. Answer any of the above questions, and more.
3. Fill in constrained tables, tables that use the data efficiently
under some assumption, such as that age only matters when
there are complications, and that complications only matter
among males, all at the same time.
In fact, once you get into this, I predict you will find that
linear-regression results are easier to understand and to interpret than the
tables, although you will still want to fill in the tables when presenting
results to nonprofessionals. Everybody knows how to read a table.
In any case, there is a one-to-one correspondence between tables and
linear regressions. They are different ways of writing down the
same thing.
Consider the following linear regression:
bp = b0 + b1*female + noise
where bp is blood pressure, female is a variable that is 1 if the subject is
female and 0 otherwise, and b0 and b1 are coefficients to be estimated.
An important feature of linear regression is its relationship to means.
    E(bp) = E(b0 + b1*female + noise)       (E() is the expectation operator)
          = E(b0) + E(b1*female) + E(noise)
          = b0 + b1*E(female)               (E(noise) = 0 by assumption)
E() is the expectation operator, and E(bp) means (is defined to be) the
mean of blood pressure. E(female) means (is defined to be) the mean
value of variable female, which is the proportion female in the population.
Let's take the equation
E(bp) = b0 + b1*E(female)
and use it to answer some questions.
What is the average blood pressure for males? Answer: In the case of males,
variable female is always equal to 0, therefore, the mean value of variable
female among males is 0. Therefore, the average blood pressure is
    E(bp | male) = b0
What is the average blood pressure of females? Answer: In the case of
females, variable female is always equal to 1. Therefore, the mean value
of variable female among females is 1. Therefore, the average blood pressure
is
    E(bp | female) = b0 + b1
To wit, coefficient b1 measures the difference between female and male
average blood pressures.
Are male and female average blood pressures the same? That is merely a
question of whether coefficient b1 is 0. We can read that off the
t-statistic reported by the regression, or we can type
. regress bp female
. test female==0
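To see the table/regression correspondence numerically, here is a small sketch in Python with numpy (the data are hypothetical; Stata's -regress- does the same computation). The fitted intercept is the male mean, and the coefficient on female is the female-minus-male difference:

```python
import numpy as np

# Hypothetical blood-pressure data: male mean is 140, female mean is 138
bp     = np.array([140.0, 142.0, 138.0, 138.0, 140.0, 136.0])
female = np.array([  0.0,   0.0,   0.0,   1.0,   1.0,   1.0])

X = np.column_stack([np.ones_like(female), female])   # constant and dummy
(b0, b1), *_ = np.linalg.lstsq(X, bp, rcond=None)

# b0 is the male mean (140); b0 + b1 is the female mean (138), so b1 = -2
```

No matter what data you plug in, b0 and b0 + b1 reproduce the two group means exactly, because a regression on a single dummy is saturated: one parameter per group.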
Now let's add a treatment effect,
. regress bp female treatment
which runs the regression
bp = b0 + b1*female + b2*treatment + noise
Playing the same expected-value (i.e., mean) game, we get the following:

                 untreated        treated
                   avg.             avg.
    Sex    |  blood pressure   blood pressure
    -------+---------------------------------
    Male   |      b0              b0+b2
    Female |      b0+b1           b0+b1+b2
If we look closely at the above table we will see that we constrained the
effect of treatment to be the same for males and females. Subtract untreated
from treated in each row:
    males:   b0+b2 - b0 = b2
    females: b0+b1+b2 - (b0+b1) = b2
If we wanted to run a regression where the female treatment effect could be
different from the male treatment effect, we need to add a coefficient for
it:
. gen treatedfem = treatment*female
. regress bp female treatment treatedfem
which estimates the regression,
    bp = b0 + b1*female + b2*treatment + b3*female*treatment + noise
and our mean table now is:

                 untreated        treated
                   avg.             avg.
    Sex    |  blood pressure   blood pressure
    -------+---------------------------------
    Male   |      b0              b0+b2
    Female |      b0+b1           b0+b1+b2+b3
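The interaction model is saturated: four coefficients, four cells, so the regression reproduces the cell means exactly. Here is a numpy sketch using the cell means from the two-way table earlier (170/160 for males, 165/158 for females) as hypothetical data, one observation per cell:

```python
import numpy as np

# One observation per sex-by-treatment cell, set equal to the cell means
bp        = np.array([170.0, 160.0, 165.0, 158.0])
female    = np.array([  0.0,   0.0,   1.0,   1.0])
treatment = np.array([  0.0,   1.0,   0.0,   1.0])
treatedfem = female * treatment                    # the interaction term

X = np.column_stack([np.ones(4), female, treatment, treatedfem])
(b0, b1, b2, b3), *_ = np.linalg.lstsq(X, bp, rcond=None)

# b0 = 170 (male untreated),      b2 = -10 (male treatment effect),
# b1 = -5 (untreated F-M gap),    b3 = 3 (extra female treatment effect)
```

Reading the coefficients off the mean table above: b3 is (female treated minus female untreated) minus (male treated minus male untreated), i.e., (158-165) - (160-170) = 3.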
Let's use these results to answer the following questions:
(1) Is the effect of treatment the same for males and females? That is
just a question of whether b3==0.
(2) Does treatment have an effect among males? It does not if b2==0.
(3) Does treatment have an effect among females? It does not if b2+b3==0.
(4) Does treatment in general have an effect? It does not if b2==0 AND
b2+b3==0. That turns out to have the same answer as b2==0 AND b3==0,
which is interesting, but not substantively important.
Let's answer the four questions:
. test treatedfem==0 (1)
. test treatment==0 (2)
. test treatment+treatedfem==0 (3)
. test treatment==0
. test treatment+treatedfem==0, accum (4)
This last test is the multiple partial F-test about which Doug asked.
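Under the hood, that accumulated test compares the full model with a restricted model in which treatment has no effect for anybody. A numpy sketch of the mechanics (the computation is the standard F formula; the data are made up, two observations per cell):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

# Hypothetical data: two observations per sex-by-treatment cell
bp        = np.array([170.0, 174.0, 160.0, 162.0, 165.0, 167.0, 158.0, 156.0])
female    = np.array([  0.0,   0.0,   0.0,   0.0,   1.0,   1.0,   1.0,   1.0])
treatment = np.array([  0.0,   0.0,   1.0,   1.0,   0.0,   0.0,   1.0,   1.0])
treatedfem = female * treatment

ones   = np.ones_like(bp)
X_full = np.column_stack([ones, female, treatment, treatedfem])
X_rest = np.column_stack([ones, female])      # restricted: no treatment effect

q  = 2                                        # restrictions: b2==0 and b2+b3==0
df = len(bp) - X_full.shape[1]                # residual degrees of freedom
F  = ((rss(X_rest, bp) - rss(X_full, bp)) / q) / (rss(X_full, bp) / df)
```

Compare F against the F(q, df) distribution; a large value rejects the hypothesis of no treatment effect in either group.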
Doug needs to put together some other tables in linearregression form on his
own. Eventually, he'll get to the point where he can write down the
linear regression without thinking.
What about R^2?
---------------
Doug asked about R^2. He mentioned that the R^2 did not account for much of
the variation in his data.
Who cares?
The R^2 is just a reflection of the variance of the noise term.
Let's go back to our simplest regression:
bp = b0 + b1*female + noise
If the R^2 is small, then that means the blood pressure of individual
patients exhibits substantial variation about the mean, but that does not
invalidate the mean. Nor does it invalidate the tests performed on the mean.
They all take into account the variation. More variation means that you will
need more data to uncover an effect, assuming it is present.
More variation means that, if you don't find an effect, that may be due to
insufficient data, but in linear regression, that's pretty easy to detect.
Are males the same as females? In the above linear regression, that
translates to whether b1==0. Say you cannot reject that b1 is equal to 0.
Look at your regression output. Look at the 95% confidence interval reported
for coefficient b1. The 95% C.I. will include 0, but I don't care about that.
Look at the lower and upper bounds. Are they near 0, or are they large? That
answers the ignorance question. If the 95% confidence interval is [-48,52]
and we're talking about blood pressure, then you didn't measure the b1 effect
very precisely. You really don't know whether b1 is 0. If the 95%
confidence interval is [-4, 3], then I'd say you've pretty well established
that the difference between males and females is small.
That's one of the best features of regression. Look at the coefficients
and their CI's, and you can see what you measured and how well you measured
it. That's much more informative than just reporting a test result.
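The confidence-interval arithmetic is simple enough to do by hand. A numpy sketch for the male/female regression, on the same hypothetical data as before; note it uses the normal critical value 1.96 rather than the exact small-sample t value, so it is only an approximation of what regression output reports:

```python
import numpy as np

bp     = np.array([140.0, 142.0, 138.0, 138.0, 140.0, 136.0])
female = np.array([  0.0,   0.0,   0.0,   1.0,   1.0,   1.0])

X = np.column_stack([np.ones_like(female), female])
b, *_ = np.linalg.lstsq(X, bp, rcond=None)

resid  = bp - X @ b
sigma2 = resid @ resid / (len(bp) - X.shape[1])    # residual variance
vcov   = sigma2 * np.linalg.inv(X.T @ X)           # coefficient covariance
se_b1  = np.sqrt(vcov[1, 1])

lo, hi = b[1] - 1.96 * se_b1, b[1] + 1.96 * se_b1
# The interval includes 0, but its width tells you how precisely b1 is measured
```

Here the interval is roughly [-5.2, 1.2]: it includes 0, and its bounds are small relative to typical blood pressures, which is exactly the kind of reading the text recommends.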
ASIDE: We were once interviewing a young graduate at StataCorp.
He was evaluating the efficacy of remote versus in-class teaching.
He had done a small survey. He reported ANOVA results -- linear
regression, but without the coefficients. The effect of remote
learning was insignificant at the 95% level, he reported gleefully.
"What was the point estimate?" we asked, "what was the CI?" He
didn't know. Remember, his survey had few observations. Therefore,
we pointed out, you really don't know whether remote teaching is
equally effective, do you? Maybe, we said, your survey was too small
to cast light on the subject. In fact, we continued to press,
a back-of-the-envelope calculation suggests that is precisely
the case. Devastating. More devastating for his thesis advisor.
What about heteroscedasticity?
------------------------------
The equivalent of heteroscedasticity when all RHS variables are 0 or 1 is
heterogeneous variances for different groups. When we estimate
bp = b0 + b1*female + noise
we assume that E(noise^2)==constant, which is to say, the variance in
blood pressure is the same for males and females. What if it isn't?
Fact is, mean-equality tests are pretty robust to a violation of this
assumption, so I'm pretty calm about it. If you have a p-value of
.001, variance inequality is not going to make the difference vanish.
When variances are different, you are using the data inefficiently. Some
observations are more informative than others, and you can exploit that to get
better estimates, which can make the improved test move either way. You can
estimate the variances for the subgroups by retrieving the residuals and
squaring them, and then you can use those estimates of variance to weight the
data. I won't go into that here. Point is, if you have to do that to get
your result, I'm pretty suspicious. Also true, do that and your result
vanishes, and I'm equally suspicious. Fact is, it rarely happens.
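For the mechanically curious, here is a numpy sketch of that weighting idea (feasible weighted least squares) on hypothetical data: estimate the residual variance within each group, then divide each observation by its group's residual standard deviation and refit. The data and the simple variance estimator are assumptions for illustration only:

```python
import numpy as np

# Hypothetical data: female blood pressures are much noisier than male
bp     = np.array([140.0, 142.0, 138.0, 140.0, 150.0, 160.0, 140.0, 130.0])
female = np.array([  0.0,   0.0,   0.0,   0.0,   1.0,   1.0,   1.0,   1.0])

X = np.column_stack([np.ones_like(female), female])
b_ols, *_ = np.linalg.lstsq(X, bp, rcond=None)
resid = bp - X @ b_ols

# Estimate a separate residual variance for each group from the residuals
s2 = np.where(female == 1, np.mean(resid[female == 1] ** 2),
                           np.mean(resid[female == 0] ** 2))
w = 1.0 / np.sqrt(s2)                  # weight = 1 / group residual s.d.

# Refit on the reweighted data
b_wls, *_ = np.linalg.lstsq(X * w[:, None], bp * w, rcond=None)
```

In this saturated dummy design the point estimates do not move at all (the weights are constant within each group), only the standard errors change; in richer models the coefficients themselves can shift, which is what makes the maneuver suspicious when it is the only thing standing between you and a result.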
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/