# st: RE: Regression across variables

From: "Wallace, John"
To: "'statalist@hsphsun2.harvard.edu'"
Subject: st: RE: Regression across variables
Date: Tue, 11 Nov 2003 15:20:16 -0800

```
Thanks for your reply, Nick.
I was trying to keep my examples general in the belief that it would be more
broadly useful for others, but for clarity's sake, here's a more explicit
example.

Some of the developmental arrays made by my company have probes
complementary (in the DNA sense) to control reagents at specific
concentrations in the sample fluid.  One way to measure the quality of the
arrays is to perform a regression of signal for those probes against the
known concentration of the control reagents in the sample.  I've found that
the slope and r-squared of the least-squares linear regression correlate
nicely with other measures of array quality, but computing the fit isn't
trivial.  At the moment I export the probe intensities from the analysis
software into Excel, line them up against the concentrations for the control
reagents, and use Excel's SLOPE(y,x) and RSQ(y,x) functions to get the
parameters I'm looking for.
I would prefer to do that in Stata, for all the reasons we love Stata.  The
data looks like:

       array_id   a~a_x_at   a~b_x_at   a~c_x_at   a~d_x_at   a~e_x_at
  1.     930877       12.4       22.7       51.5        108      293.5
  2.     930878        7.6         13       53.1         99      244.2
  3.     930898       17.7         37       90.4        198      436.6
  4.     930879       11.5       18.2       55.7        114      277.8
  5.     930884       11.3       24.1       56.6      126.7      301.3
  6.     930885       13.3       19.8         57        139      270.1

The variable names are truncated from affxr2taga_x_at, affxr2tagb_x_at, etc.

The controls are at the following concentrations:
TagA    0.25 E-12 M (i.e. 250 femtomolar)
TagB    0.5 E-12 M
TagC    1.0 E-12 M
TagD    2.0 E-12 M
TagE    4.0 E-12 M

So, in Excel I would have cells like
        A       B       C       D       E
R1      0.25    0.5     1.0     2.0     4.0
R2      12.4    22.7    51.5    108     293.5

And in column F I would use =SLOPE(A2:E2,A1:E1) to get the slope of the
linear regression and =RSQ(A2:E2,A1:E1) to get the coefficient of
determination.

In Stata terms, each observation would get a value in new variables "slope"
and "fit".  I've seen some egen functions like rmean() or rsd() that work at
the observation level like that, calculating values in new variables from a
function performed "across" variables for each observation.

One approach I thought about was using -xpose- to switch observations with
variables, then generating a new variable "conc" and doing a plain ol'
regression of array_id vs conc.  That's less attractive though, because
-xpose- mangles your dataset (even with the varnames option, you can't get
the original variable names back by running -xpose- again).

It seems to me, from reading your earlier replies, that you think I'd like
to, for example, calculate how much the 6 measures of a~a_x_at correlate
with a constant of 0.25.  That's not the case; I'm interested in how the
slope of (a-e vs pM) varies from array to array.

-JW
-----Original Message-----
From: Nick Cox [mailto:n.j.cox@durham.ac.uk]
Sent: Tuesday, November 11, 2003 11:33 AM
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: RE: RE: Regression across variables

Don't be misled; I am not a statistician myself
and indeed have no formal training in it worth
that name.

However, whatever is posted on Statalist is open
to challenge by anyone who can expose error and/or
put forward a better solution, irrespective of
background.

As I understand it, your molarity values are not
variables at all, but constants which
act as gold standards or targets for your variables.

Whether it makes sense to combine the analyses is difficult
to say without understanding the experimental set-up.
There is much advantage in a unified analysis, especially
if in some sense the errors behave similarly across
molarities, but deciding that might be helped by an
initial exploratory analysis, such as

. dotplot A B C D

Things might look simpler on a log scale.

Nick
n.j.cox@durham.ac.uk

Wallace, John

> Thanks Nick - any implication of non-orthodoxy is purely my ignorance in
> these matters.  My formal stat background is pretty weak.  What I was
> trying to show is that there is in effect a variable orthogonal to the
> matrix of observations (the Molarity value), and that I would like to
> regress the row of values for each observation against the row of
> Molarity values (rather than the column of A values against the column
> of B values, for example).
>
> The question would be how to introduce the molarity values into the
> dataset (each variable corresponds to a concentration level that is
> being tested) and how to tell Stata to use them in the regression.
>
> If the answer is the same, I'll just have to plug away and see if I can
> figure out how my mental picture fits into what you said.
>
> I appreciate the help!

Nick Cox

> As I understand it, this is more orthodox
> than you imply, and you could think
> of the analysis as a series of regressions, except that
> you have no covariates, at least that you're
> showing us. That's not fatal, however.
>
> . regress A
>
> says in effect estimate the mean of A,
> and much of the output you get is based
> on the assumption that A follows, or
> should follow, a normal (Gaussian, central)
> distribution.
>
> Following that with
>
> . test _cons = 0.5
>
> is, perhaps, a long-winded way of going
>
> . ttest A = 0.5
>
> except that if you do have covariates,
> the -regress- framework is the one on
> which you can build. Ronan Conroy's
> paper in SJ 2(3) 2002 is a very nice
> example of this principle.
>
> Having said that, the assumption of normality
> is important. It wouldn't surprise me if the
> distributions were skewed and (say) gamma-like,
> so that -glm- is then a better framework.

Wallace, John

> >
> > Hi Statalisters.  I'm trying to get Stata to perform a regression in a
> > data structure different from the usual yvar xvar arrangement.  I'll
> > diagram the data set to show what I mean:
> >
> > Molarity    0.5     1       2       3
> >
> > Variable    A       B       C       D
> > Observ1     .22     .45     .99     1.4
> > Observ2     .23     .5      .98     1.5
> > Observ3     .19     .38     1.1     1.42
> >
> > Molarity in this case would be the constant associated with each
> > variable.  The observations are measurements of the system attempting
> > to quantify the molarity.  The idea would be to generate additional
> > variables that contain the various regression results of the
> > observations vs Molarity.
> >
> > My data set at this point is just variable name against observation
> > number.  I don't know how to associate each variable with the
> > corresponding molarity, or how to tell Stata to perform a regression in
> > this way.  Do I have to -reshape- or is there another way?

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```
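
A minimal sketch of the -reshape- route raised in the original question, offered as one possible approach rather than as the answer given on the list. It assumes a current Stata in which -statsby- is used as a prefix command (the 2003 syntax differed). The array IDs, signal values, full variable names, and concentrations are the ones quoted in the post, and `slope` and `fit` are the result names the post asks for. The names `signal_`, `tag`, and `conc` are introduced here purely for illustration.

```stata
* Sketch only. Data are the six arrays listed in the post; the names
* signal_, tag and conc are assumptions made for this example.
clear
input array_id affxr2taga_x_at affxr2tagb_x_at affxr2tagc_x_at affxr2tagd_x_at affxr2tage_x_at
930877 12.4 22.7 51.5 108   293.5
930878  7.6 13   53.1  99   244.2
930898 17.7 37   90.4 198   436.6
930879 11.5 18.2 55.7 114   277.8
930884 11.3 24.1 56.6 126.7 301.3
930885 13.3 19.8 57   139   270.1
end

* Give the five probe variables a common stub so -reshape- can use them.
foreach t in a b c d e {
    rename affxr2tag`t'_x_at signal_`t'
}

* Wide to long: one row per array/tag pair; tag holds "a" through "e".
reshape long signal_, i(array_id) j(tag) string

* Attach the known control concentration (in pM) to each tag.
generate conc = .
replace conc = 0.25 if tag == "a"
replace conc = 0.5  if tag == "b"
replace conc = 1    if tag == "c"
replace conc = 2    if tag == "d"
replace conc = 4    if tag == "e"

* One regression of signal on concentration per array,
* keeping the slope and the R-squared of each fit.
statsby slope=_b[conc] fit=e(r2), by(array_id) clear: regress signal_ conc

list
```

For a regression with a single predictor, `_b[conc]` and `e(r2)` are the same quantities that Excel's SLOPE and RSQ report. Note that -statsby- with the `clear` option replaces the data in memory with one row per `array_id`; merging the results back onto the wide dataset, or wrapping the steps in -preserve- and -restore-, is left out of this sketch.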