Statalist The Stata Listserver



st: RE: ok to include the denominator of ratio dep var as an independent var too?


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: ok to include the denominator of ratio dep var as an independent var too?
Date   Thu, 25 Jan 2007 14:18:27 -0000

This is a tricky area. I don't have a definitive answer. 

There is a substantial literature on these problems, under
headings like inbuilt or spurious or ratio correlation, going back 
at least to Karl Pearson. I would 
do a literature search using those headings. The references
I am aware of are in fields like hydrology or geology, so the journals
or texts may not be easily accessible to you and the examples may be 
difficult to map on to your territory. 

Some papers do indeed make scary warnings like your questioner's,
though usually in more detail. But the main examples are
often toy problems unlike yours. Here is one salutary game: 

. set obs 100
obs was 0, now 100

. gen x = uniform()

. gen y = uniform()

. gen z = uniform()

. gen yx = y/x

. gen zx = z/x

. scatter yx zx

. corr yx zx
(obs=100)

             |       yx       zx
-------------+------------------
          yx |   1.0000
          zx |   0.6744   1.0000

Many people in the social sciences would be very happy 
to get an R-sq of 45%. Well, you don't need data at all; 
you can do it by taking ratios of random numbers. 

More to the point: 

. corr yx x
(obs=100)

             |       yx        x
-------------+------------------
          yx |   1.0000
           x |  -0.6172   1.0000

So you should be worried -- but you can get a handle
on these problems by simulation, or perhaps bootstrapping. 
Naturally sampling from distributions more relevant to
your data than the uniform is advisable. 
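
A minimal sketch of such a simulation, not a definitive recipe. It
draws x and y independently from lognormal distributions, an
assumption chosen only because size variables are often right-skewed,
and repeats the exercise 1000 times to show how much correlation
between y/x and x arises from ratioing alone. The program name
ratiosim, the sample size, the seed and the distribution parameters
are all illustrative, and the code assumes a Stata release recent
enough to have rnormal().

```stata
* Sketch only: the lognormal shape and n = 100 are assumptions,
* not estimates from any real data.
capture program drop ratiosim
program define ratiosim, rclass
    drop _all
    set obs 100
    generate x  = exp(rnormal())   // stand-in for the denominator
    generate y  = exp(rnormal())   // numerator, independent of x
    generate yx = y / x
    correlate yx x
    return scalar rho = r(rho)     // correlation due to ratioing alone
end

set seed 12345
simulate rho = r(rho), reps(1000) nodots: ratiosim
summarize rho, detail
```

If the correlations simulated under independence are about as strong
as the one in your data, the observed association would be hard to
read as evidence of a genuine relationship.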

I think it's pretty clear that you need to gear 
your analysis to the research question and to try to keep the 
statistical issues secondary. What's most evident from your 
details is that your two models are quite different and so
I would expect that to be echoed in the results. 

One strategy might be to explain y/x as far as you can
in terms of variables other than x; and then to look
at the residuals from that model against x to see whether
structure is being missed. That would be a defence against
the charge that the same variable appears on both sides. 
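
A sketch of that strategy on made-up data, under stated assumptions:
w1 is a hypothetical covariate standing in for "variables other than
x", the data are simulated rather than real, and plain regress is
used only for brevity, not as a recommendation for a panel.

```stata
* Sketch only: all variables below are simulated stand-ins.
clear
set seed 2007
set obs 200
generate x  = exp(rnormal())             // stand-in for GDP
generate w1 = rnormal()                  // hypothetical covariate
generate y  = exp(rnormal() + 0.5*w1)    // stand-in for inflows
generate yx = y / x

* Step 1: model the ratio without x on the right-hand side
regress yx w1
predict double res, residuals

* Step 2: look for leftover structure against x
scatter res x, yline(0)
correlate res x
```

The residuals are uncorrelated with w1 by construction, but nothing
forces them to be unrelated to x, so any pattern in the plot is
structure that the first model missed.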

I think the bottom line is to acknowledge that ratioing
can induce artefacts, but to assert that that does not 
rule out genuine relationships also existing. 

Incidentally, my guess is that penetration ratio 
would be better considered on a log scale, for
all sorts of reasons. I would expect some quirky small
economies to have very high penetration ratios, but
it's some years since I studied economics.
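
One consequence of working on a log scale can be sketched as follows.
Since ln(y/x) = ln(y) - ln(x), regressing ln(y/x) on ln(x) and
regressing ln(y) on ln(x) are the same model written two ways: the
coefficient on ln(x) differs by exactly 1 and the other coefficients
are identical. The data below are simulated, w1 is a hypothetical
covariate and the coefficients 0.8 and 0.5 are arbitrary. With real
data, ln() returns missing for zero or negative values, so this
assumes a strictly positive ratio.

```stata
* Sketch only: simulated logs of a GDP stand-in and an inflows stand-in.
clear
set seed 2007
set obs 200
generate double lx = rnormal()                     // ln(x)
generate double w1 = rnormal()                     // hypothetical covariate
generate double ly = 0.8*lx + 0.5*w1 + rnormal()   // ln(y)
generate double lratio = ly - lx                   // ln(y/x)

regress lratio lx w1
scalar b_ratio = _b[lx]

regress ly lx w1
display "difference in coefficients on lx: " _b[lx] - b_ratio
```

On this scale the questioner's model and yours would differ only in
how the coefficient on GDP is read, not in fit.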

Nick 
[email protected] 

Jason Yackee
 
> I just received this question at a presentation of a paper 
> and I wasn’t sure how to answer it.
> 
> I have a panel data set, and a model that is of the general 
> form: (y/x) = a + b + c+…+ x.  My dependent variable (y/x) is 
> the total dollar amount of foreign capital inflows 
> that a host country receives in a given year as a ratio of 
> the host country’s GDP in that same year (annual capital 
> inflows = y, gdp = x in the model above).  
> 
> This ratio is called the “penetration ratio” in the 
> literature.  I also included GDP on the right-hand side of 
> the equation as a control for each country’s overall economic 
> size.  The GDP variable was a significant, negative predictor 
> of the penetration ratio.  Larger GDP → Less Penetration.  
> 
> The questioner said that it was improper to have “GDP” on 
> “both sides of the equation”, and that it was sufficient to 
> have a model of the form y = a + b + c +…+ x, where “x” is 
> GDP and “y” is simply the dollar value of foreign capital 
> inflows in absolute, not ratio, form. He couldn't explain 
> why.  I couldn't explain why not.
> 
> I re-ran the model in the form the questioner suggested, and 
> the results are overall quite different for the theoretically 
> interesting independent variables.  But my own sense is still 
> that the questioner is wrong, and that my original model was 
> not necessarily improperly specified.  But I don’t have the 
> mastery of statistics to justify my “sense”.  
> 
> Would some kind soul be able to weigh in before my next presentation?



