
# Re: st: Proportional Independent Variables

| Field | Value |
| --- | --- |
| From | Joerg Luedicke |
| To | statalist@hsphsun2.harvard.edu |
| Subject | Re: st: Proportional Independent Variables |
| Date | Thu, 28 Feb 2013 10:03:09 -0500 |

```
See below:

On Thu, Feb 28, 2013 at 6:11 AM, nick bungy
<nickbungystata@hotmail.co.uk> wrote:
> Hi Joerg,
> Thanks for the reply. Could I ask for a little more of your time for clarification?
> Suppose y is large enough that my coefficients are 5000, 6000, 7000 and 8000 respectively. If I adapt your example accordingly and run it, I find that the absolute bias is identical for this new set of coefficients and for yours. In that case, is the deviation of the coefficients from their true values not just a result of the error you added when generating y?

You need to interpret effect sizes (and, for that matter, the deviation
of the expected values of the coefficients from the true values) in
_relative_ terms. In my example, I generated the Gaussian error with a
standard deviation of 1, so the effect sizes that I chose
arbitrarily were all rather small, given that a unit increase in x
covers its entire range (e.g. an effect of 0.4 would be interpreted as
a "change" in the outcome of 0.4 standard deviations when x goes from
0 to 1; this may not be a useful interpretation with real data, as
a real x would often not even cover this entire range; also see Nick's
[...]). When you set the coefficients to 5000,...,8000, did you change
the error variance? For example, if you
would like to plug in values from a previous real data fit, you should
use the RMSE of that fit as the standard deviation of the Gaussian
error, at least if you are also interested in the variation of the
estimates.

> With regard to the coefficients from a simple OLS fit, how would I interpret them given that there has to be a proportionate response? Is the coefficient still only going to be one half of the story (i.e. it captures the effect of var1 but doesn't capture the proportionate response in any or all of the other variables 2-20)?

If you have a set of variables that sum up to some constant and use
only some of these variables but not all of them, and the omitted
variables have a non-zero effect, then you are facing some apparent
omitted variable bias because all variables in the set are necessarily
correlated due to their nature of summing up to a constant. You could
play around with simulations using a variety of assumptions and data
generating models in order to see how that plays out and how relevant
it might be with respect to your problem.

> The restriction that the omitted variable has an effect of zero seems quite restrictive. If I eventually want to be able to say 'if xi increases and xj is the proportionate response, the change in y is ...' for all i not equal to j, then at some point this assumption will be violated. Is there a way around this?

I don't know. I don't really have any experience with compositional
data and it is probably best to consult the relevant literature. Also,
Nick's suggestions seem to be very useful in this regard. My point
here merely was that, if unsure about whether certain modeling
assumptions make sense, fabricated data simulations can be very
illuminating.

Joerg

> Many thanks,
> Nick
>> Date: Thu, 28 Feb 2013 01:43:00 -0500
>> Subject: Re: st: Proportional Independent Variables
>> From: joerg.luedicke@gmail.com
>> To: statalist@hsphsun2.harvard.edu
>>
>> I should have added that this is assuming that the omitted variable
>> has an effect of zero. If the effect of the omitted variable is
>> non-zero, then the estimates for the other variables are biased by an
>> amount equal to the effect size of the omitted predictor. For example,
>> if the effect for cnsx1 was 0.1 and the fifth variable (cnsx5) had an
>> effect of 0.1 as well, then the estimate for cnsx1 would be zero when
>> fitting the model without cnsx5 (in expectation).
>>
>> Joerg
>>
>>
>> On Thu, Feb 28, 2013 at 12:39 AM, Joerg Luedicke
>> <joerg.luedicke@gmail.com> wrote:
>> > When unsure about things like these, it is always a good idea to run a
>> > bunch of simulations with fabricated data. Below is some code for
>> > checking consistency of OLS estimates, based on the described set up.
>> > First, we generate 5 variables containing uniform random variates on
>> > the range [0,1), and constrain the variables such that they sum up to
>> > one for each observation. Then, we set up a program to feed to Stata's
>> > -simulate-, and finally inspect the results. You can change sample
>> > size, number of variables, and parameter values to more closely
>> > resemble your problem at hand.
>> >
>> > The amount of bias indeed looks negligible to me, confirming Nick Cox's
>> > impressions. Efficiency might be a different story though...
>> >
>> > Joerg
>> >
>> > *--------------------------------------------
>> > // Generate data
>> > clear
>> > set obs 500
>> > set seed 1234
>> >
>> > // Draw five uniform variates on [0,1)
>> > forval i=1/5 {
>> >     gen u`i' = runiform()
>> > }
>> >
>> > // Row totals and their reciprocals, used to rescale each row to sum to one
>> > egen su = rowtotal(u*)
>> > gen wu = 1/su
>> >
>> > forval i=1/5 {
>> >     gen cnsx`i' = u`i'*wu
>> > }
>> >
>> > keep cnsx*
>> >
>> > // Set up program for -simulate-
>> > capture program drop mysim
>> > program define mysim, rclass
>> >
>> >     // New error draw and outcome in each replication; cnsx5 has no effect
>> >     cap drop e y
>> >     gen e = rnormal()
>> >     gen y = 0.1*cnsx1 + 0.2*cnsx2 + ///
>> >         0.3*cnsx3 + 0.4*cnsx4 + e
>> >     reg y cnsx1 cnsx2 cnsx3 cnsx4
>> >
>> >     // Return the four slope estimates
>> >     forval i = 1/4 {
>> >         return scalar b`i' = _b[cnsx`i']
>> >     }
>> >
>> > end
>> >
>> > // Run simulations
>> > simulate b1=r(b1) b2=r(b2) b3=r(b3) b4=r(b4), ///
>> >     reps(10000) seed(4321) : mysim
>> >
>> > // Results: means and standard deviations of the estimates
>> > sum
>> > *--------------------------------------------
>> >
>> >
>> > On Wed, Feb 27, 2013 at 3:40 PM, nick bungy
>> > <nickbungystata@hotmail.co.uk> wrote:
>> >> Dear Statalist,
>> >>
>> >> I have a continuous dependent variable and a set of 20 independent
>> >> variables that are percentages, with the condition that these
>> >> variables must sum to 100% for each observation. The data are
>> >> cross-sectional only.
>> >>
>> >> I am aware that interpreting the coefficients from a general OLS fit
>> >> will be inaccurate: an increase in one of the 20 variables has to be
>> >> accompanied by a decrease in one or more of the other 19 variables.
>> >>
>> >> Is there an approach to get consistent coefficient estimates of these
>> >> parameters that accounts for the proportionate decrease in one or more
>> >> of the other 19 variables?
>> >>
>> >> Best,
>> >>
>> >> Nick
>> >>
>> >> *
>> >> * For searches and help try:
>> >> * http://www.stata.com/help.cgi?search
>> >> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> >> * http://www.ats.ucla.edu/stat/stata/
```
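
The point made in the thread about an omitted share with a non-zero effect can be checked with a small variation on the quoted simulation. Because the five shares sum to one, cnsx5 = 1 - (cnsx1 + ... + cnsx4), so a model with effects b1,...,b5 can be rewritten as y = b5 + (b1 - b5)*cnsx1 + ... + (b4 - b5)*cnsx4 + e. What follows is a minimal sketch, not the poster's own code: the effect of 0.1 for cnsx5, the program name `mysim5`, the returned constant `b0`, and the reduced number of replications are all assumptions chosen for illustration.

```stata
// Sketch: same set-up as in the thread, but cnsx5 now has a non-zero
// effect (0.1, an assumed value) and is still left out of the fitted model.
clear all
set obs 500
set seed 1234

// Five uniform draws, rescaled so that they sum to one in each row
forval i = 1/5 {
    gen u`i' = runiform()
}
egen su = rowtotal(u1-u5)
forval i = 1/5 {
    gen cnsx`i' = u`i'/su
}
keep cnsx*

// Hypothetical program name; one replication of the data-generating model
capture program drop mysim5
program define mysim5, rclass
    capture drop e y
    gen e = rnormal()
    gen y = 0.1*cnsx1 + 0.2*cnsx2 + 0.3*cnsx3 + 0.4*cnsx4 + 0.1*cnsx5 + e
    // cnsx5 is omitted from the regression
    regress y cnsx1 cnsx2 cnsx3 cnsx4
    forval i = 1/4 {
        return scalar b`i' = _b[cnsx`i']
    }
    return scalar b0 = _b[_cons]
end

// Fewer replications than in the thread, to keep the run short
simulate b0=r(b0) b1=r(b1) b2=r(b2) b3=r(b3) b4=r(b4), ///
    reps(1000) seed(4321): mysim5

summarize
```

Under these assumptions, the algebra above implies that the means reported by -summarize- should be close to 0.1 for the constant and to 0, 0.1, 0.2 and 0.3 for b1 to b4, that is, each slope shifted down by the omitted effect. To mimic a real data fit, as suggested in the reply, the error could instead be drawn with `rnormal(0, s)`, where `s` is the RMSE of that fit (stored as `e(rmse)` after -regress-).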