
# Re: st: Proportional Independent Variables

| From | To | Subject | Date |
|---|---|---|---|
| Nick Cox | statalist@hsphsun2.harvard.edu | Re: st: Proportional Independent Variables | Thu, 28 Feb 2013 13:32:06 +0000 |

```
You will have to fudge the zeros (#2) before you apply logratios (#1).
As before, a key question is: are they structural (inevitably 0) or
sampling (happen to be or to be reported as 0)?

I got some of the guts of this field coded up as Mata functions a
while back, but there is no documentation and that may not help much.

// compositional data analysis

mata :

mata drop cda_*()

// NJC 1 Sept 2008
// rows scaled to sum to 1
real matrix function cda_closure(real matrix X) {
    return(X :/ rowsum(X))
}

// NJC 1 Sept 2008
// ln(all but last column / last column)
real matrix function cda_alr(real matrix X) {
    real scalar c, cm1
    c = cols(X); cm1 = c - 1
    return(ln(X[, (1 .. cm1)]) :- ln(X[, c]))
}

// NJC 1 Sept 2008
// ln(all / row geometric means)
real matrix function cda_clr(real matrix X) {
    return(ln(X) :- mean(ln(X'))')
}

// NJC 1 Sept 2008
// centring
real matrix cda_centre(real matrix X) {
    real rowvector centre, invcentre
    centre = cda_closure(exp(mean(ln(X))))
    invcentre = cda_closure((1 :/ centre))
    return(cda_closure(X :* invcentre))
}

// NJC 3 Sept 2008
// column geometric means
real matrix cda_colgmean(real matrix X) {
    return(exp(mean(ln(X))))
}

// NJC 3 Sept 2008
// row geometric means
real matrix cda_rowgmean(real matrix X) {
    return(exp(mean(ln(X'))'))
}

// NJC 2 Sept 2008
// multiplicative replacement for rounded zeros
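real matrix cda_mrzero(real matrix X, real rowvector delta, | real scalar total) {
    real matrix iszero
    // total defaults to 1 when the optional argument is omitted
    if (total == .) total = 1
    iszero = X :== 0
    // zeros become delta; nonzero parts are scaled down so each row keeps its total
    return((iszero :* delta) + ((!iszero) :* X :* (1 :- rowsum(iszero :* delta) :/ total)))
}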

// NJC 10 Oct 2008
// isometric log-ratio transformation
real matrix function cda_ilr(real matrix X) {
    real scalar c, j
    real matrix Y, lnX
    c = cols(X)
    Y = X[, (1 .. c - 1)]; lnX = ln(X)
    // column j contrasts the first j parts with part j + 1
    for (j = 1; j < c; j++) {
        Y[, j] = rowsum(lnX[, (1 .. j)]) - j * lnX[, j + 1]
        Y[, j] = (1 / sqrt(j * (j + 1))) * Y[, j]
    }
    return(Y)
}

end

On Thu, Feb 28, 2013 at 1:19 PM, nick bungy
<nickbungystata@hotmail.co.uk> wrote:
> Thank you for your responses,
> My thoughts following this discussion are the following:
> 1. Apply a logratio transformation to the data in the short run
> 2. Look into a simplex mixture approach as a longer term aspiration, given that my data do have a very large number of zeros. I noticed the topic was mentioned in the book you kindly linked, Nick, so that will be my first avenue to explore.
> Best,
> Nick
>
> ----------------------------------------
>> Date: Thu, 28 Feb 2013 07:35:23 -0500
>> Subject: Re: st: Proportional Independent Variables
>> From: jvverkuilen@gmail.com
>> To: statalist@hsphsun2.harvard.edu
>>
>> On Thu, Feb 28, 2013 at 4:19 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> >
>> > 2. For different reasons log and logit transformations might be
>> > considered. There is a very inward-looking literature on compositional
>> > data analysis centred on more exotic transformations tailored to the
>> > problem. The reference I gave earlier is one entry into that.
>>
>> I was going to throw out the same reference. It's not a trivial
>> problem, but a narrow one due to the way it's been written. But the
>> walkaway message of most of it is that the log-ratio transformation is
>> the most reasonable one. With only two components it reduces to the
>> logit, that is, the log-odds. The logic is very similar to the
>> multinomial logit, with the same difficult dependence structure.
>>
>>
>>
>> > 3. The two previous points are often complicated by measured zeros.
>> > There is then a long slow agony about whether they are structural or
>> > sampling zeros and what to do about them. The more components are
>> > measured, the worse this usually gets, whether it is fractions of a
>> > budget spent on different things, or proportions of a material by
>> > elements or compounds or particle size classes, or whatever.
>>
>> Yes, this is a real issue, and unfortunately the transformations used
>> can create huge outlier problems, just like log transforms do when
>> there's a 0 value.
```
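Since the functions in the post come without documentation, a short usage sketch may help. It assumes the `cda_*()` definitions above have already been run in the current Stata session. The data matrix `X`, the proportions in `p`, and the replacement value 0.01 are made up for illustration and are not taken from the thread.

```mata
mata:
// hypothetical 3-part compositions, one row per observation;
// row 2 has a zero in its second part
X = (0.2, 0.3, 0.5 \ 0.6, 0, 0.4)

// replace zeros by an assumed small value (0.01 for each part);
// nonzero parts shrink so that each row still sums to total = 1
R = cda_mrzero(X, (0.01, 0.01, 0.01))
R
rowsum(R)

// log-ratio transforms of the zero-free compositions
cda_alr(R)      // 2 columns: ln(part j / last part)
cda_clr(R)      // 3 columns; each row sums to 0 up to rounding
cda_ilr(R)      // 2 columns

// with only two parts, alr is the logit of the first part
p = (0.2 \ 0.7)
cda_alr((p, 1 :- p)) - logit(p)     // zeros up to rounding
end
```

Calling `cda_alr(X)` on the raw data would instead return a missing value in row 2, because `ln(0)` is missing in Mata, which is why the zeros have to be dealt with before any log-ratio is taken. The last lines illustrate the remark in the thread that the log-ratio transformation reduces to the logit when there are only two components.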