
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: RE: unexplained discrepancy between


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: RE: RE: unexplained discrepancy between
Date   Thu, 22 Dec 2011 00:55:13 +0000

I got it, belatedly. Your code

local y cnratio_min_p cnratio_max_p cnratio_mean_p
//create the median cut points for each min, max and mean ratio

foreach x of local y    {
               egen `x'_median = median(`x')
               label variable `x'_median "Median Cut Point `x'"
}

*** create hilo vars with loop ***
local x cnminhilo cnmaxhilo cnmeanhilo


foreach var of local x  {
       gen `var' = .
       foreach val in `y' {
               replace `var' = 1 if `val' > `val'_median & `val' < .
               replace `var' = 0 if `val' <= `val'_median
       }
}

can be rewritten like this

foreach x in cnratio_min_p cnratio_max_p cnratio_mean_p  {
               egen `x'_median = median(`x')
               label variable `x'_median "Median Cut Point `x'"
}

foreach var in cnminhilo cnmaxhilo cnmeanhilo  {
       gen `var' = .
       foreach val in  cnratio_min_p cnratio_max_p cnratio_mean_p  {
                 replace `var' = 1 if `val' > `val'_median & `val' < .
                 replace `var' = 0 if `val' <= `val'_median
       }
}

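If you trace through the second block, you will find that the results
for min and max are always overwritten with those for mean: for each
new variable the inner loop runs over all three ratio variables in
turn, so the last one, cnratio_mean_p, has the final say wherever it
is not missing. You have two nested loops but the problem calls for
only a single loop. In shortening your code earlier, I was also
correcting it, because I could see what you wanted.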
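
For readers who want to keep the separate median variables, the single loop can be sketched as follows. This is only an illustration, not code from the thread: it assumes the `_median` variables from the first block already exist, and the local macro name `val` is a hypothetical choice.

```stata
* Sketch: one loop, so each indicator is built from its own ratio variable only.
* Assumes cnratio_min_p_median, cnratio_max_p_median and cnratio_mean_p_median exist.
foreach x in min max mean {
    local val cnratio_`x'_p                      // source variable for this pass
    gen cn`x'hilo = .
    replace cn`x'hilo = 1 if `val' > `val'_median & `val' < .
    replace cn`x'hilo = 0 if `val' <= `val'_median
}
```

Because the target name and the source name are both derived from the same loop element, no pass can overwrite the result of another.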

Nick

On Wed, Dec 21, 2011 at 9:29 PM, Steve Nakoneshny <[email protected]> wrote:


Nick,

Thank you very much for the quick response. Like you, I had initially
suspected that the issue related to precision, but I wanted the
opinion of the list to validate my assumption. Just as interesting:
when I substituted in the condensed loop that you provided for the
longer one we had written, the discrepancies disappeared. Rather than
chasing that down any further, we inserted your loop into our do file
and will chalk that one up to experience. I hadn't been aware of the
-cond- function previously.

Thanks,
Steve



On 2011-12-21, at 11:46 AM, Nick Cox wrote:

> No; belay that. You are using the same median variables in both comparisons according to this code.
>
> So, what I said looks wrong. Sorry,
>
> Nick
> [email protected]
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
> Sent: 21 December 2011 18:22
> To: '[email protected]'
> Subject: st: RE: unexplained discrepancy between
>
> At the heart of this are comparisons that include a test for equality with the median. In principle the median will always be one of the observed values if the sample size is odd but it may not be if the sample size is even. So in principle looking for equality with the median clearly does make sense.
>
> However, the code for -egen, median()- shows that the function does its calculations with a -double- version of the variable supplied, whereas you make clear that you are working with -float-s. I therefore suspect a minor problem of precision here, as your other operations are not guaranteed to produce identical results.
>
> In general it is optimistic to do any testing for equality with a non-integer value unless you maintain absolute consistency of variable types throughout.
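
The general point about mixing storage types can be seen in a small self-contained sketch. The variable names `xf` and `xd` are made up for illustration and have nothing to do with the poster's data.

```stata
* Sketch: the same decimal constant stored as float and as double
clear
set obs 1
generate float  xf = 0.1
generate double xd = 0.1
display xf[1] == xd[1]        // 0: the float-rounded value differs from the double
display xf[1] == float(0.1)   // 1: rounding the constant to float restores equality
```

Equality tests on non-integer values are therefore only safe when both sides have been rounded to the same precision.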
>
> A secondary issue is that your code can be cut right down. On the evidence here you don't need a separate variable holding the median for each variable.
>
> foreach x in min max mean {
>               local var cnratio_`x'_p
>               qui su `var', detail
>               gen cn`x'hilo = cond(missing(`var'), . , `var' > r(p50))
> }
>
> That aside, if you -list- the values for which the indicator is 1 one way and 0 the other way I suspect that you will find that they are very close indeed and that all that has happened is that a knife-edge decision went different ways depending on a few bits, perhaps even one.
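
One way to carry out that check is sketched below. It assumes the variables from the original post (cnminhilo, cnminhilo_jcd, cnratio_min_p and cnratio_min_p_median) exist in memory.

```stata
* Sketch: inspect the observations on which the two indicators disagree
count if cnminhilo != cnminhilo_jcd
list cnratio_min_p cnratio_min_p_median cnminhilo cnminhilo_jcd ///
    if cnminhilo != cnminhilo_jcd, noobs
```

If the listed values sit clearly above or below the median rather than within a few bits of it, precision is unlikely to be the explanation.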
>
> Nick
> [email protected]
>
> Steve Nakoneshny
>
> We are working with a dataset of biomarker expression data. A colleague created some dummy variables using the median value as a dichotomous cut point for high / low expression. We also felt that this process would lend itself extremely well to using a loop. Here is the code we wrote / executed:
>
> --- begin code ---
>
> local y cnratio_min_p cnratio_max_p cnratio_mean_p            //create the median cut points for each min, max and mean ratio
>
> foreach x of local y  {
>               egen `x'_median = median(`x')
>               label variable `x'_median "Median Cut Point `x'"
> }
>
> *** create hilo vars with loop ***
> local x cnminhilo cnmaxhilo cnmeanhilo
>
>
> foreach var of local x        {
>       gen `var' = .
>               foreach val in `y' {
>       replace `var' = 1 if `val' > `val'_median & `val' < .
>       replace `var' = 0 if `val' <= `val'_median
>       }
> }
>
> *** Here's the old school way to create hilo variables for each of cn min max and mean ***
> gen cnminhilo_jcd = .
> replace cnminhilo_jcd=1 if cnratio_min_p > cnratio_min_p_median & cnratio_min_p < .
> replace cnminhilo_jcd=0 if cnratio_min_p <= cnratio_min_p_median
> label variable cnminhilo_jcd "CN Ratio > Median Cutpoint of Min"
> tab cnminhilo_jcd,m
>
> gen cnmaxhilo_jcd = .
> replace cnmaxhilo_jcd =1 if cnratio_max_p > cnratio_max_p_median & cnratio_max_p < .
> replace cnmaxhilo_jcd =0 if cnratio_max_p <= cnratio_max_p_median
> label variable cnmaxhilo_jcd "CN Ratio > Median Cutpoint of Max"
> tab cnmaxhilo_jcd,m
>
> gen cnmeanhilo_jcd = .
> replace cnmeanhilo_jcd =1 if cnratio_mean_p > cnratio_mean_p_median & cnratio_mean_p < .
> replace cnmeanhilo_jcd =0 if cnratio_mean_p <= cnratio_mean_p_median
> label variable cnmeanhilo_jcd "CN Ratio > Median Cutpoint of Mean"
> tab cnmeanhilo_jcd,m
>
> --- end code ---
>
>
>
> We then crosstabbed the results from each method to validate the results and found some discrepancies. Here is the output:
>
> --- begin code ---
>
>
> . tab cnminhilo cnminhilo_jcd,m
>
>           |  CN Ratio > Median Cutpoint of
>           |               Min
> cnminhilo |         0          1          . |     Total
> -----------+---------------------------------+----------
>         0 |        51          6          0 |        57
>         1 |         6         50          0 |        56
>         . |         0          0         13 |        13
> -----------+---------------------------------+----------
>     Total |        57         56         13 |       126
>
>
> . tab cnmaxhilo cnmaxhilo_jcd,m
>
>           |  CN Ratio > Median Cutpoint of
>           |               Max
> cnmaxhilo |         0          1          . |     Total
> -----------+---------------------------------+----------
>         0 |        50          7          0 |        57
>         1 |         7         49          0 |        56
>         . |         0          0         13 |        13
> -----------+---------------------------------+----------
>     Total |        57         56         13 |       126
>
>
> . tab cnmeanhilo cnmeanhilo_jcd,m
>
>           |  CN Ratio > Median Cutpoint of
>           |               Mean
> cnmeanhilo |         0          1          . |     Total
> -----------+---------------------------------+----------
>         0 |        57          0          0 |        57
>         1 |         0         56          0 |        56
>         . |         0          0         13 |        13
> -----------+---------------------------------+----------
>     Total |        57         56         13 |       126
>
> --- end code ---
>
> We then explored the data and found that the 6 obs where cnminhilo==1 & cnminhilo_jcd==0 were incorrectly coded in cnminhilo. The same held true for the other discrepancies in cnminhilo and cnmaxhilo.
>
> We've looked at the syntax of the loop and cannot see any differences between it and the longer hand-coding method used. We're at a loss to explain why and how these discrepancies arose. If it helps at all, all variables used here are stored as floats and we're using Stata/IC 11.2 for Mac. Hopefully someone can help enlighten us.
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>









© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index