Statalist



Re: st: Re: Bug in -use- or -if- ?


From   Sergiy Radyakin <serjradyakin@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Re: Bug in -use- or -if- ?
Date   Thu, 5 Feb 2009 10:18:56 -0500

Dear Joseph,

I agree that this might be a plausible explanation of what's going on,
but this is hardly what one would expect. E.g. when I type a variable
name in {use filename.dta if age>18}, I mean the variable age in the
dataset filename.dta, not in the current dataset in memory. Following
the same logic, _n must also refer to the observation number in the
file. Otherwise, what would be a valid example of using _n in an
if-condition after -use-? (if I understand you correctly it will
always evaluate either to zero or one for the whole statement -use-,
not for each observation).

The definition of _n that you quote never refers to the location of
observations. "Number of current observation" where? In memory or in
the file? For all other commands it is perfectly clear: in memory.
But for -use- it doesn't make sense to refer to memory, since memory
may well be empty. Say I am reading a file with
   use filename.dta if (age>18) & (int(_n/2)==_n/2)
taking all even-numbered observations of persons older than 18.
Then, when Stata reads the data, it maintains two counters: one, call
it FC, is the record number in the file, and the other, MC, is the
observation number in memory (and FC>=MC always). Evaluating _n to MC
makes little sense, since I may not know a priori how many records the
if-condition will fetch. But evaluating _n to FC makes perfect sense,
because it defines which part of the file is eligible for loading.
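
The two counters can be emulated in memory once the whole file is
loaded. What follows is only a minimal sketch, meant to be run as a
do-file. It assumes the auto data shipped with Stata (74 observations,
as in the quoted example), with mpg>18 standing in for age>18. The
variable names fc and mc are made up for illustration. It shows what
the FC reading of _n would select; it says nothing about how -use-
itself evaluates _n.

```stata
* Emulate the file counter (FC) and the memory counter (MC) after a full load
sysuse auto, clear
generate long fc = _n        // FC: record number in file order (hypothetical name)
keep if mpg > 18             // stand-in for the age>18 part of the condition
generate long mc = _n        // MC: position among the records kept (hypothetical name)
assert fc >= mc              // FC never falls behind MC
keep if mod(fc, 2) == 0      // even-numbered records of the file, not of the subset
```

Because everything is loaded before filtering, this does not help when
the file is too large for memory; in that case slicing with -in- is
still the way to go.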

Best regards,
   Sergiy Radyakin

On Wed, Feb 4, 2009 at 8:59 PM, Joseph Coveney <jcoveney@bigplanet.com> wrote:
> Sergiy, I believe that your colleague is correct in how Stata interprets the
> underscore variable, _n.  The help file states, "_n contains the number of
> the current observation."  And it also appears to qualify -if- according to
> the same criterion while -use- reads data in from a dataset file.  If you're
> loading a dataset from a disc file, _n is incremented as each observation's
> record is read into memory.  So, -if _n <= 37- will work, because _n will
> increase from zero to 37 as further records are loaded and _n == 1, 2, 3,
> etc. tests True as being no greater than 37.  But, starting from -clear-
> (with _n equal to zero), -if _n > 37- will never be True, because, as each
> candidate observation record is read into memory for testing of the
> condition, _n would only ever be equal to one, which is never greater than
> 37.  And because the condition tests as False at each test of -if _n > 37-,
> each successive candidate record in the file on disc will be rejected--no
> observation records will ever be read into memory.
>
> The same holds for -if inrange(_n, 2, 20)-; starting with _n equal to zero
> (empty in-memory dataset), _n will only be at most one as each successive
> record is read and tested for the truth of -inrange(_n, 2, 20)-.   _n will
> never be between 2 and 20 and so each successive candidate record will be
> rejected, leaving a dataset in memory of zero observations at the end.
>
> Joseph Coveney
>
> Sergiy Radyakin wrote:
>
>> In a different thread Dan Blanchette asked about cooperation of -in-
>> and -if-. I have asked myself a slightly different question: whether
>> specifying if-conditions can always substitute for in-conditions, e.g.
>> instead of "in #A/#B" one can type "if inrange(_n,#A,#B)".
>>
>> There seems to be a bug in -use- that gets confused by such a
>> condition. My colleague has suggested that this might happen because
>> Stata will qualify _n according to the current dataset in memory, but
>> qualify -if- for the dataset during the load. I was able to come up
>> with an example where it gets confused regardless of the current
>> dataset. It seems that the condition "larger" is not evaluated properly
>> in this case.
>>
>> *** bug with use ... if F(_n)
>> *** N(auto.dta)=74
>>
>> sysuse auto, clear
>> local fullauto `r(fn)'
>>
>> use `"`fullauto'"' in 1/37, clear
>> count
>> assert (_N==37)
>>
>> use `"`fullauto'"' in 38/74, clear
>> count
>> assert (_N==37)
>>
>> use `"`fullauto'"' if _n<=37, clear
>> count
>> assert (_N==37)
>>
>> use `"`fullauto'"' if _n>37, clear
>> count
>> *** the next assert fails: no observations get loaded
>> assert (_N==37)
>>
>> It is hard to understand what Stata will think of _n while loading
>> data, but it is definitely not the observation number.
>> Strangely the condition inrange(_n,1,20) loads 20 (twenty)
>> observations, but inrange(_n,2,20) loads 0 (zero).
>>
>> So if you ever try to work with large datasets in smaller portions,
>> slice them with an in-condition, not an if-condition!
>>
>> Stata MP for Windows, v10.1.551 born 02 Feb 2009 (currently the latest.
>> This recent update brings some very welcome changes: thank you!)
>>
>> Best regards, Sergiy Radyakin
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
>



© Copyright 1996–2014 StataCorp LP