st: RE: RE: RE: Efficient coding with -replace-

From	"Elizabeth Allred" <[email protected]>
To	<[email protected]>
Subject	st: RE: RE: RE: Efficient coding with -replace-
Date	Sun, 05 Oct 2008 17:20:52 -0400

Hi Martin,

I'm assuming that Michael is making corrections to a data set he's worked with, and will work with, for some time--and that these particular fields are not modified in analysis. He's discovered the problems through updates from his data collectors or by reviewing distributions. Perhaps he's found that the month and year of birth for a subject were incorrect. Perhaps someone forgot to enter the value for failed. This sort of thing comes up frequently in the biomedical world.

Liz

>>> On 10/5/2008 at 3:09 PM, in message
<000501c9271d$d8723cf0$8956b6d0$@[email protected]>, "Martin Weiss"
<[email protected]> wrote:
> Could you explain this a little more extensively, Liz? How do you know in
> advance what you are changing from? Might it not be different a couple of
> months down the road? Or are you hinting at something else?
> 
> 
> HTH
> Martin
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Elizabeth Allred
> Sent: Sunday, October 05, 2008 8:39 PM
> To: [email protected] 
> Subject: st: RE: RE: Efficient coding with -replace-
> 
> More important than efficiency, I think, the do file is the document of your
> editing. The code referencing the id will be easy to understand when you
> look at it 6 months from now. I might even go one step further and include
> what you're changing FROM:
> 
> replace month = 1 if id==80 & month==4
> replace year =  1996 if id==80 & year==1995
> replace failed= 1 if id==80 & failed==.
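> 
> A stricter variant, offered only as a sketch: assert the old values before
> changing them, so that the do-file stops with an error if the data are not
> what you expected, rather than silently changing nothing. The values are
> the ones from the example above; that id 80 identifies exactly one
> observation is an assumption.
> 
> * stop unless exactly one observation has id 80 (an assumption)
> count if id==80
> assert r(N)==1
> * stop unless the values to be corrected are the ones recorded
> assert month==4 & year==1995 & failed==. if id==80
> replace month = 1 if id==80
> replace year =  1996 if id==80
> replace failed= 1 if id==80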
> 
> Liz
> 
>>>> On 10/5/2008 at 12:22 PM, in message
> <031173627889364697C50B3B266CBB8A01C08BB8@GEOGMAIL.geog.ad.dur.ac.uk>, "Nick
> Cox" <[email protected]> wrote:
>> Not so, or at least, it's more complicated than that. 
>> 
>> My short answer: On this information, Michael should leave his code as
>> is. 
>> 
>> My longer answer: 
>> 
>> First of all, the indirection of using a local macro is more or less
>> irrelevant to efficiency. In fact, if you recode as Martin suggested,
>> the code will be a smidgen _slower_, as Stata is obliged to store the
>> macro and then interpret it each time it is referenced. However, you
>> would have to strain to tell the difference in timings. But remember:
>> Stata is not a compiler! Interpretation always implies an overhead, just
>> that in many cases it is negligible. 
>> 
>> On a style point, I would not use a local macro in this example. I can't
>> see what real gain there is in terms of making the code more readable or
>> comprehensible, setting aside the efficiency issue. 
>> 
>> On a larger issue, -if- is always less efficient than an equivalent -in-
>> when there is a direct mapping between statements. What do I mean by
>> that? 
>> 
>> Suppose you know that there is a single observation, say 5890, for which
>> -id- is 80. 
>> 
>> Then you could and should code 
>> 
>> replace month = 1 in 5890
>> replace year =  1996 in 5890
>> replace failed= 1 in 5890
>> 
>> if efficiency were your only concern. Given a qualifier, -in 5890-,
>> Stata goes straight there, does the work, and bails out. Given a
>> qualifier, say -if id == 80-, Stata respects it the slow and stupid way
>> and tests every observation to see whether that condition is true or
>> false. (It never does the sort of smart thing that people are good at,
>> such as noticing whenever observations are ordered by -id- and taking
>> that into account.) So, for equivalent actions, -if- is much slower than
>> -in-.
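>> 
>> A rough way to see this for yourself, as a sketch on made-up data (the
>> million observations and the timer numbers 1 and 2 are arbitrary choices
>> for illustration, not anything from Michael's data): 
>> 
>> clear 
>> set obs 1000000 
>> gen long id = _n 
>> gen byte month = 4 
>> timer clear 
>> * timer 1: -if- tests every observation 
>> timer on 1 
>> replace month = 1 if id == 80 
>> timer off 1 
>> * timer 2: -in- goes straight to observation 80 
>> timer on 2 
>> replace month = 4 in 80 
>> timer off 2 
>> timer list 
>> 
>> Timings will vary by machine and Stata version, so treat any single run
>> as indicative only. 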
>> 
>> This principle is sometimes codified on Statalist, tongue in cheek, as
>> Blasnik's Law, because Michael Blasnik has done more than anyone else to
>> publicise it. 
>> 
>> However, 
>> 
>> 1. Efficiency should never be your only concern. Code with -if id == 80-
>> is much more transparent than code with -in 5890-. Also, get the
>> observation number wrong or mess up the sort order and you have
>> introduced a hard-to-find bug. 
>> 
>> 2. The "suppose" is a big one. How do you find out the observation
>> number if you don't know? You could do something like this 
>> 
>> * obsno is a new variable holding the observation number (-id- already exists) 
>> gen long obsno = _n 
>> su obsno if id == 80, meanonly  
>> assert r(min) == r(max) 
>> local where = r(min) 
>> replace month = 1 in `where' 
>> 
>> etc. 
>> 
>> But you can see there is a trade-off here. You have to do more work
>> beforehand to save work! In practice I would be most unlikely to bother.
>> In general being clever like this will not help much and might involve
>> extra work. Spending 2 minutes changing the code for 2 ms less machine
>> time is usually dopey unless you know that you are going to use that
>> code many, many times. 
>> 
>> 3. I've taken Michael literally in his implication that only a single
>> observation is involved. The test above 
>> 
>> assert r(min) == r(max) 
>> 
>> tests whether that is so. 
>> 
>> At worst, the observations satisfying the -if- don't occur in a single
>> block so that -in- is not applicable to the data as they stand. (In
>> principle, that is always fixed by -sort-ing. Again in practice, there
>> is a trade-off in that -sort-ing may take up considerable machine time
>> itself.) 
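>> 
>> As a sketch of the sorted case, assuming several observations share id 80
>> and that re-sorting the data is acceptable (-row-, -first- and -last- are
>> names made up for illustration): 
>> 
>> sort id 
>> gen long row = _n 
>> su row if id == 80, meanonly 
>> local first = r(min) 
>> local last = r(max) 
>> * after -sort-, the matching observations should form one block 
>> assert r(N) == `last' - `first' + 1 
>> replace month = 1 in `first'/`last' 
>> drop row 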
>> 
>> Nick
>> [email protected] 
>> 
>> (In a later post, Martin introduced what I think is another red herring
>> by talking about dialogs. If you care about machine time, don't use
>> dialogs.) 
>> 
>> Martin Weiss
>> 
>> -replace- expects "oldvar =exp", so no, I do not think there is a more
>> efficient way. Multiple instances of the same -if- qualifier always make
>> it advisable to throw it into a -local- 
>> 
>> local mycond " if id==80"
>> replace month = 1 `mycond'
>> replace year =  1996 `mycond'
>> replace failed= 1 `mycond'
>> 
>> Michael McCulloch
>> 
>> As part of a data audit, I'm recording some changes in my project 
>> do-file. Would there be a more efficient way to code the following 
>> changes, all of which involve the same observation?
>> 
>> replace month = 1 if id==80
>> replace year =  1996 if id==80
>> replace failed= 1 if id==80
>> 
>> 
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search 
>> *   http://www.stata.com/support/statalist/faq 
>> *   http://www.ats.ucla.edu/stat/stata/ 

