Statalist



st: RE: Recode - a cautionary tale


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: Recode - a cautionary tale
Date   Thu, 17 Sep 2009 13:34:30 +0100

I agree with Allan that there's a cautionary tale here, but I am not
completely sure what Allan thinks it is, so let me try to summarize. 

First, let's underline, especially given Allan's title, that the
difficulties arose in using the -recode()- function, not the -recode-
command. (Some users, presumably because of experience outside Stata and
variations in terminology between software, are a little fuzzy on the
difference between commands and functions.) 

Second, Allan's colleague got bitten because 

gen byte LHS = <RHS> 

maps <RHS> that evaluates to 101 or more to missing. The punishment
here, unfortunately, was that she was given what she asked for, namely a
byte variable, with its own (documented) limits. (Stata's pretty weak on
"Are you sure?" messages.) 

As this would have happened regardless of what the <RHS> was, singling
out the -recode()- function is hardly the key issue. 
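
A minimal sketch of that point, with no -recode()- anywhere on the right-hand side. The variable names and values are invented for illustration; all it relies on is the documented byte range of -127 to 100.

```stata
clear
set obs 3
generate y = 90 + 10 * _n        // 100, 110, 120
generate byte b = y              // 110 and 120 do not fit in a byte
list y b                         // b should be missing wherever y > 100
```

Dropping -byte-, or asking for -int-, should keep all three values.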

Incidentally, I would always prefer to round explicitly using -floor()-
or -ceil()- because then I know without looking at any documentation --
and can control -- exactly what the limits are. (That "floor" means down
and "ceil" means up is something I can carry in my head.) Thus 

20 * floor(y/20) 

rounds down and 

20 * ceil(y/20) 

rounds up, both in intervals of 20. However, it is easy to see that
others may well prefer the flexibility of -recode()- or -egen, cut()-. 
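
For anyone who wants to see the two side by side, here is a small sketch on made-up data (the variable names are just for illustration):

```stata
clear
set obs 5
generate y = 25 * (_n - 1)            // 0, 25, 50, 75, 100
generate lower = 20 * floor(y/20)     // 0, 20, 40, 60, 100
generate upper = 20 * ceil(y/20)      // 0, 40, 60, 80, 100
list y lower upper
```

Values that are already multiples of 20 are left alone by both expressions.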

Nick 
[email protected] 

Allan Reese (Cefas)

A colleague used the recode function, following the example in
[U]25.1.2.  It reported some missing values, but she knew there were
some missing items.  Unfortunately some actual values also got recoded
as missing.
The command was:

  gen byte xcat = recode(x, 20, 40, 60, 80, 100, 120)

and the missing values should have been 120.  

[U]12.2.2 lists the ranges for each numeric type, which for byte is -127
to +100, but does not specify what should happen when an out of range
value is assigned. I've never had this problem because I'm too idle to
save a few bytes by specifying the type. ;-)
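
A small sketch that should reproduce the problem, assuming made-up values of x (not my colleague's data), with an -int- version alongside for comparison:

```stata
clear
set obs 6
generate x = 20 * _n - 5         // 15, 35, 55, 75, 95, 115
* same call as above: 120 is outside the byte range
generate byte xcat_byte = recode(x, 20, 40, 60, 80, 100, 120)
* int has room for 120
generate int  xcat_int  = recode(x, 20, 40, 60, 80, 100, 120)
list x xcat_byte xcat_int
```

Only the last observation should differ: missing in xcat_byte, 120 in xcat_int.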

Tech support point out that if you don't force Stata to use a "byte"
then it will gracefully detect the out of range values and automatically
promote to the correct storage type. "But when you specify -generate
byte- you are using the advanced syntax and telling Stata that you
really want it to stay a byte no matter what values you pass it." In my
opinion the advice in [U] 25.1.2 is too Delphic, and the comment that "we
(wisely) told Stata to generate the new variable as a byte" can be
deleted.
 
. clear

. set obs 3
obs was 0, now 3

. generate byte x = _n

. replace x = x + 200
x was byte now int
(3 real changes made)

. replace x = x + 40000
x was int now long
(3 real changes made)

. replace x = x + .5
x was long now double
(3 real changes made)

In giving advice, I had been thinking of the -recode- command rather than
the -recode()- function: the command makes it easier to handle end
intervals with min/max.  Another option is -egen- with cut(), whose
-icodes- option substitutes integer codes (0, 1, 2, ...) that can be
labelled with the cutpoint values.  Because those codes are small, they
are much less likely to overflow byte storage.
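
A rough sketch of both alternatives, again on made-up values of x. The new variable names and the cutpoints passed to at() are my own choices for illustration. I have used the -label- option, which as I read the -egen- help also switches on -icodes-.

```stata
clear
set obs 6
generate x = 20 * _n - 5         // 15, 35, 55, 75, 95, 115

* recode command: min and max take care of the end intervals
recode x (min/20 = 20) (20/40 = 40) (40/60 = 60) (60/80 = 80) ///
         (80/100 = 100) (100/max = 120), generate(xcat)

* egen, cut(): integer codes 0, 1, 2, ... labelled with the cutpoints
egen xgrp = cut(x), at(0 20 40 60 80 100 120) label

list x xcat xgrp
```

With -recode-, a value sitting exactly on a boundary is taken by the first rule that matches, so the order of the rules matters.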
