Statalist



st: RE: RE: egen and spontaneously changing numbers


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: egen and spontaneously changing numbers
Date   Wed, 20 May 2009 17:59:48 +0100

Some readings: 

FAQ     . . . . . . . . . . . . . . . . . . . Results of the mod(x,y) function
        . . . . . . . . . . . . . . . . . . . . . N. J. Cox and T. J. Steichen
        2/03    Why does the mod(x,y) function sometimes give
                puzzling results?
                Why is mod(0.3,0.1) not equal to 0?
                http://www.stata.com/support/faqs/data/mod.html

FAQ     . . . . . . . . . . . . . . . . .  The accuracy of the float data type
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. Gould
        5/01    How many significant digits are there in a float?
                http://www.stata.com/support/faqs/data/prec.html

FAQ     . . . . . . . . .  Why am I losing precision with large whole numbers?
        . . . . . . . . . . . . . . . . . .  UCLA Academic Technology Services
        7/08    http://www.ats.ucla.edu/stat/stata/faq/longid.htm

SJ-8-2  pr0038  Mata Matters: Overflow, underflow & IEEE floating-point format
        . . . . . . . . . . . . . . . . . . . . . . . . . . . .  J. M. Linhart
        Q2/08   SJ 8(2):255--268                                 (no commands)
        focuses on underflow and overflow and details of how
        floating-point numbers are stored in the IEEE 754
        floating-point standard

SJ-6-4  pr0025  . . . . . . . . . . . . . . . . . . .  Mata matters: Precision
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. Gould
        Q4/06   SJ 6(4):550--560                                 (no commands)
        looks at programming implications of the floating-point,
        base-2 encoding that modern computers use

SJ-6-2  dm0022  . Tip 33: Sweet sixteen: Hexadec. formats & precision problems
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q2/06   SJ 6(2):282--283                                 (no commands)
        tip for using hexadecimal formats to understand precision
        problems in Stata

Nick 
n.j.cox@durham.ac.uk 

Nick Cox

You have a precision problem. By default -egen- will generate -float-
variables with the functions you are using. To keep every digit in the
integers you are playing with you need to spell out that you want a
-long- or -double-. A -float- has only 24 bits of precision, so it
cannot hold every integer above 16,777,216 exactly, and your 8-digit
identifiers are larger than that.

I can't follow your code, which seems to go back and forth between
string and numeric results, nor do I know what MEPS means. I guess
there's a much simpler way to do what you want without using -egen- at
all, but the issue that is biting you is illustrated thus:

. set obs 1
obs was 0, now 1

. gen long myin  = 40002015

. egen myout = max(myin)

. egen long myout2 = max(myin)

. format myout* %12.0f

. l

     +--------------------------------+
     |     myin      myout     myout2 |
     |--------------------------------|
  1. | 40002015   40002016   40002015 |
     +--------------------------------+
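
Why 40002016 in particular? What follows is a sketch of the arithmetic
rather than anything from the original exchange. It relies on Stata's
float() function, which rounds its argument to -float- precision. A
-float- carries 24 bits of precision, so every integer up to 2^24 =
16,777,216 is exact, but between 2^25 and 2^26 (33,554,432 to
67,108,864) only multiples of 4 can be stored. The three identifiers in
the test data all fall in that range.

```stata
* float() rounds its argument to the precision of the float storage type

* first integer above 2^24 that a float cannot hold: shows 16777216
display %12.0f float(16777217)

* the three identifiers from the test data
* shows 40002016 (nearest multiple of 4)
display %12.0f float(40002015)
* shows 40010008 (exactly halfway between 40010008 and 40010012)
display %12.0f float(40010010)
* shows 41011144 (already a multiple of 4, so unchanged)
display %12.0f float(41011144)
```

This matches the three passes reported in the original post: the first
identifier moves up by 1, the second moves down by 2, and the third
survives only because it happens to be a multiple of 4.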

Nick 
n.j.cox@durham.ac.uk 

Matt Rutledge

Using Stata 10, I'm attempting to assign one person's identifier  
(DUPERSID from the MEPS dataset) to every person in the sample,  
repeating for each of the N people in my sample.  The code I'm using  
seems to work, except that it spontaneously changes one digit of the  
identifier.

To illustrate, I've created this dummy dataset:
dupersid	date	x
40002015	19990101	1
40002015	19990201	0
40002015	19990301	0
40010010	19990101	0
40010010	19990201	1
40010010	19990301	1
41011144	19990101	1
41011144	19990201	0
41011144	19990301	1
and called it test.txt.

I then read in this dataset, and attempt to assign each observation  
the dupersid 40002015.  In turn, I'll also want to assign all of them  
the identifier 40010010, and finally 41011144.  So I do a forvalues  
loop:

set more off
insheet using test.txt, names
tostring dupersid, replace
rename dupersid dupersidsave
bysort dupersidsave: gen first = 1 if _n==1
replace first = 0 if first==.
summ first
local N = r(N)*r(mean)
forvalues j = 1/`N' {
	preserve
	gsort -first dupersidsave
	gen dupers = dupersidsave if _n==`j' & first==1
	destring dupers, replace
	egen dupersid = max(dupers)
	tostring dupersid, replace
	gsort dupersidsave -first
	list dupers*
	des
	restore
	}

****
Here's the output.  Please note that on the first pass through the  
loop, the identifier changes from 40002015 to 40002016.  On the second  
pass, the identifier changes from 40010010 to 40010008.  The third  
pass is fine.  Any ideas why this might be?  Using "egen, total" or  
"egen, mean" doesn't seem to help, nor does destringing the identifier  
at different points along the way.  Also, I get the same error running  
it without a loop (replace `j' with 1, for instance, and the  
identifier still spontaneously changes).

(8 real changes made, 8 to missing)
dupers already numeric; no replace
dupersid was float now str8

      +--------------------------------+
      | dupers~e     dupers   dupersid |
      |--------------------------------|
   1. | 40002015   4.00e+07   40002016 |
   2. | 40002015          .   40002016 |
   3. | 40002015          .   40002016 |
   4. | 40010010          .   40002016 |
   5. | 40010010          .   40002016 |
      |--------------------------------|
   6. | 40010010          .   40002016 |
   7. | 41011144          .   40002016 |
   8. | 41011144          .   40002016 |
   9. | 41011144          .   40002016 |
      +--------------------------------+

Contains data
   obs:             9
  vars:             6
  size:           261 (99.9% of memory free)
-------------------------------------------------------------------------------
               storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
dupersidsave    long   %12.0g
date            long   %12.0g
x               byte   %8.0g
first           float  %9.0g
dupers          float  %9.0g
dupersid        str8   %9s
-------------------------------------------------------------------------------
Sorted by:  dupersidsave
      Note:  dataset has changed since last saved
(8 real changes made, 8 to missing)
dupers already numeric; no replace
dupersid was float now str8

      +--------------------------------+
      | dupers~e     dupers   dupersid |
      |--------------------------------|
   1. | 40002015          .   40010008 |
   2. | 40002015          .   40010008 |
   3. | 40002015          .   40010008 |
   4. | 40010010   4.00e+07   40010008 |
   5. | 40010010          .   40010008 |
      |--------------------------------|
   6. | 40010010          .   40010008 |
   7. | 41011144          .   40010008 |
   8. | 41011144          .   40010008 |
   9. | 41011144          .   40010008 |
      +--------------------------------+

Contains data
   obs:             9
  vars:             6
  size:           261 (99.9% of memory free)
-------------------------------------------------------------------------------
               storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
dupersidsave    long   %12.0g
date            long   %12.0g
x               byte   %8.0g
first           float  %9.0g
dupers          float  %9.0g
dupersid        str8   %9s
-------------------------------------------------------------------------------
Sorted by:  dupersidsave
      Note:  dataset has changed since last saved
(8 real changes made, 8 to missing)
dupers already numeric; no replace
dupersid was float now str8

      +--------------------------------+
      | dupers~e     dupers   dupersid |
      |--------------------------------|
   1. | 40002015          .   41011144 |
   2. | 40002015          .   41011144 |
   3. | 40002015          .   41011144 |
   4. | 40010010          .   41011144 |
   5. | 40010010          .   41011144 |
      |--------------------------------|
   6. | 40010010          .   41011144 |
   7. | 41011144   4.10e+07   41011144 |
   8. | 41011144          .   41011144 |
   9. | 41011144          .   41011144 |
      +--------------------------------+

Contains data
   obs:             9
  vars:             6
  size:           261 (99.9% of memory free)
-------------------------------------------------------------------------------
               storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
dupersidsave    long   %12.0g
date            long   %12.0g
x               byte   %8.0g
first           float  %9.0g
dupers          float  %9.0g
dupersid        str8   %9s
-------------------------------------------------------------------------------
Sorted by:  dupersidsave
      Note:  dataset has changed since last saved
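
A possible simplification, offered as a sketch and not as a tested
replacement for the loop above. Judging from the -describe- output,
dupers is stored as -float-, and the -tostring- message shows the
-egen- result was -float- as well, which is where the digits are lost.
If the identifier is left numeric as -insheet- read it, -levelsof- can
collect the distinct values and -generate long- can assign each one
without rounding. The variable name dupersid_all and the local macro
name ids are made up for illustration; test.txt is the file from the
post.

```stata
* Sketch: assumes dupersid is read as a numeric (long) variable
insheet using test.txt, names clear

* distinct identifiers, in ascending order, stored in local macro ids
levelsof dupersid, local(ids)

foreach id of local ids {
    preserve
    * -long- holds 8-digit integers exactly; the default -float- does not
    generate long dupersid_all = `id'
    list dupersid dupersid_all
    restore
}
```

If the original loop is to be kept, the smaller change would be to
write -gen long dupers- and -egen long dupersid- so that neither
variable falls back to -float-.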

