
RE: st: A bug in egen and gen?


From   "Sarah Edgington" <[email protected]>
To   <[email protected]>
Subject   RE: st: A bug in egen and gen?
Date   Thu, 17 Feb 2011 15:13:12 -0800

Junlin,
If it works for you then fine.  The problem is that doubling the size of the
dataset also doubles the amount of memory required to open it.  Judging by
the number of requests this list gets from people looking to increase the
amount of memory Stata can use on their systems, memory is much more likely
to be the limiting factor than storage space.  Since, as Maarten notes, in
most cases you're just increasing the size of your data with noise, your
recommendation will actually make using Stata harder, not easier, for many
users, particularly those with large datasets.
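A quick way to see the doubling, sketched with made-up variable names:

    clear
    set obs 1000000
    generate float  xf = runiform()   // 4 bytes per observation
    generate double xd = runiform()   // 8 bytes per observation
    describe                          // compare storage types and dataset size

Each double variable takes twice the width of its float counterpart, in
memory as well as on disk.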
-Sarah

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Liao, Junlin
Sent: Thursday, February 17, 2011 2:48 PM
To: [email protected]
Subject: RE: st: A bug in egen and gen?

I took two points from Bill Gould's article on precision:

1. Stata makes all calculations in double precision, and
2. float provides more than enough precision for most data applications.
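Both points can be checked in Stata directly; here is a minimal sketch (the
variable name is arbitrary):

    clear
    set obs 10
    generate x = 0.1            // stored as float by default
    count if x == 0.1           // 0: the literal 0.1 is evaluated in double
    count if x == float(0.1)    // 10: round the literal to float and they match

The comparison succeeds only after the double literal is rounded to float,
which is the float/double mismatch behind the thread's original question.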

What really matters comes down to storage. The experiments did not include
decimal numbers, so the final storage space would not differ. The point Nick
tried to make is that not all numbers are integers, and in that case
-compress- would not make any difference, since -compress- does not convert
decimal double variables to float (even though, arguably, it should when the
recast would not change the values). I looked at my data files. The largest
of them all is a 135MB dataset. If I convert all float numbers to double, the
size increases by 14% (to 155MB). That is indeed a significant waste of
storage. However, I would still argue for precision. Storage capacity
increases as fast as memory does. In fact I keep a few process variables for
convenience. If I drop those variables, my file size can be reduced to 129MB
with all decimal variables stored as double and 119MB with all of them stored
as float (the difference shrinks to less than 10%). I could reduce the size
of my file further by getting rid of calculated fields. You can blame me for
being careless with my files, but in reality I have hundreds of gigabytes
sitting idle every day anyway. My point is that, realistically, storage is
not that important a factor.
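To illustrate the -compress- behavior in question, a small sketch with
made-up variable names:

    clear
    set obs 1000
    generate double xint = floor(100*runiform())   // integer-valued double
    generate double xdec = runiform()              // fractional double
    compress
    describe xint xdec   // xint is demoted to an integer type; xdec stays double

-compress- demotes the integer-valued double but leaves the fractional one
alone, because it never recasts a variable in a way that could lose precision.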

In the old days storage and memory mattered a lot; programs were much smaller
and probably more efficient. The fast increase in computing power is making
the distinction between float and double increasingly irrelevant. As for
Bill's view, float may indeed provide more than enough precision for most
data applications; however, if precision can be gained at negligible cost,
why not? I seriously doubt anyone running Stata today has a capacity
constraint on their computer. I do a lot of data analysis, but still, before
my next computer upgrade, the probability of running out of disk space
because I set my precision to double is zero. That's why I chose to change
the default to double, and I recommend that others do so as well.
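For anyone who wants to do the same, the default storage type for newly
created variables is changed with:

    set type double, permanently

Omit -permanently- to change the default for the current session only.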

Tx,

Junlin

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Maarten buis
Sent: Thursday, February 17, 2011 3:09 PM
To: [email protected]
Subject: RE: st: A bug in egen and gen?

--- On Thu, 17/2/11, Liao, Junlin wrote:
> I just fail to see your point: "You can get some of that back by
> -compress-, but not all". My experiment clearly shows that what
> matters is the "final" storage data type. I understand that using
> double in place of float or long will increase memory requirements.
> My point is that computing power is increasing exponentially. For
> example, any computer I use has at least 4GB of memory. The machine
> I load with Stata has 8GB. Memory is the least of my concerns, but
> accuracy is always important.

If you store real data as double you are trying to regain accuracy that does
not exist in your data. All you have done is doubled the size of your file
to store random noise. -compress- will only help avoid this if your
variables are all integers.

There are situations where storing or generating variables as doubles makes
sense, but they are the exception, not the rule.
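A small illustration of how little real information the extra width buys for
typical measured data:

    display %21.18f float(0.1)   // the float nearest 0.1
    display %21.18f 0.1          // the double nearest 0.1

A float already carries about 7 significant decimal digits; if the underlying
measurement is only good to 3 or 4, the extra digits a double stores are pure
noise.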

-- Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------





*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

