From | "Sarah Edgington" <sedging@ucla.edu> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | RE: RE: st: A bug in egen and gen? |
Date | Fri, 18 Feb 2011 10:36:44 -0800 |
Junlin, I can't speak to SPSS, but SAS certainly does not have the same constraints as Stata around dataset size. In SAS you are not loading the entire dataset into memory at once. Thus, for SAS datasets the main concern really is disk space, which, as has been noted, is cheap. The problems with load time and memory constraints associated with larger files won't bite you there the way they will with Stata.

I will note that, in general, I strongly object to the insistence that the way you want something to behave is the way everyone should want it to behave or, indeed, is best practice. You recommended that everyone set their default numeric type to double. That's a fine solution for your problem. However, you are solving a problem that I have never felt the need to solve. I do, however, regularly work with social science datasets large enough that even a 10% difference in size would often create problems for me. Plus, I so frequently work with decimal values of the sort where storing them as doubles would add no real precision, while taking up twice as much space, that I doubt most of my datasets would grow by only 10% if I followed your advice. Moreover, I have worked on shared network systems where disk space per user is much more limited than on a personal computer, so even were I not concerned about load times and memory usage, your recommended solution (to something that has never caused me problems) would cause all sorts of headaches in those cases. I'm glad that setting the default to double will solve a problem that you're having, but I think recommending that everyone go out and set their default to double ignores the great variety of ways that people use Stata. I respectfully counter-recommend that people think about the type of data they use, read the available information on numeric storage types, and adjust their options and code as necessary.

As for your suggested change to Stata's behavior, perhaps I am missing something about your argument, but I truly fail to see how Stata should "know" how you want your numeric data stored. Take the example of 4.1. It has already been shown that, from the computer's perspective, the float and double approximations of that value are not the same number. One could argue that no reasonable person would ever care about the difference between the two approximations and that Stata should obviously store the result as float. As soon as that's implemented, surely someone will come to the list insisting that their application really does require the precision of the double approximation. So perhaps Stata's gen and egen functions should always default to double for approximations with no exact binary representation. Except then those of us who use large datasets will be up in arms, because suddenly our calculations of means and other decimal values require much, much more storage space, loading time, and memory than they did previously. You can't make everyone happy. The solution of defaulting to a variable type that will, in the majority of cases, be sufficiently precise, while allowing the user to specify a more precise one if they need it, seems like the perfect compromise.

You have a particular problem that involves large integer values. You have found that the default behavior of Stata is insufficiently precise to meet your particular needs. Storing a number somewhat above 83 million in the two different ways gets you a difference of 3 between the types.
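For concreteness, here is a short do-file sketch of what I mean (the variable names are just placeholders, and the exact digits displayed will depend on the display format you use):

    clear
    set obs 1
    generate idfloat = 83085733           // default storage type is float: stored as 83085736
    generate double iddouble = 83085733   // explicit double: stored exactly as 83085733
    format idfloat iddouble %12.0f
    list idfloat iddouble

    display %20.0g float(4.1)             // float approximation, roughly 4.0999999046325684
    display %20.0g 4.1                    // double approximation, roughly 4.0999999999999996

    set type double, permanently          // the permanent default change you recommend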
You seem sure that your data is precise enough to begin with that this difference is meaningful. For many purposes (other than ID numbers) there isn't even a substantive difference between 83,000,000 and 83,085,700. For those purposes there certainly is not a substantive difference between 83,085,733 and 83,085,736. But for your purposes you think that is a real difference. Fine. There is already a solution at hand for solving your problem now. Given that you don't have concerns about dataset size, you can also permanently set your default in a way that will prevent the problem from biting you again in the future. Why should Stata change in a way that will not benefit users who do not use data like yours (or who do not believe their data is measured so precisely that a .000003611% change in their results matters), and that will indeed make life more difficult for many users, when you already have a solution? I honestly fail to see what you're angry about.

-Sarah

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Liao, Junlin
Sent: Friday, February 18, 2011 7:23 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: RE: st: A bug in egen and gen?

My original post was about data input and the -gen- and -egen- functions. What happened was that when I input large numbers, I got wrong data. Then of course the data type setting comes into play, followed by storage space issues. I think the sensible thing to do is for Stata to fix the -gen- and -egen- commands so that users do not need to specify a data type. Stata already has the capability to do it: Stata calculates with double precision and had the correct answer at hand, but presented it with the wrong data type. It's a simple suggestion. Setting the default type to double is a compromise. However, this compromise is also necessary because other data-importing procedures depend on it.

I do not have datasets in the millions, but I occasionally run datasets with observations in the hundreds of thousands. Yet I fail to see the advantage of saving 10% of storage space and memory. I do experience memory constraints in analysis, but I do not think a 10% reduction in dataset size would do anything about that. It may be different for someone who runs financial analyses, though. You may have most of your variables in decimals, and double could very well inflate your dataset to twice its size. But I doubt the majority of Stata users are in that camp.

When we talk about best practice, I think there is a best practice for the industry as well. I tested SAS, SPSS, and MS Access. None of them has the problem. MS Access, as a personal database, always defaults to double. SAS and SPSS only have a single numeric data type. They all get the numbers right without additional user input. Shouldn't they care about dataset size as well? I think they do. There may be a valid point in arguing for float where double would give you higher precision at the expense of storage space; however, my original problem is Stata setting a float type for what should be a long integer.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/