
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: RE: st: A bug in egen and gen?


From   "Liao, Junlin" <junlin-liao@uiowa.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   RE: RE: st: A bug in egen and gen?
Date   Fri, 18 Feb 2011 19:02:15 +0000

Sarah,

I think we have turned the page on double vs. float. As long as we can set our own options, that is fine with everyone. I have not been angry about anything; I just voiced my opinion. While people have been trying to educate me on the difference between double and float, they have generally neglected the real problem/challenge/nuisance with the Stata commands. My original question was about the inaccuracy of large integers because Stata stored them as float. Here there is no size saving of the kind at stake in double vs. float. The float and double issue came up later, and I only voiced my perspective, which is particularly relevant for people working with smaller datasets (fewer than millions of observations) and only a handful of decimal numeric variables. I can see why people want to default to float. I wish others could appreciate my choice of double as well. There is really nothing to argue about there. I am only defending my choice while being told that it is a bad one. I hope our discussion can focus on the original issue for now. Wouldn't it be nice if Stata stored large integers as long instead of inaccurately as float? I do not need to be reminded again that I can force the type to be long. My point is that Stata could be smarter on this particular issue.
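A minimal sketch of the point about large integers, written in Python rather than Stata and assuming only that Stata's float is an IEEE 754 single-precision value and that long is a 4-byte integer. The helper names (`to_float32`, `float_holds_exactly`) are illustrative and not part of any Stata or Python API.

```python
import struct

def to_float32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

def float_holds_exactly(n):
    """True if the integer n survives a round trip through single precision."""
    return to_float32(n) == n

# A single-precision float has a 24-bit significand, so every integer up to
# 2**24 is representable, but not every integer beyond it.
FLOAT_EXACT_LIMIT = 2 ** 24  # 16,777,216

# A 4-byte float and a 4-byte integer occupy the same space.
float_width = struct.calcsize("<f")
int32_width = struct.calcsize("<i")

print(float_width, int32_width)
print(float_holds_exactly(FLOAT_EXACT_LIMIT))
print(float_holds_exactly(FLOAT_EXACT_LIMIT + 1))
```

Both types take 4 bytes, so the choice between them for integer data is about exactness, not storage.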

Tx,

Junlin

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sarah Edgington
Sent: Friday, February 18, 2011 12:37 PM
To: statalist@hsphsun2.harvard.edu
Subject: RE: RE: st: A bug in egen and gen?

Junlin,
I can't speak to SPSS, but SAS certainly does not have the same constraints as Stata around dataset size.  In SAS you are not loading the entire dataset into memory at once.  Thus, for SAS datasets the main concern really is disk space, which, as has been noted, is cheap.  The problems with load time and memory constraints associated with larger files won't bite you there the way they will with Stata.

I will note that, in general, I strongly object to the insistence that the way you want something to behave is the way everyone should want it to behave or, indeed, the best practice.  You recommended that everyone set their default numeric type to double.  That's a fine solution for your problem.  However, you are solving a problem that I have never felt the need to solve.  I do, however, regularly work with social science data sets large enough that even a 10% difference in size would often create problems for me.  Plus, I so frequently work with decimal values of the sort where storing them as doubles would add no real precision while taking up twice as much space that I doubt most of my datasets would increase in size by only 10% if I followed your advice.  Moreover, I have worked on shared network systems where disk space per user is much more limited than on a personal computer so, even were I not concerned about load times and memory usage, your recommended solution--to something that has never caused me problems--would cause all sorts of headaches in those cases.
I'm glad that setting the default to double will solve a problem that you're having, but I think recommending that everyone go out and set their default to double ignores the great variety of ways that people use Stata.  I respectfully counter-recommend that people think about the type of data they use, read the available information on numeric storage types, and adjust their options and code as necessary.

As for your suggested change to Stata's behavior, perhaps I am missing something about your argument, but I truly fail to see how Stata should "know" how you want your numeric data stored.  Let's take the example of 4.1.  It has already been shown that, from the computer's perspective, the float and double approximations of that value are not the same number.  One could argue that no reasonable person would ever care about the difference between the two approximations and that Stata should obviously store the result as float.  As soon as that's implemented, surely someone will come to the list insisting that their application really does require the precision of the double approximation.  So perhaps Stata's gen and egen functions should always default to double for values with no exact binary representation.  Except then you'll have those of us who use large datasets up in arms because suddenly our calculations of means and other decimal values require much, much more storage space, loading time, and memory than they did previously.  You can't make everyone happy.  The solution of defaulting to a variable type that will, in the majority of cases, be sufficiently precise, while allowing the user to specify a more precise one if they need it, seems the perfect compromise.
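The 4.1 example can be checked outside Stata. The following is a small Python sketch, assuming that Stata's float and double correspond to IEEE 754 single and double precision; `to_float32` is an illustrative helper, not a library function.

```python
import struct

def to_float32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

as_double = 4.1              # Python floats are IEEE 754 doubles
as_float = to_float32(4.1)   # nearest single-precision value to 4.1

# Neither value is exactly 4.1, and the two approximations differ.
gap = abs(as_float - as_double)

print(as_float == as_double)
print(gap)
```

The gap is well below one part in ten million of the value, which is why it is invisible in most applications yet still makes the two stored values compare as unequal.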

You have a particular problem that involves large integers.  You have found that the default behavior of Stata is insufficiently precise to meet your particular needs.  Storing a number somewhat above 83 million in the two different ways gets you a difference of 3 between the types.  You seem sure that your data are precise enough to begin with that this difference is meaningful.  For many purposes (other than ID numbers) there isn't even a substantive difference between 83,000,000 and 83,085,700.  For those purposes there certainly is not a substantive difference between 83,085,733 and 83,085,736.  But for your purposes you think that is a real difference.
Fine.  There is already a solution at hand for solving your problem now.
Given that you don't have concerns about dataset size, you can also permanently set your default in a way that will prevent the problem from biting you again in the future.  Why should Stata change in a way that will not benefit users who do not use data like yours (or who do not believe their data are measured so precisely that a .000003611% change in their results matters) and will indeed make life more difficult for many users, when you already have a solution?  I honestly fail to see what you're angry about.
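The two figures quoted above (a difference of 3, and a relative change of .000003611%) can be reproduced with a short Python sketch, assuming Stata's float behaves as an IEEE 754 single-precision value at this magnitude. The helper name `to_float32` is illustrative.

```python
import struct

def to_float32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

value = 83_085_733

# Between 2**26 and 2**27, single-precision values are spaced 8 apart,
# so the stored value is the nearest multiple of 8.
stored = int(to_float32(value))
diff = stored - value
rel_pct = diff / value * 100

print(stored)
print(diff)
print(f"{rel_pct:.9f}%")
```

Whether a discrepancy of that size matters depends on the use: it is negligible for a measured quantity but fatal for an identifier, which is the disagreement in this thread.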

-Sarah

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Liao, Junlin
Sent: Friday, February 18, 2011 7:23 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: RE: st: A bug in egen and gen?

My original post was about data input and the -gen- and -egen- functions. What happened was that when I input large numbers, I got wrong data. Then of course the data type setting came into play, followed by storage space issues. I think the sensible thing would be for Stata to fix the -gen- and -egen- commands so that users do not need to specify a data type. Stata already has the capability to do it: it calculates in double precision and had the correct answer at hand, but presented it in the wrong data type. It's a simple suggestion. Setting the type to double is a compromise.
However, this compromise is also necessary because other data import procedures depend on it.

I do not have datasets with millions of observations, but I occasionally run datasets with hundreds of thousands. Yet I fail to see the advantage of saving 10% of storage space and memory. I do experience memory constraints in analysis, but I do not think a 10% reduction in dataset size would do anything about them. It may be different for someone who runs financial analyses, though. You may have most of your variables in decimal form, and double could very well inflate your dataset to double its size. But I doubt the majority of Stata users are in that camp.
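How much a dataset grows when floats become doubles depends on the mix of variable types. The Python sketch below uses Stata's numeric storage widths (byte 1, int 2, long 4, float 4, double 8 bytes); the two variable mixes are hypothetical, chosen only to illustrate the 10% and the doubling cases mentioned in this thread, and strings are ignored.

```python
# Bytes per observation for each Stata numeric storage type.
WIDTH = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}

def row_bytes(counts):
    """Bytes per observation for a dict mapping type name -> variable count."""
    return sum(WIDTH[t] * n for t, n in counts.items())

def growth_if_floats_become_doubles(counts):
    """Fractional growth in row width when every float is stored as double."""
    before = row_bytes(counts)
    promoted = dict(counts)
    promoted["double"] = promoted.get("double", 0) + promoted.pop("float", 0)
    return (row_bytes(promoted) - before) / before

# Hypothetical mixes, not taken from anyone's real data.
mostly_integer = {"byte": 20, "int": 10, "long": 8, "float": 2}
all_decimal = {"float": 10}

mostly_integer_growth = growth_if_floats_become_doubles(mostly_integer)
all_decimal_growth = growth_if_floats_become_doubles(all_decimal)

print(f"{mostly_integer_growth:.0%}")
print(f"{all_decimal_growth:.0%}")
```

The same default therefore costs very different amounts for different users, which is why the size of the penalty cannot be stated as a single figure.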

When we talk about best practice, I think there is a best practice for the industry as well. I tested SAS, SPSS, and MS Access. None of them has the problem. MS Access, as a personal database, always defaults to double. SAS and SPSS have only one numeric data type. They can all store the numbers accurately without additional user input. Shouldn't they care about dataset size as well? I think they do. There may be a valid argument for float where double would give you higher precision at the expense of storage space; however, my original problem is Stata choosing float for what should be a long integer.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/




