From | "Sarah Edgington" <sedging@ucla.edu> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | RE: RE: st: A bug in egen and gen? |
Date | Fri, 18 Feb 2011 10:36:44 -0800 |
Junlin, I can't speak to SPSS, but SAS certainly does not have the same constraints as Stata around dataset size. In SAS you are not loading the entire dataset into memory at once. Thus, for SAS datasets the main concern really is disk space, which, as has been noted, is cheap. The problems with load time and memory constraints associated with larger files won't bite you there the way they will with Stata.

I will note that, in general, I strongly object to the insistence that the way you want something to behave is the way everyone should want it to behave or, indeed, is best practice. You recommended that everyone set their default numeric type to double. That's a fine solution for your problem. However, you are solving a problem that I have never felt the need to solve. I do, however, regularly work with social science datasets large enough that even a 10% difference in size would often create problems for me. Plus, I so frequently work with decimal values of the sort where storing them as doubles would add no real precision, while taking up twice as much space, that I doubt most of my datasets would grow by only 10% if I followed your advice. Moreover, I have worked on shared network systems where disk space per user is much more limited than on a personal computer, so even were I not concerned about load times and memory usage, your recommended solution (to something that has never caused me problems) would cause all sorts of headaches in those cases. I'm glad that setting the default to double will solve a problem that you're having, but I think recommending that everyone go out and set their default to double ignores the great variety of ways that people use Stata. I respectfully counter-recommend that people think about the type of data they use, read the available information on numeric storage types, and adjust their options and code as necessary.

As for your suggested change to Stata's behavior, perhaps I am missing something about your argument, but I truly fail to see how Stata should "know" how you want your numeric data stored. Take the example of 4.1. It has already been shown that, from the computer's perspective, the float and double approximations of that value are not the same number. One could argue that no reasonable person would ever care about the difference between the two approximations and that Stata should obviously store the result as float. As soon as that's implemented, surely someone will come to the list insisting that their application really does require the precision of the double approximation. So perhaps Stata's gen and egen functions should always default to double for approximations with no exact binary representation. Except then those of us who use large datasets will be up in arms, because suddenly our calculations of means and other decimal values require much, much more storage space, loading time, and memory than they did previously. You can't make everyone happy. The solution of defaulting to a variable type that will, in the majority of cases, be sufficiently precise, while allowing the user to specify a more precise one if they need it, seems like the perfect compromise.

You have a particular problem that involves large integer values. You have found that the default behavior of Stata is insufficiently precise to meet your particular needs. Storing a number somewhat above 83 million in the two different ways gets you a difference of 3 between the types.
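For concreteness, here is a short do-file sketch of what I mean (the variable names are just placeholders, and the exact digits displayed will depend on the display format you use):

    clear
    set obs 1
    generate idfloat = 83085733           // default storage type is float: stored as 83085736
    generate double iddouble = 83085733   // explicit double: stored exactly as 83085733
    format idfloat iddouble %12.0f
    list idfloat iddouble

    display %20.0g float(4.1)             // float approximation, roughly 4.0999999046325684
    display %20.0g 4.1                    // double approximation, roughly 4.0999999999999996

    set type double, permanently          // the permanent default change you recommend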
You seem sure that your data is precise enough to begin with that this difference is meaningful. For many purposes (other than ID numbers) there isn't even a substantive difference between 83,000,000 and 83,085,700. For those purposes there certainly is not a substantive difference between 83,085,733 and 83,085,736. But for your purposes you think that is a real difference. Fine. There is already a solution at hand for solving your problem now. Given that you don't have concerns about dataset size, you can also permanently set your default in a way that will prevent the problem from biting you again in the future. Why should Stata change in a way that will not benefit users who do not use data like yours (or who do not believe their data is measured so precisely that a .000003611% change in their results matters), and that will indeed make life more difficult for many users, when you already have a solution? I honestly fail to see what you're angry about.

-Sarah

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Liao, Junlin
Sent: Friday, February 18, 2011 7:23 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: RE: st: A bug in egen and gen?

My original post was about data input and the -gen- and -egen- functions. What happened was that when I input large numbers, I got wrong data. Then of course the data type setting comes into play, followed by storage space issues. I think the sensible thing to do is for Stata to fix the -gen- and -egen- commands so that users do not need to specify a data type. Stata already has the capability to do it: Stata calculates with double precision and had the correct answer at hand, but presented it with the wrong data type. It's a simple suggestion. Setting the default type to double is a compromise. However, this compromise is also necessary because other data-importing procedures depend on it.

I do not have datasets in the millions, but I occasionally run datasets with observations in the hundreds of thousands. Yet I fail to see the advantage of saving 10% of storage space and memory. I do experience memory constraints in analysis, but I do not think a 10% reduction in dataset size would do anything about that. It may be different for someone who runs financial analyses, though. You may have most of your variables in decimals, and double could very well inflate your dataset to twice its size. But I doubt the majority of Stata users are in that camp.

When we talk about best practice, I think there is a best practice for the industry as well. I tested SAS, SPSS, and MS Access. None of them has the problem. MS Access, as a personal database, always defaults to double. SAS and SPSS only have a single numeric data type. They all get the numbers right without additional user input. Shouldn't they care about dataset size as well? I think they do. There may be a valid point in arguing for float where double would give you higher precision at the expense of storage space; however, my original problem is Stata setting a float type for what should be a long integer.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/