
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: A bug in egen and gen?


From   "Liao, Junlin" <junlin-liao@uiowa.edu>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: A bug in egen and gen?
Date   Fri, 18 Feb 2011 02:43:04 +0000

Unless someone changed the command, -compress- in my copy of Stata 11 does not convert double to float even when it could do so without loss. The documentation says the same. However, I can -recast- a variable to float when there is no loss of precision; otherwise I have to add the -force- option to convert with loss of precision.
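
What "no loss of precision" means for a double-to-float conversion can be illustrated outside Stata. The following is a minimal Python sketch, assuming ordinary IEEE 754 single and double precision; the function names are hypothetical and this is not how -recast- is implemented. It only shows which values survive a round trip through a 4-byte float unchanged.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (an 8-byte double) to the nearest 4-byte float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fits_in_float32(x: float) -> bool:
    """True when storing x as a 4-byte float would not change its value."""
    return to_float32(x) == x


# 0.5 and 1234.25 are exact in binary; 0.1 and 2**24 + 1 are not exact in float.
for value in (0.5, 1234.25, 0.1, 16777217.0):
    print(value, fits_in_float32(value))
```

Values such as 0.5 pass the check, while 0.1 and 16777217 do not, which is the situation in which a forced conversion would alter the stored number.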

A built-in option to compress before saving has its advantages. Of course I can program it myself, but that should not be necessary: if I have to remember to run a command I wrote, I can just as easily remember to type -compress-. My point is about user convenience.

The same can be said of the double vs. float issue. True, Stata does what you ask it to do. But do I really need to spell out every parameter of what I need? Do I need to tell Stata that I want the data stored as double when double is the only choice that records the data accurately? I do not think so. That's why I commented that Stata needs to improve.

Precision is not noise. At most it is simply a waste of storage space, but it does not detract from the value of the dataset. Noise is a negative thing. As I pointed out, in practice float saves only about 10% of file size, and that saving can easily be matched by more careful management of data.
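
The percentages behind this claim can be checked from the file sizes quoted further down the thread (135MB vs. 155MB for the full file, 119MB vs. 129MB after dropping some variables). A small Python sketch of the arithmetic, with a hypothetical helper name:

```python
def percent_increase(smaller_mb: float, larger_mb: float) -> float:
    """Relative growth from the smaller file size to the larger one, in percent."""
    return (larger_mb - smaller_mb) / smaller_mb * 100.0


# File sizes in MB as reported in the thread: float version first, double second.
full_file = percent_increase(135, 155)
trimmed_file = percent_increase(119, 129)

print(f"full file: {full_file:.1f}%")
print(f"trimmed file: {trimmed_file:.1f}%")
```

This gives roughly 14.8% for the full file and 8.4% for the trimmed one, consistent with the "14%" and "less than 10%" figures reported.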

One reason I make this recommendation is how easily it could be achieved technically. If -compress- is improved, then a simple automatic call of this command will do the job. A user can write his own program to do ANOVA, but should he have to? When Stata receives a request from a user, it should do the job to the best of its capability. In terms of importance, I think precision is in the domain of quality and storage is in the domain of quantity. I just cannot bring myself to agree with the proposition that we need to sacrifice precision for storage, especially since no such trade-off exists in the case where I first encountered this problem.

Because double vs. float can be solved by setting preferences, I do not include it in my recommendations, but I do want to show my logic here.

thanks,

Junlin
________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
Sent: Thursday, February 17, 2011 7:20 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: A bug in egen and gen?

The term "noise" was I think used by Maarten, but I agree with him and
Bill Gould on this issue.

It is unnecessary to suggest that -compress- maps -double-s to
-float-s when there is no loss of precision, because it will already
do so. In fact it will go all the way down to -byte- if there is no
loss of precision.
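
The idea of demoting a variable only when no value changes can be sketched as follows. This is a Python illustration, not -compress- itself: the function name is hypothetical, missing values are ignored, and the ranges are the limits of Stata's integer storage types as assumed here from its documentation on data types, so they should be verified there.

```python
# Assumed valid ranges of Stata's integer storage types (verify against the docs).
INTEGER_TYPES = [
    ("byte", -127, 100),
    ("int", -32767, 32740),
    ("long", -2147483647, 2147483620),
]


def smallest_integer_type(values):
    """Return the smallest integer type holding every value exactly, else None."""
    if not all(float(v).is_integer() for v in values):
        return None  # a non-integer value cannot be stored as an integer type
    low, high = min(values), max(values)
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= low and high <= type_max:
            return name
    return None  # integers too large even for long


print(smallest_integer_type([0.0, 1.0, 1.0]))
print(smallest_integer_type([0, 101]))
print(smallest_integer_type([0.5, 2.0]))
```

A 0/1 indicator held as double fits in byte under this rule, whereas a variable with any fractional value is left alone by the integer check.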

I don't think that adding a -compress- option to -save- is a
particularly important addition, because I already showed how you can
do that for yourself, but it would do no harm and it is an idea for
StataCorp to consider.

I think you are muddying the waters by raising issues of whether
people are being defensive of Stata, and so forth. That's not
material. For example, I have pointed out many mistakes in Stata
programs and documentation over the years, and I never defend
something I think is wrong. Also, I did not suggest that you have an
"attitude problem". The words and the inference are yours alone. My
point, standard in computing, is that a program does what you ask, not
what you want. If you want -double- results for -generate- or -egen-,
you merely have to ask for them in Stata, and this has been true for
many years. This is all apparent from reading the documentation. As I
pointed out in my first reply, you can program other possibilities for
yourself.

Nick

On Fri, Feb 18, 2011 at 12:50 AM, Liao, Junlin <junlin-liao@uiowa.edu> wrote:
> Sarah,
>
> I hope my discussion is a contribution to Stata rather than a detraction from it. Most of the users subscribed to this list are somewhat committed to Stata. We all wish for the improvement and perfection of Stata. But somehow I get the sense that it is not easy for people from Stata to recognize perspectives raised by users. I have suggested some feature changes: 1. an option for users to compress data before saving data files; 2. the -compress- command should recast double to float when variable values will not change. I hope someone is taking note of such issues.
>
> On whether to default to double or float, I am voicing what I see as the other side of the coin. In practice I see memory and storage as irrelevant, and there is a strong reason behind this. Apart from the exponential increase in computing capacity, there is a clear logic favoring my preference. Suppose memory and storage issues do arise in practice. I need to adjust the memory allocated to Stata from time to time, so I know when I have such a problem, and it gets corrected. But what about the instances outside "most of the time", when a double is required but a float is used? Will I get a warning that Stata generated inaccurate results? No. I would assume everything is OK and proceed with the analysis using wrong results. Nick used the word "noise" to describe the gain in precision from the double data type. I do not consider that appropriate. It is definitely not noise. Noise is unwanted and precision is desirable.
>
> We need to come back to Nick's first reply. He suggested that I had an attitude problem by demanding too much of Stata. Really? From Mr. Gould's article about precision, we know for sure that Stata uses double precision for calculation; if this were not true, I would be worried about the many statistical reports I have generated before. So we know Stata has the correct data at hand (for the -gen- and -egen- commands). However, because of its data type settings, Stata chooses to store the result in the wrong data type. Should Stata figure out the correct data type? It's a matter of opinion. But can Stata figure out the correct data type to store numeric values? My knowledge and experience of Stata are limited, but I'm 100% confident that Stata, or the programmers of Stata, can. It's simply a matter of willingness and attitude. With a defiant attitude, Stata cannot be that "smart". My no. 3 suggestion is indeed to make the -gen- and -egen- commands smart enough to figure out the CORRECT data type! If there is "noise", this suggestion will get rid of it.
>
> Let's talk about storage for a change. If Stata is so keen on keeping files small and making the most of memory, I do have another suggestion to increase efficiency at no cost in precision. In my experience, I have many variables stored as 0/1. The smallest data type for them in Stata is byte. One byte is 8 bits. However, for this type of data, we only need 2 bits (four times less). Would Stata be interested in creating a boolean data type? I'm not trying to suggest that Stata needs to create such a data type. I'm highlighting the point that storage may be too much of an excuse for not accepting suggestions for improvement.
>
> Best wishes
>
> Junlin
>
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Sarah Edgington [sedging@ucla.edu]
> Sent: Thursday, February 17, 2011 5:13 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: RE: st: A bug in egen and gen?
>
> Junlin,
> If it works for you then fine.  The problem is that doubling the size of the
> dataset also doubles the amount of memory required to open it.  Judging by
> the number of requests this list gets from people looking to increase the
> amount of memory Stata can use on their systems, that's much more likely to
> be a limiting factor than storage space.  Since, as Maarten notes, in most
> cases you're just increasing the size of your data with noise, for many
> users, particularly those with large data sets, your recommendation will
> actually make using Stata harder not easier.
> -Sarah
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Liao, Junlin
> Sent: Thursday, February 17, 2011 2:48 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: RE: st: A bug in egen and gen?
>
> I took two points from Bill Gould's article on precision:
>
> 1. Stata makes all calculations in double precision, and 2. float provides
> more than enough precision for most data applications
>
> What really matters comes down to storage. The experiments did not include
> decimal numbers; therefore, the final storage space would not differ. The
> point Nick tried to make is that not all numbers are integers, and in that
> case -compress- makes no difference, since it does not convert non-integer
> double values to float (even though it should when the recast does not
> change the variable values). I looked at my data files. The largest of them
> is a 135MB dataset. If I convert all float variables to double, I see a
> size increase of 14% (to 155MB). That is indeed a significant waste of
> storage. However, I would still argue for precision. Storage capacity
> increases as fast as memory does. In fact, I keep a few process variables
> for convenience. If I drop those variables, my file size falls to 129MB
> with all decimal variables stored as double and 119MB with all decimal
> variables stored as float (the difference is now less than 10%). I can
> reduce the size of my file further by getting rid of calculated fields.
> You can blame me for being careless in keeping my files. But in reality I
> have hundreds of gigabytes wasted every day (sitting there idle) anyway. My
> point is that, realistically, the storage factor is not that important.
>
> In the old days storage and memory mattered a lot. The programs were much
> smaller and probably more efficient. The fast increase in computing power is
> making the distinction between float and double numbers increasingly
> irrelevant. Float may indeed, as Bill says, provide more than enough
> precision for most data applications. However, if precision can be gained
> at negligible cost, why not? I seriously doubt that anyone running Stata
> today has a capacity constraint on their computer. I do a lot of data
> analysis; but still, before my next computer upgrade, the possibility of
> running out of disk space because I set my precision to double is zero.
> That's why I chose to change the default to double and recommend that
> others do so as well.
>
> Tx,
>
> Junlin
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Maarten buis
> Sent: Thursday, February 17, 2011 3:09 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: RE: st: A bug in egen and gen?
>
> --- On Thu, 17/2/11, Liao, Junlin wrote:
>> I just fail to see your point that "You can get some of that back by
>> -compress-, but not all". My experiment clearly proves that what
>> matters is the "final" storage data type. I understand that using
>> double in place of float or long will increase memory requirements.
>> My point is that computing power is increasing exponentially. For
>> example, every computer I use has at least 4GB of memory.
>> The machine I run Stata on has 8GB. Memory is the least of my concerns,
>> but accuracy is always important.
>
> If you store real data as double you are trying to regain accuracy that does
> not exist in your data. All you have done is doubled the size of your file
> to store random noise. -compress- will only help avoid this if your
> variables are all integers.
>
> There are situations where storing or generating variables as doubles makes
> sense, but they are the exception, not the rule.
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/




