RE: RE: st: A bug in egen and gen?

From	"Liao, Junlin" <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: RE: st: A bug in egen and gen?
Date	Fri, 18 Feb 2011 15:22:48 +0000

My original post was about data input with the -gen- and -egen- commands. When I input large numbers, I got wrong data. Then, of course, the data type setting came into play, followed by storage space issues. I think the sensible thing would be for Stata to fix -gen- and -egen- so that users do not need to specify a data type. Stata already has the capability to do this: it calculates in double precision and has the correct answer at hand, but stores it in the wrong data type. It's a simple suggestion. Setting the type to double is a compromise. However, this compromise is also necessary because other data-importing procedures depend on it.
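
To make the problem concrete, here is a minimal sketch of the kind of thing I mean. The value 123456789 and the variable names are made up for illustration; they are not from my actual data. With the default type of float, the first variable should come back as 123456792 rather than 123456789, while the long and double versions keep the value as typed.

```stata
* Illustrative sketch only: the value and variable names are hypothetical
clear
set obs 1

* float is Stata's default storage type for new variables
set type float

* No type given, so the default (float) is used
generate id_default = 123456789

* Explicit storage types that can hold a 9-digit integer exactly
generate long id_long = 123456789
generate double id_double = 123456789

* Show all digits so the rounding in id_default is visible
format id_default id_long id_double %12.0f
list
```

A float carries only about 7 significant decimal digits, so a 9-digit integer is rounded to the nearest value a float can hold.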

I do not have datasets with millions of observations, but I occasionally run datasets with hundreds of thousands. Yet I fail to see the advantage of saving 10% of storage space and memory. I do experience memory constraints in analysis, but I do not think a 10% reduction in dataset size would do anything about them. It may be different for someone who runs financial analyses, though. If most of your variables are decimals, double could very well inflate your dataset to twice its size. But I doubt the majority of Stata users are in that camp.

When we talk about best practice, I think there is a best practice for the industry as well. I tested SAS, SPSS, and MS Access. None of them has the problem. MS Access, as a personal database, always defaults to double. SAS and SPSS have only one numeric data type. They can all store the numbers accurately without additional user input. Shouldn't they care about dataset size as well? I think they do. There may be a valid argument for float in cases where double would give you higher precision at the expense of storage space. However, my original problem is that Stata sets the float type for what should be a long integer.

That the -compress- command misses the opportunity to reduce double to float can be easily demonstrated.

. set type double

. clear

. gen a=4.1

. compress

. des a

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
a               double %10.0g

. recast float a

. des a

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
a               float  %10.0g


Junlin

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Christopher Baum
Sent: Friday, February 18, 2011 8:12 AM
To: [email protected]
Subject: re: RE: st: A bug in egen and gen?

I would just add one thing to this discussion. I somehow doubt that the original poster is working with very large (million-obs+) datasets with hundreds or thousands of variables. There are those of us who encounter and struggle with such datasets. He is correct in noting that advances in readily available technology remove a lot of constraints: no reason nowadays to work with a 32-bit operating system (especially since those that are are really lame), no reason to have less than 4 Gb of RAM or 0.5 terabyte hard disk, etc. on even a relatively inexpensive new machine.

BUT 4 Gb of RAM is not enough to analyze a number of commonly-used social science and finance data sets, even in their most parsimonious form, on any operating system supported by Stata. And disk space, while plentiful, should not be consumed without concern for the fact that reading and writing a .dta file that is possibly twice as large is quite a bit slower. Computers' speed improvements have not so readily extended to input/output, and until solid-state hard disks are ubiquitous and cheap, that's not going to happen without paying quite a bit more for a machine. His suggestion to automatically -compress- every time you use -save- would make that operation very tedious in a context where thousands of variables have to be evaluated. So there are good reasons for having a program that allows you to read and save floating-point numbers in single precision, especially when the innate precision of any number in, e.g., the national income accounts can be readily represented by a single-precision ("float"). StataCorp's choice for Mata was to represent all numeric variables as doubles, but then I do not usually move my whole data set into Mata matrices.

Whether float or double should be the default data type is a matter of preference, and you are free to exercise that preference. If you work with relatively small data sets, you might well want to set precision to double as the default. For many of us who work with very large data sets, it would be a disastrous choice. What works very well for some users will not work well for others, and many Stata users face resource constraints: they cannot readily get a machine with 8 Gb or more of RAM, or a larger hard disk---or even an upgrade to Stata 11! Keep in mind that not all users have ready access to the latest and greatest that the computer industry has to offer, but they still want to take advantage of Stata.
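
For anyone who does want to exercise that preference, a minimal sketch of the setting looks like this. It is only an illustration of the -set type- command; whether it suits your data sets is the judgment call discussed above.

```stata
* Make double the default storage type for new variables in this session
set type double

* Or keep that default across future sessions as well
set type double, permanently

* Check which default is currently in effect
display c(type)

* Return to Stata's shipped default
set type float
```

The setting only affects variables created afterwards without an explicit type; existing variables keep whatever storage type they already have.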

Kit

PS> On the subject of egen and its alleged deficiencies: -egen- is pure ado-file code, of the sort that anyone can write. If the complainer wants to write his own improved version of a program that does what -egen- does, but does it to his liking, he is free to do so and share it via SSC with other users.

Kit Baum   |   Boston College Economics & DIW Berlin   |   http://ideas.repec.org/e/pba1.html
                              An Introduction to Stata Programming  |   http://www.stata-press.com/books/isp.html
   An Introduction to Modern Econometrics Using Stata  |   http://www.stata-press.com/books/imeus.html


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

