Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Problem with Stata handling of large dataset


From   Maarten Buis <[email protected]>
To   [email protected]
Subject   Re: st: Problem with Stata handling of large dataset
Date   Tue, 6 Aug 2013 10:08:08 +0200

On Mon, Aug 5, 2013 at 8:46 PM, Palan, Stefan wrote:
> Let me describe two different problems I encountered. When I type the following...
>
> clear
> set obs 20000000
> gen id=_n
> tsset id
>
> ...I get an error, since the values of id are not unique. This seems to be an issue having to do with the data type Stata uses for id. If I explicitly define id as type "long", the problem goes away. So I guess I should not rely on Stata to choose the appropriate variable type?

That depends on what you call appropriate. There is a trade-off
between the precision with which a number is stored and the amount of
memory that is used. For real data anything beyond float is (almost)
always overkill. Increasing the precision will just mean that more
random noise is stored, which will just result in false confidence in
the data. There are a number of examples that illustrate how
ridiculous that would be in section 5.5 of
http://www.blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/
Things are different when you are dealing with variables that are
supposed to identify the different units. Here you obviously want to
be precise: a float stores integers exactly only up to 16,777,216, so
with 20,000,000 observations some of the larger values of id are
rounded to the same number. However, Stata cannot know what your
variables are supposed to represent. So all it can do is choose a
default (float), document that (-help generate-), and give the user
the option to change that default either temporarily or permanently
(also documented in -help generate-).
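
As a rough illustration of the storage-type point above, the sketch
below creates the identifier twice, once with the default type and
once as an explicit long, and compares the two. The variable names
id_float and id_long are made up for this example, and it assumes the
default type has not already been changed with -set type-.

```stata
clear
set obs 20000000
* default storage type is float unless -set type- has changed it
generate id_float = _n
* an explicit long stores every one of these integers exactly
generate long id_long = _n
* count the observations where the float copy was rounded
count if id_float != id_long
* -isid- exits with an error if the variable does not uniquely
* identify the observations; -capture- lets the example continue
capture isid id_float
display _rc
* the long version passes the same check
isid id_long
```

If you would rather change the default than type the storage type each
time, -set type double- does that for the current session; whether the
extra memory is worth it depends on your data.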

> Another point I noticed when testing things. When I type...
>
> clear
> set obs 2
> gen x=_n
> by x: gen y=_n
>
> ...I get an error message that the values in x are not sorted. Do I have to explicitly sort by x prior to running the last command, even if the values in x are already in the correct order?

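Yes. Alternatively, you can use -bysort- or -by ..., sort-, see -help
by-. Stata does not inspect the data to see whether they happen to be
in order; it only checks whether they have been marked as sorted by
those variables. The logic is that sorting is a classic example of a
very time-consuming operation, and often multiple -by- commands are
issued one after another requiring the same ordering. So it is more
efficient to sort once.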
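
The alternatives mentioned above could look something like this,
using your own two-observation example. The variable names y, z and w
are only for illustration.

```stata
clear
set obs 2
generate x = _n
* option 1: sort explicitly once, then use -by- as often as needed
sort x
by x: generate y = _n
* option 2: let -bysort- sort and run the command in one step
bysort x: generate z = _n
* option 3: the long form of -bysort-
by x, sort: generate w = _n
```

When several -by x:- commands follow one another, the first option
sorts only once, which is the efficiency argument made above.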

-- Maarten

---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

http://www.maartenbuis.nl
---------------------------------
