
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Creating a smaller dataset from a larger one.


From   Maarten Buis <[email protected]>
To   [email protected]
Subject   Re: st: Creating a smaller dataset from a larger one.
Date   Mon, 13 Aug 2012 20:01:55 +0200

On Mon, Aug 13, 2012 at 4:31 PM, Amal Khanolkar wrote:
> I have a very large dataset with almost 3 million subjects - great to work with, but however a bit difficult to transport or carry with me. I prefer to create a smaller sub-dataset with say 100,000 subjects chosen at random.

Alternatively, you could select the variables you want to keep and use
-contract-. For each unique combination of these variables it keeps
only one observation and records how many observations that
combination represents in a new variable, _freq. You can then add the
frequency weight -[fw=_freq]- to all subsequent commands, and thus
keep all the information from your original dataset. If your variables
are all categorical, the reduction in size (and the speed-up in
execution of commands) can be spectacular. However, even with
continuous variables the savings can be considerable, as continuous
variables are hardly ever as continuous as we think.
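
A minimal sketch of that workflow, assuming the auto dataset that ships
with Stata stands in for your data (the choice of dataset and of the
variables foreign, rep78 and mpg is purely illustrative):

```stata
* Load the example data and store results from the full dataset
sysuse auto, clear
summarize mpg
scalar n_full    = r(N)
scalar mean_full = r(mean)

* Keep one observation per unique combination of the listed variables;
* -contract- records the number of original observations in _freq
contract foreign rep78 mpg

* Frequency weights reproduce the results from the full dataset
summarize mpg [fw=_freq]
assert r(N) == n_full
assert reldif(r(mean), mean_full) < 1e-12
```

With only 74 observations the reduction here is modest; the point is
only that the weighted results match the unweighted ones from the
full data.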

Another way to reduce the size of a dataset without losing
information is -compress-.
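
As a small illustration, again assuming the auto dataset shipped with
Stata, and with a hypothetical variable flag that is deliberately
stored in a wasteful type:

```stata
sysuse auto, clear

* Store a 0/1 indicator as a double (8 bytes per observation)
generate double flag = foreign

* -compress- demotes each variable to the smallest storage type
* that holds its values without loss
compress

* The values are unchanged; only the storage type differs
assert flag == foreign
local t : type flag
assert "`t'" == "byte"
```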

Hope this helps,
Maarten

---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

http://www.maartenbuis.nl
---------------------------------

