Re: st: Creating a smaller dataset from a larger one.


From   Maarten Buis <maartenlbuis@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Creating a smaller dataset from a larger one.
Date   Mon, 13 Aug 2012 20:01:55 +0200

On Mon, Aug 13, 2012 at 4:31 PM, Amal Khanolkar wrote:
> I have a very large dataset with almost 3 million subjects - great to work with, but a bit difficult to transport or carry with me. I would prefer to create a smaller sub-dataset with, say, 100,000 subjects chosen at random.
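
For reference, the random draw asked about here can be sketched with
-sample-. This is only an illustration under assumptions: it uses the
auto data shipped with Stata so that it runs anywhere, the seed value
is arbitrary, and it assumes one observation per subject. For the
dataset described above you would -use- your own file and replace 20
with 100000.

```stata
* Illustrative sketch on Stata's shipped example data
sysuse auto, clear

* Fix the seed so the same random subsample can be drawn again
set seed 12345

* Keep exactly 20 randomly chosen observations (the count option makes
* the number a count of observations rather than a percentage)
sample 20, count

* Confirm how many observations remain
count
```

Unlike the approaches below, a random subsample does discard
information.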

Alternatively, you could select the variables you want to keep and use
-contract-. For each unique combination of values of these variables
it keeps only one observation, and records in a new variable, _freq,
how many original observations that combination represents. You can
then add frequency weights, -[fw=_freq]-, to all subsequent commands,
and thus keep all the information those variables carried in your
original dataset. If your variables are all categorical, the reduction
in size (and the speed-up in execution of commands) can be
spectacular. However, even with continuous variables the savings can
be considerable, as continuous variables are hardly ever as continuous
as we think.
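
A minimal sketch of that workflow, assuming the auto data shipped with
Stata stands in for the large file. The choice of rep78 and foreign is
purely illustrative, not taken from the original question.

```stata
* Illustrative sketch on Stata's shipped example data
sysuse auto, clear

* Full-data table, for comparison with the contracted version
tabulate rep78 foreign

* Collapse to one observation per combination of the listed variables;
* the number of original observations is stored in _freq
contract rep78 foreign
list

* The frequency-weighted table should reproduce the full-data table
tabulate rep78 foreign [fw=_freq]
```

Note that -contract- replaces the data in memory, so -save- the result
under a new name rather than overwriting the original file.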

Another way to reduce the size of a dataset without losing
information is -compress-, which stores each variable in the smallest
storage type that can hold its values exactly.
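
A small sketch of what -compress- does, again assuming the shipped
auto data. The -recast- step exists only to create a wastefully stored
variable for the illustration.

```stata
* Illustrative sketch on Stata's shipped example data
sysuse auto, clear

* Deliberately store price in a larger type than it needs
recast double price
describe price

* Demote every variable to the smallest type that holds its values
compress
describe price
```

Comparing the two -describe- outputs shows the change in storage type;
the values themselves are unchanged.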

Hope this helps,
Maarten

---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

http://www.maartenbuis.nl
---------------------------------