Re: st: Splitting Dataset - Save by unique identifier

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: Splitting Dataset - Save by unique identifier
Date	Sun, 28 Oct 2012 11:23:18 +0000

Note that

 su obsno if permgroup == `g'

should be

 su obsno if permgroup == `g', meanonly
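
With that correction, the loop quoted below can be sketched in runnable form. This is an illustration only: the auto dataset shipped with Stata and its variable rep78 are stand-ins (an assumption, not the poster's data) for the real dataset and -permno-, and the inner -su price- is just a placeholder for whatever is done per group.

```stata
* Sketch only: auto and rep78 stand in for the real data and permno
sysuse auto, clear
drop if missing(rep78)

* sorting makes each group a contiguous block of observations
sort rep78
egen permgroup = group(rep78)
su permgroup, meanonly
local gmax = r(max)
gen long obsno = _n

forval g = 1/`gmax' {
    * meanonly still leaves r(min) and r(max) behind, but skips the rest
    su obsno if permgroup == `g', meanonly
    local min = r(min)
    local max = r(max)

    * placeholder for the real per-group work, restricted with -in-
    su price in `min'/`max', meanonly
    di "group `g': obs `min' to `max', mean price = " %9.2f r(mean)
}
```

The one remaining -if- in each pass is only there to find the block boundaries; everything else in the loop body uses -in-.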

On Sun, Oct 28, 2012 at 11:21 AM, Nick Cox <[email protected]> wrote:
> We can't advise on speeding up code you don't show us or don't explain.
>
> In general,
>
> 0. Stata is fairly fast so long as it can hold all data in memory.
> What's fastest are built-in commands written in C (invisible to the
> user) and/or Mata (partly visible to the user). What's slower is the
> same problem approached as interpreted ado code. What's slowest of all
> is writing your code to loop over observations, as in one of your
> previous posts. Only rarely is that the best practical approach.
>
> 1. What's best for you depends on (a) how big your dataset is, (b)
> what your computer set-up is and (c) what you're doing. Even if we
> knew all that, there is still a sense in which only experiments
> given (a) (b) (c) can imply what's fastest for you. You will know
> this!
>
> 2. That said, my visceral feeling is that reading in the same dataset
> 400 times can't be the best way to do something, nor can splitting a
> dataset into 10000 smaller datasets.
>
> 3. You may not be aware of Blasnik's Law, at least not by that
> name. (I named this law after Michael Blasnik, who did a lot on
> this list to make clear how much it can bite.)
>
> See e.g. http://www.stata.com/statalist/archive/2007-09/msg00264.html
> for an example, but I note that the term was in use at least by 2004.
>
> Blasnik's Law is that whenever a task can be done using -if- and
> equivalently using -in-, then the -in- solution will be (much) faster.
>
> In your case anything that centres on
>
> <some stuff>  if permno == <some value>
>
> can be very slow because Stata will just test every observation to
> work out whether the -if- condition is true. This will often be faster:
>
> sort permno <whatever>
> * regardless of any irregularities in -permno-, -permgroup- will take values 1 up
> egen permgroup = group(permno)
> su permgroup, meanonly
> local gmax = r(max)
> gen long obsno = _n
>
> forval g = 1/`gmax' {
>                su obsno if permgroup == `g'
>                local min = r(min)
>                local max = r(max)
>
>                <all operations for this group>  in `min'/`max'
> }
>
> See also
>
> SJ-7-3  st0135  . . . . . . . . . . . Stata tip 50: Efficient use of summarize
>         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
>         Q3/07   SJ 7(3):438--439                                 (no commands)
>         tip on using the meanonly option of summarize to more
>         quickly determine the mean and other statistics of a
>         large dataset
>
> Directly accessible at http://www.stata-journal.com/sjpdf.html?articlenum=st0135
>
> Specifically, as above I can't see that what you are imagining -- splitting
> into thousands of small datasets -- is a good idea, but on how to do
> it: see -savesome- (SSC) as a convenience command which you would need
> to call in a loop. It does _not_ rule out reading the whole dataset
> back in once you have -save-d a part of it.
>
> Nick
>
> On Sat, Oct 27, 2012 at 10:28 PM, Tim Streibel <[email protected]> wrote:
>
>> I have a question. I am currently computing abnormal returns in a way that requires opening a large dataset (about 2m obs.) about 400 times, which I think costs a lot of time.
>>
>> So my idea is to create small datasets (for each stock one dataset). Is there a way to quickly create a dataset only containing the observations of one stock (uniquely identified by Permno)?
>>
>> Currently my only idea is to open the large dataset, drop all obs. except those of one stock, and save it. But doing that for every stock forces me to open the large dataset 10 000 times, so it doesn't really save me time.
>>
>> Some combination of by (permno) and save would be cool.
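
On the literal question of one file per stock, here is a minimal sketch using only official commands (-preserve-, -keep in-, -save-, -restore-). Again the auto dataset and rep78 are stand-ins for the real data and -permno-, and the file names group_1.dta, group_2.dta, ... are hypothetical. Each -preserve-/-restore- pair may itself cost a pass over the full dataset, so this is not offered as a fast solution, only as a way of doing it without re-reading the big file by hand.

```stata
* Sketch only: auto and rep78 stand in for the real data and permno
sysuse auto, clear
drop if missing(rep78)

sort rep78
egen permgroup = group(rep78)
su permgroup, meanonly
local gmax = r(max)
gen long obsno = _n

forval g = 1/`gmax' {
    * find the first and last observation of this group
    su obsno if permgroup == `g', meanonly
    local min = r(min)
    local max = r(max)

    preserve
    keep in `min'/`max'
    * hypothetical file name, one file per group
    save "group_`g'.dta", replace
    restore
}
```

Whether thousands of small files then help at all is a separate question, as argued above; timing this on a subset first would tell.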