Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: converting high frequency data to low frequency


From   Nick Cox <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   RE: st: converting high frequency data to low frequency
Date   Fri, 5 Nov 2010 11:59:13 +0000

David's suggestion strikes me as right in principle, but I think he's still thinking in terms of the bad old days before Stata 10 when people had to work out their own awkward  ways of handling times of day. That's a misunderstanding here. 

As always, the _format_ of these data is a matter of how they are to be displayed, and not a matter of how they are stored. (An article on the most common misunderstandings of Stata would surely include this one.) 

Dimitry's data look exactly like standard Stata date-times, allowed in Stata 10 up, meaning that underneath the cosmetic format they are times in milliseconds (ms). Therefore, he wants to round in units of 1000 * 60 * 5 = 300000. 

Here is a concrete example which covers everything needed to understand this problem. 

Using a %tc format for a -clock()- conversion of 11:31:00 today gives us back, not surprisingly, the same information: 

. di %tc  clock("5 Nov 2010 11:31:00", "DMYhms")
05nov2010 11:31:00

But underneath all that, the precise date-time _really_ is just an integer with units ms. 

. di %20.0f  clock("5 Nov 2010 11:31:00", "DMYhms")
       1604575860000

(The "20" in the format is much more than I need but causes no problem here.) 

You can round down or round up; which way you go is a matter of taste or convention. I almost never round using -int()-. I almost always round using -floor()- or -ceil()- because then I know immediately that I am rounding down (-floor()-) or up (-ceil()-;  think ceiling) and I don't get bit around 0 because the way -int()- works with negative numbers is not what I usually want, except that I might forget that or not foresee it might happen with my data. 

Now rounding down, for example, in units of 5 minutes is rounding down in units of 300000 ms. There are three steps, except that they can be combined in one line:

1. Divide by 300000. 

2. Round down to the next integer below. 

3. Multiply by 300000. 

So, the result is another large integer, 

. di %20.0f  300000 * floor(clock("5 Nov 2010 11:31:00", "DMYhms")/300000)
       1604575800000

But we should check that we did it right:

. di %tc  300000 * floor(clock("5 Nov 2010 11:31:00", "DMYhms")/300000)
05nov2010 11:30:00

With a variable it's going to be 

gen double binnedtime = 300000 * floor(ordertime/300000) 
format binnedtime %tc 

Never forget the -double-. Then you can -collapse- (or better -contract-) in terms of the new variable. (If it's really just time of day you care about, you must get there first by subtraction.) 

(I suggested generalising -floor()- and -ceil()- some years ago to StataCorp so that with two arguments -floor(ordertime, 300000), say, would do what is above, but the suggestion is still lurking in their files. A good argument against would be that the long-winded way to do it, as above, is easy enough.) 

See also if desired 

SJ-3-4  dm0002  . . . . . . . . Stata tip 2: Building with floors and ceilings
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q4/03   SJ 3(4):446--447                                 (no commands)
        tips for using floor() and ceil()


Nick 
[email protected] 

David Kantor
============

I believe you need to do just what you stated. In particular, if h, m 
, and s represent the hours, minutes and seconds, then you want
  gen long s2 = 3600 * h + 60 * m + s

then take int(s2/300)

(You may want to s2 to include date*24*3600 -- if you don't use the 
date as a separate component of the identifier.)

There may be other methods available that I'm not aware of.

Dimitry Masterov
================

>I have data on orders that looks like this:
>
>store_id        ordertime
>1901    10oct2010 15:48:18
>1901    10oct2010 15:50:36
>1901    10oct2010 16:41:35
>1901    10oct2010 16:53:37
>1901    10oct2010 18:38:24
>1901    10oct2010 19:09:22
>1901    10oct2010 21:14:22
>.....
>1906    10oct2010 17:54:59
>1906    12oct2010 19:21:27
>1906    17oct2010 16:10:01
>1906    17oct2010 19:34:39
>1906    17oct2010 21:11:00
>1906    17oct2010 22:18:55
>1906    17oct2010 22:55:13
>1906    19oct2010 22:23:10
>1906    20oct2010 01:19:24
>
>Ordertime is in %tc format. I would like to count the number of orders
>in 5 minute buckets by store_id. I would like to use
>
>collapse (count) ordertime, by(store_id bucket)
>
>but I have no idea how to group the ordertime into 5 minute intervals
>that are nicely labeled. I imagine I could calculate the elapsed time
>since midnight for each day, divide that into 5 minute intervals (or
>6000 unit intervals), and find some way to label, but I have not
>gotten this to work yet. Is there's an easier way? I tried googling
>and findit-ing, but I am not sure I know what this type of
>transformation is called.


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index