RE: st: data manipulation prob.


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: data manipulation prob.
Date   Thu, 7 Jun 2012 18:41:45 +0100

I am going to guess that there is a panel structure too, hidden from this example. Consider 

bysort id (date) : gen sumhits = sum(hits) 
by id : egen double when_halfway = min(date / (sumhits >= (sumhits[_N] / 2))) 
by id : gen double time_halfway = when_halfway - date[1] 

For more on the trick in the second line, see 

SJ-11-2 dm0055  . . . . . . . . . . . . . .  Speaking Stata: Compared with ...
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q2/11   SJ 11(2):305--314                                (no commands)
        reviews techniques for relating values to values in other
        observations
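
The trick in the second line can be seen in isolation with a small sketch. The toy data and the names t, ratio and first_t are made up for illustration and are not from the original question. Wherever the condition is false, the division is by 0 and yields missing, and egen's min() ignores missings, so only values satisfying the condition compete for the minimum.

```stata
clear
input id t hits
1 1 2
1 2 3
1 3 5
2 1 4
2 2 1
2 3 1
end

* running sum of hits within each id, in time order
bysort id (t) : gen sumhits = sum(hits)

* t / 1 = t where the condition is true; t / 0 = missing where it is false
by id : gen ratio = t / (sumhits >= (sumhits[_N] / 2))

* min() ignores missings, so this is the first t at which half is reached
by id : egen first_t = min(ratio)

list, sepby(id)
```

With real %tc datetimes, -egen double- seems advisable, since such values do not fit exactly in a float.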

With no panel structure, this could be 

sort date 
gen sumhits = sum(hits) 
su date if sumhits >= (sumhits[_N] / 2)
di r(min) - date[1]

The underlying principle is tautological: the first date on which something is true is just the minimum date satisfying that condition. 
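
As a sketch of that principle on the ten observations listed in this thread: the string variable sdate and the conversion with clock() are assumptions about how the data might be entered, and date is assumed to be a %tc datetime held as a double.

```stata
clear
input str18 sdate hits
"10mar2011 01:07:18"  2
"10mar2011 01:09:48"  3
"10mar2011 01:54:00"  1
"10mar2011 02:03:37"  8
"10mar2011 02:11:00"  9
"10mar2011 02:26:00"  5
"10mar2011 02:46:00" 12
"10mar2011 02:47:00" 34
"10mar2011 02:51:09" 14
"10mar2011 02:51:24" 80
end

* datetimes are milliseconds since 1960 and need double precision
gen double date = clock(sdate, "DMYhms")
format date %tc

sort date
gen sumhits = sum(hits)

* the first date on which half the total is reached is the minimum such date
su date if sumhits >= (sumhits[_N] / 2), meanonly
di %tc r(min)

* elapsed time since the first observation, converted from ms to minutes
di (r(min) - date[1]) / 60000
```

Here the total is 168, so half is 84, and the running sum first reaches that at 10mar2011 02:51:09, which is 103.85 minutes after the first observation.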

Nick 
n.j.cox@durham.ac.uk 

tashi lama

You guessed that right. I could have chosen a less regular example; my real dataset can be quite irregular. I have an idea, though; I just don't know enough Stata to carry it out.

 

     +---------------------------+
     |               date   hits |
     |---------------------------|
  1. | 10mar2011 01:07:18      2 |
  2. | 10mar2011 01:09:48      3 |
  3. | 10mar2011 01:54:00      1 |
  4. | 10mar2011 02:03:37      8 |
  5. | 10mar2011 02:11:00      9 |
     |---------------------------|
  6. | 10mar2011 02:26:00      5 |
  7. | 10mar2011 02:46:00     12 |
  8. | 10mar2011 02:47:00     34 |
  9. | 10mar2011 02:51:09     14 |
 10. | 10mar2011 02:51:24     80 |
     +---------------------------+


gen runhits=sum(hits)

list 

     +-------------------------------------+
     |               date   hits   runhits |
     |-------------------------------------|
  1. | 10mar2011 01:07:18      2         2 |
  2. | 10mar2011 01:09:48      3         5 |
  3. | 10mar2011 01:54:00      1         6 |
  4. | 10mar2011 02:03:37      8        14 |
  5. | 10mar2011 02:11:00      9        23 |
     |-------------------------------------|
  6. | 10mar2011 02:26:00      5        28 |
  7. | 10mar2011 02:46:00     12        40 |
  8. | 10mar2011 02:47:00     34        74 |
  9. | 10mar2011 02:51:09     14        88 |
 10. | 10mar2011 02:51:24     80       168 |
     +-------------------------------------+


gen x=(runhits>ceil(runhits[_N]/2))

list 

     +-----------------------------------------+
     |               date   hits   runhits   x |
     |-----------------------------------------|
  1. | 10mar2011 01:07:18      2         2   0 |
  2. | 10mar2011 01:09:48      3         5   0 |
  3. | 10mar2011 01:54:00      1         6   0 |
  4. | 10mar2011 02:03:37      8        14   0 |
  5. | 10mar2011 02:11:00      9        23   0 |
     |-----------------------------------------|
  6. | 10mar2011 02:26:00      5        28   0 |
  7. | 10mar2011 02:46:00     12        40   0 |
  8. | 10mar2011 02:47:00     34        74   0 |
  9. | 10mar2011 02:51:09     14        88   1 |
 10. | 10mar2011 02:51:24     80       168   1 |
     +-----------------------------------------+


Now, I could do something like 

di date[n]-date[1]

where n is the observation number at which x first equals 1. Alternatively, we could generate another variable, "indicator", containing only a single 1. In any case, I need a mechanism to get the observation number at which x first equals 1. Hope this helps...
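
One possible mechanism, offered only as a sketch: store the observation number in a variable and take its minimum over the observations where x is 1. The names obsno and first are made up for illustration, and the dates are left out to keep the example short.

```stata
clear
input hits
2
3
1
8
9
5
12
34
14
80
end

gen runhits = sum(hits)
gen byte x = runhits > ceil(runhits[_N] / 2)

* observation number as a variable, so that -summarize- can search it
gen long obsno = _n

* smallest observation number among those with x == 1
su obsno if x, meanonly
local first = r(min)
di "x is first 1 in observation " `first'
```

With a date variable in memory, di date[`first'] - date[1] would then give the elapsed time.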

Nick Cox 

> On the last question first: the usual Stata way is to add observations
> at the end and then -sort-, although you could also -append- to a
> one-observation dataset.
>
> If -hits- is always 1, then
>
> sort date
> gen obs = _n
> su obs, meanonly
> di date[ceil(r(mean))] - date[1]
>
> I guess you will now tell us that the real data are more complicated.
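
The first suggestion above (add an observation at the end, then -sort-) might look like the following sketch. The three-observation dataset, the string variable sdate and the local newN are assumptions for illustration; the new observation is left with hits missing.

```stata
clear
input str18 sdate hits
"10mar2011 01:07:18" 1
"10mar2011 01:09:48" 1
"10mar2011 01:54:00" 1
end
gen double date = clock(sdate, "DMYhms")
format date %tc
drop sdate

* add one empty observation at the end of the dataset
local newN = _N + 1
set obs `newN'

* fill in the new date in the last observation
replace date = clock("07mar2011 03:00:00", "DMYhms") in L

* sorting moves the earliest date to the top
sort date
list
```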

On Wed, Jun 6, 2012 at 10:24 PM, tashi lama <ltashi32@hotmail.com> wrote:

> >      +---------------------------+
> >      |               date   hits |
> >      |---------------------------|
> >   1. | 10mar2011 01:07:18      1 |
> >   2. | 10mar2011 01:09:48      1 |
> >   3. | 10mar2011 01:54:00      1 |
> >   4. | 10mar2011 02:03:37      1 |
> >   5. | 10mar2011 02:11:00      1 |
> >      |---------------------------|
> >   6. | 10mar2011 02:26:00      1 |
> >   7. | 10mar2011 02:46:00      1 |
> >   8. | 10mar2011 02:47:00      1 |
> >   9. | 10mar2011 02:51:09      1 |
> >  10. | 10mar2011 02:51:24      1 |
> >      +---------------------------+
> >
> > I need to find the time taken to get half of the total hits
> >
> > summ hits
> >
> > gen runsum=sum(hits)
> >
> >      +---------------------------------+
> >      |               date   hits     x |
> >      |---------------------------------|
> >   1. | 10mar2011 01:07:18      1     1 |
> >   2. | 10mar2011 01:09:48      1     2 |
> >   3. | 10mar2011 01:54:00      1     3 |
> >   4. | 10mar2011 02:03:37      1     4 |
> >   5. | 10mar2011 02:11:00      1     5 |
> >      |---------------------------------|
> >   6. | 10mar2011 02:26:00      1     6 |
> >   7. | 10mar2011 02:46:00      1     7 |
> >   8. | 10mar2011 02:47:00      1     8 |
> >   9. | 10mar2011 02:51:09      1     9 |
> >  10. | 10mar2011 02:51:24      1    10 |
> >      +---------------------------------+
> >
> > Now, the problem I am having is that I will be comparing r(sum) with values in variable "x", but I need the result in terms of variable "date". So, if r(sum)/2 is 5, then I know to compute date[5]-date[1]. Any ideas?
> >
> > Also, is it possible to add one date observation at the top of the date column programmatically? I need to add 07mar2011 03:00:00 to the date column and, because this date comes before the other observations in the dataset, make it my first observation.


