Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# RE: st: data manipulation prob.

| | |
|---|---|
| From | Nick Cox <[email protected]> |
| To | "'[email protected]'" <[email protected]> |
| Subject | RE: st: data manipulation prob. |
| Date | Thu, 7 Jun 2012 18:41:45 +0100 |

```
I am going to guess that there is a panel structure too, hidden from this example. Consider

bysort id (date) : gen sumhits = sum(hits)
by id : egen double when_halfway = min(date / (sumhits >= (sumhits[_N] / 2)))
by id : gen double time_halfway = when_halfway - date[1]

For more on the trick in the second line, see

SJ-11-2 dm0055  . . . . . . . . . . . . . .  Speaking Stata: Compared with ...
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
Q2/11   SJ 11(2):305--314                                (no commands)
reviews techniques for relating values to values in other
observations

With no panel structure, this could be

sort date
gen sumhits = sum(hits)
su date if sumhits >= (sumhits[_N] / 2)
di r(min) - date[1]

The underlying principle is tautological: the first date on which something is true is just the minimum date satisfying that condition.

Nick
[email protected]

tashi lama

You guessed right. I should have picked a less regular example; my real dataset can be quite irregular. I have an idea, though, but I don't know enough Stata to carry it out.

     +---------------------------+
     |               date   hits |
     |---------------------------|
  1. | 10mar2011 01:07:18      2 |
  2. | 10mar2011 01:09:48      3 |
  3. | 10mar2011 01:54:00      1 |
  4. | 10mar2011 02:03:37      8 |
  5. | 10mar2011 02:11:00      9 |
     |---------------------------|
  6. | 10mar2011 02:26:00      5 |
  7. | 10mar2011 02:46:00     12 |
  8. | 10mar2011 02:47:00     34 |
  9. | 10mar2011 02:51:09     14 |
 10. | 10mar2011 02:51:24     80 |
     +---------------------------+

gen runhits=sum(hits)

list

     +-------------------------------------+
     |               date   hits   runhits |
     |-------------------------------------|
  1. | 10mar2011 01:07:18      2         2 |
  2. | 10mar2011 01:09:48      3         5 |
  3. | 10mar2011 01:54:00      1         6 |
  4. | 10mar2011 02:03:37      8        14 |
  5. | 10mar2011 02:11:00      9        23 |
     |-------------------------------------|
  6. | 10mar2011 02:26:00      5        28 |
  7. | 10mar2011 02:46:00     12        40 |
  8. | 10mar2011 02:47:00     34        74 |
  9. | 10mar2011 02:51:09     14        88 |
 10. | 10mar2011 02:51:24     80       168 |
     +-------------------------------------+

gen x=(runhits>ceil(runhits[_N]/2))

list

     +-----------------------------------------+
     |               date   hits   runhits   x |
     |-----------------------------------------|
  1. | 10mar2011 01:07:18      2         2   0 |
  2. | 10mar2011 01:09:48      3         5   0 |
  3. | 10mar2011 01:54:00      1         6   0 |
  4. | 10mar2011 02:03:37      8        14   0 |
  5. | 10mar2011 02:11:00      9        23   0 |
     |-----------------------------------------|
  6. | 10mar2011 02:26:00      5        28   0 |
  7. | 10mar2011 02:46:00     12        40   0 |
  8. | 10mar2011 02:47:00     34        74   0 |
  9. | 10mar2011 02:51:09     14        88   1 |
 10. | 10mar2011 02:51:24     80       168   1 |
     +-----------------------------------------+

Now I could do something like

di date[n] - date[1]

where n is the observation number at which x first equals 1. Alternatively, we could generate another variable, "indicator", that contains only a single 1. In either case, I need a mechanism for getting the observation number at which x first equals 1. Hope this helps...

Nick Cox

> On the last question first: the usual Stata way is to add observations
> at the end and then -sort-, although you could also -append- to a
> one-observation dataset.
>
> If -hits- is always 1, then
>
> sort date
> gen obs = _n
> su obs, meanonly
> di date[ceil(r(mean))] - date[1]
>
> I guess you will now tell us that the real data are more complicated.

On Wed, Jun 6, 2012 at 10:24 PM, tashi lama <[email protected]> wrote:

> >      +---------------------------+
> >      |               date   hits |
> >      |---------------------------|
> >   1. | 10mar2011 01:07:18      1 |
> >   2. | 10mar2011 01:09:48      1 |
> >   3. | 10mar2011 01:54:00      1 |
> >   4. | 10mar2011 02:03:37      1 |
> >   5. | 10mar2011 02:11:00      1 |
> >      |---------------------------|
> >   6. | 10mar2011 02:26:00      1 |
> >   7. | 10mar2011 02:46:00      1 |
> >   8. | 10mar2011 02:47:00      1 |
> >   9. | 10mar2011 02:51:09      1 |
> >  10. | 10mar2011 02:51:24      1 |
> >      +---------------------------+
> >
> > I need to find the time taken to get half of the total hits
> >
> > summ hits
> >
> > gen runsum=sum(hits)
> >
> >      +---------------------------------+
> >      |               date   hits     x |
> >      |---------------------------------|
> >   1. | 10mar2011 01:07:18      1     1 |
> >   2. | 10mar2011 01:09:48      1     2 |
> >   3. | 10mar2011 01:54:00      1     3 |
> >   4. | 10mar2011 02:03:37      1     4 |
> >   5. | 10mar2011 02:11:00      1     5 |
> >      |---------------------------------|
> >   6. | 10mar2011 02:26:00      1     6 |
> >   7. | 10mar2011 02:46:00      1     7 |
> >   8. | 10mar2011 02:47:00      1     8 |
> >   9. | 10mar2011 02:51:09      1     9 |
> >  10. | 10mar2011 02:51:24      1    10 |
> >      +---------------------------------+
> >
> > Now, the problem I am having is that I will be comparing r(sum) against the variable "x", but the calculation I need is on the variable "date". So, if r(sum)/2 is 5, then I know to compute date[5] - date[1]. Any ideas?
> >
> > Also, is it possible to add one observation at the top of the date column programmatically? I need to add 07mar2011 03:00:00 to the date column, and because this date is earlier than every other observation in the dataset, it needs to become my first observation.

```
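For anyone wanting to try the no-panel version from the thread, here is a minimal sketch that rebuilds the ten-observation example and applies the "minimum date satisfying the condition" idea. The string variable `sdate`, the scalar `when_halfway`, the local `last`, and the choice of zero hits for the extra observation are assumptions made for illustration; they do not appear in the original posts. Storing the clock values as `double` is deliberate, since `%tc` values are milliseconds since 1960 and would be rounded if held as `float`.

```stata
* Sketch only: rebuild the example data from the thread.
clear
input str18 sdate hits
"10mar2011 01:07:18"  2
"10mar2011 01:09:48"  3
"10mar2011 01:54:00"  1
"10mar2011 02:03:37"  8
"10mar2011 02:11:00"  9
"10mar2011 02:26:00"  5
"10mar2011 02:46:00" 12
"10mar2011 02:47:00" 34
"10mar2011 02:51:09" 14
"10mar2011 02:51:24" 80
end

* Convert to a clock (%tc) variable; double keeps the milliseconds exact.
gen double date = clock(sdate, "DMYhms")
format date %tc
drop sdate

* Running total of hits in time order.
sort date
gen long sumhits = sum(hits)

* The first date on which the running total reaches half the overall total
* is the minimum date among the observations satisfying that condition.
summarize date if sumhits >= sumhits[_N] / 2, meanonly
scalar when_halfway = r(min)

* Show the halfway date, then the elapsed time from the first observation
* (clock differences are in milliseconds, so divide by 60000 for minutes).
display %tc when_halfway
display (when_halfway - date[1]) / 60000

* Adding an earlier observation: extend the dataset by one, fill it, re-sort.
* Assumption: the extra observation carries zero hits.
local last = _N + 1
set obs `last'
replace date = clock("07mar2011 03:00:00", "DMYhms") in `last'
replace hits = 0 in `last'
sort date
```

With these data the total is 168, so the `>=` rule picks the observation at 02:51:09, where the running sum is 88, and the elapsed time comes to 103.85 minutes. Note that `sumhits` is not recomputed after the extra observation is added, so it would need regenerating before the halfway calculation is repeated on the extended dataset.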