Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: calculating cumulative values of other observations


From   Eric Booth <[email protected]>
To   [email protected]
Subject   Re: st: calculating cumulative values of other observations
Date   Sat, 7 Apr 2012 16:42:21 -0500

<>

Or you can do it in one line with the cond() function:

******************!
//input example data//
clear
inp ID	str18 (date_start date_end)  str4 activity hours   	
1	20071230 20071230 A	10	   			
1	20071122 20071129 A	11	  
2	20071120 20071120 A	5	  
3	20071210 20071210 B	6	  
3	20071111 20071111 B	24	  
3	20071130 20071130 F	20	  
3	20070511 20070511 B	49	  
4	20070611 20070611 A	50	  
end

**nick's original code
bysort ID activity (date_start): gen ph = sum(hours) - hours

**ind
bysort ID activity (date_start): gen ind = 1 if ///
  	(date_start>=date_end[_n-1] & !mi(date_end[_n-1]))

**priorhours
bysort ID activity (date_start): gen priorhrs = ///
  	sum(hours) - hours 
  	replace priorhrs = 0 if  priorhrs>0 & ///
  	(date_start<date_end[_n-1] & !mi(date_end[_n-1]))

**with cond()
bys ID activity (date_start) : g oneline =  ///
	cond((date_start>=date_end[_n-1] & ///
	!mi(date_end[_n-1])), sum(hours) - hours, 0, .)
assert ph == oneline //check
********************!


- Eric

__
Eric A. Booth
Public Policy Research Institute 
Texas A&M University
[email protected]
+979.845.6754

On Apr 7, 2012, at 4:36 PM, Eric Booth wrote:

> <>
> 
> Your new [if] condition is preventing the adapted version of Nick's code from creating the running sum properly.  To see how the [if] condition you added is "filtering" the observations that are eligible for the running sum, create an indicator/flag for your [if] conditions and inspect the results:
> ***
>  bysort ID activity (date_start): gen ind = 1 if ///
>  	(date_start>=date_end[_n-1] & !mi(date_end[_n-1]))
> ***
> 
> You should be able to see that this stops Stata from including the observations you wanted included in the sum using 'ind'.
> 
> Without seeing your new variables - you could probably break your new code into two steps like:
> ***
> bysort ID activity (date_start): gen priorhrs = ///
>  	sum(hours) - hours 
> replace priorhrs = 0 if  priorhrs>0 & ///
>  	(date_start<date_end[_n-1] & !mi(date_end[_n-1]))
> ***
> to get what you need.
> 
> - Eric
> __
> Eric A. Booth
> Public Policy Research Institute 
> Texas A&M University
> [email protected]
> +979.845.6754
> 
> On Apr 7, 2012, at 4:12 PM, KOTa wrote:
> 
>> thanks Eric,
>> i did adapt your code for real dataset without problems (see my
>> previous response)
>> 
>> i am just interested how to do it using Nick's code, which works much
>> faster (dataset is pretty large).
>> i am probably missing some thing about how "sum" works or about
>> boundary conditions
>> 
>> my rewrite of Nick was:
>> 
>> bysort ID activity (date_start): gen priorhrs = sum(hours) - hours if
>> (date_start>=date_end[_n-1] & !mi(date_end[_n-1]))
>> 
>> which does not work :/
>> 
>> regards
>> K.
>> 
>> El día 7 de abril de 2012 14:09, Eric Booth <[email protected]> escribió:
>>> <>
>>> 
>>> On Apr 7, 2012, at 9:42 AM, KOTa wrote:
>>> 
>>>> thanks Erik, Nick
>>>> 
>>>> both ways work on this sample, and i managed to adjust for a real database also.
>>>> 
>>>> few followup question:
>>>> 
>>>> 1. Eric, any specific reason why you convert time into date format? I
>>>> used it without conversion and it looks fine.
>>>> Also on real data your solution somehow goes into infinite loop after
>>>> finishing all calculations.
>>>> 
>>> I don't know how you edited the code, so I don't know how you produced an infinite loop?  It certainly doesn't do that with the example I gave you.  Also, I created a time/date format version of "date" just out of habit, I find that it makes data management easier.  If you don't need/want it, then skip it.
>>> 
>>> 
>>>> 2. Nick, the solution you proposed with "sum" was the one i tried
>>>> initially. The problem is that in real data I have 2 columns for date
>>>> (start_date end_date) and i want to compare start_date [_n] >=
>>>> end_date[all previous]
>>>> Eric's solution works longer, but i could adjust it to work with
>>>> different date columns.
>>>> your solution is works much faster in case there is one date column,
>>>> but i could not figure out how to do the same with 2 different
>>>> columns.
>>> 
>>> El día 7 de abril de 2012 04:08, Nick Cox <[email protected]> escribió:
>>>> Why not
>>>> bysort id activity (time): gen priorhrs = sum(time) - time
>>>> Nick
>>> 
>>> 
>>> Nick's -sum- solution does work (and is more straightforward than my -replace line)  with a slight tweak - changing the first "time" to "date2" (if you create the time/date var from my example).  Also, if you are using the same variable names from your original example you'd change "time" to "hours".  Adding this to my example works:
>>> 
>>> bysort ID activity (date2): gen njc = sum(hours) - hours
>>> 
>>> 
>>> Adapting mine and Nick's code to your "real" dataset is another issue completely - if you cannot figure how to adapt our tips to your data/code, then you should post a data & code example/question that is closer to reality.
>>> 
>>> 
>>> - Eric
>>> __
>>> Eric A. Booth
>>> Public Policy Research Institute
>>> Texas A&M University
>>> [email protected]
>>> Office: +979.845.6754
>>> 
>>> 
>>> 
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/
>> 
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
> 


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index