Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Equivalent of Excel's COUNTIF


From   Robert Picard <[email protected]>
To   [email protected]
Subject   Re: st: Equivalent of Excel's COUNTIF
Date   Wed, 16 Nov 2011 15:15:49 -0500

You can avoid the -reshape- overhead by duplicating each patient
record directly using -expand-. Also, there is no need to do another
-reshape- to get back to the original data form, you can simply drop
the extra record. Here's a reworked example using Stata datetime
internal format:

*----------- begin example -------------

clear all
set seed 10101

* the number of milliseconds in a day
scalar mday = 24 * 60 * 60 * 1000

* create patient records
set obs 10000
gen double timein = clock("2011 1 1","YMD") + runiform()*180*mday
gen double timeout = timein + runiform()*10*mday
format timein timeout %tc
gen id = _n

* duplicate observations and order by event time
expand 2
sort id
by id: gen double etime = cond(_n==1,timein,timeout)
by id: gen event = cond(_n==1,1,-1)
sort etime id

* the population goes up or down
gen pop = sum(event) - 1

* drop the extra record
drop if event == -1
drop etime event

*------------ end example --------------



On Wed, Nov 16, 2011 at 2:30 PM, Richard Herron
<[email protected]> wrote:
> The -reshape- solution is faster than the loop solution (timer 1 vs
> timer 2 below). With 1e5 individuals the loop solution was beyond my
> patience.
>
> timer list
>   1:      0.12 /        1 =       0.1240
>   2:     13.42 /        1 =      13.4210
>
> I had to modify our solutions a little.
>
> * begin code
> timer clear
> timer on 1
> clear
> set obs 10000
> set seed 10101
> generate long id = _n
> generate datein = runiform()*5000
> generate dateout = datein + runiform()*15
>
> reshape long date, i(id) j(inout) string
> sort date
> tempvar change
> generate int `change' = cond(inout == "in", 1, -1)
> generate int total = sum(`change') - 1
> timer off 1
> timer list
>
> timer on 2
> clear
> set obs 10000
> set seed 10101
> generate arrival = runiform()*5000
> generate discharge = arrival + runiform()*15
> gen long npatients = .
> gen long _num_discharged = .
> sort arrival
> forvalues k=1/`=_N' {
>  quietly replace _num_discharged = sum( discharge <= arrival[`k'] )
>  quietly replace npatients = (_n-1) - _num_discharged[`k'] in `k'
> }
> timer off 2
>
> timer list
> * end code
>
> On Wed, Nov 16, 2011 at 11:44, Stas Kolenikov <[email protected]> wrote:
>> On Wed, Nov 16, 2011 at 11:04 AM, Richard Herron
>> <[email protected]> wrote:
>>> Here is an alternative solution with -reshape-, -cond-, and -sum-.
>>
>> Cute solution!
>>
>>> The last two functions should be fast at any scale, but I don't have
>>> enough experience with Stata to know if -reshape- is faster than a
>>> loop.
>>
>> That's easy to check: set obs 10M instead of 10, and see what will be
>> faster (and whether the -reshape- will start breaking down with large
>> data sets; it might or it might not). -reshape- appears to be using a
>> lot of I/O with explicit -use-, -save- and -merge- in the code; I
>> thought this would have been written in C or Mata -- there's a
>> reshape() function in Mata.
>>
>> --
>> Stas Kolenikov, also found at http://stas.kolenikov.name
>> Small print: I use this email account for mailing lists only.
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index