Statalist The Stata Listserver



Re: st: programming assist, too many unique values for levels


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: programming assist, too many unique values for levels
Date   Wed, 02 May 2007 17:02:20 +0100

Michael Blasnik already explained how you can cut the code
down and avoid -levels- (which in Stata 9 is called -levelsof-)
by using -egen, total()- and -egen, mean()-.

However, your code illustrates various points that arise
elsewhere, and so are worth brief comment.

How can code like this fragment be improved? (I have
indented your code, following standard precepts on
programming style.)

--------------------------------------- #0
gen time=.
levels pt, local(levels)
quietly foreach l of local levels {
	sum obstime if pt==`l'
	local total=r(sum)
	replace time=`total' if pt==`l'
}
---------------------------------------

1. Cut out the middle macro. The macro
used as message-bearer in

local total = r(sum)
replace ... = `total'

does no harm, but it is unnecessary.

--------------------------------------- #1
gen time=.
levels pt, local(levels)
quietly foreach l of local levels {
	sum obstime if pt==`l'
	replace time = r(sum) if pt==`l'
}
---------------------------------------

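2. As you only want the sum, use -summarize, meanonly-,
which skips the variance calculation and the display of
results. In a loop over many variables or, as here, many
values, this is always worth doing. -meanonly- is a dopey
name, because it doesn't mean what it says (the sum, count,
minimum and maximum are left behind as well as the mean),
but that's an issue aside.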

-------------------------------------- #2
gen time=.
levels pt, local(levels)
quietly foreach l of local levels {
	sum obstime if pt==`l', meanonly
	replace time = r(sum) if pt==`l'
}
--------------------------------------
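
To see what -meanonly- leaves behind, a small sketch on made-up
data may help. The values are hypothetical; only the variable names
come from the thread.

```stata
* Toy data, hypothetical values
clear
input pt obstime
1 10
1 20
2  5
end

summarize obstime if pt == 1, meanonly
return list              // r(N), r(sum_w), r(sum), r(mean), r(min), r(max)
display r(sum)           // 30
display r(mean)          // 15
```

Nothing is displayed by -summarize- itself, and no variance or
standard deviation is computed, which is where the saving comes from.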

3. -pt- comes out of an -encode-, which yields integers
from 1 up. You might as well exploit that. That circumvents
problems with the limits on -levels- and the relative
inefficiency of -foreach-. The r-class result r(max)
should be treated like a macro, that is, wrapped in macro
quotes as `r(max)', so that -forvalues- sees its value,
not the name.

-------------------------------------- #3
gen time=.
su pt, meanonly
quietly forval l = 1/`r(max)' {
	sum obstime if pt==`l', meanonly
	replace time = r(sum) if pt==`l'
}
--------------------------------------
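
The point about macro quotes can be checked on a toy dataset
(hypothetical values). The sketch also illustrates that the range is
filled in once, when the -forvalues- line is read, so overwriting
r(max) inside the loop does no harm. The counter -n- is there only
for illustration.

```stata
* Toy data, hypothetical values
clear
input pt
1
2
3
end

summarize pt, meanonly
display "`r(max)'"          // macro quotes substitute the value: 3

* the range is fixed before the loop starts, so the -summarize-
* inside the loop can overwrite r(max) without changing it
local n = 0
forvalues l = 1/`r(max)' {
    summarize pt if pt == `l', meanonly
    local n = `n' + 1
}
display `n'                 // 3 passes through the loop
```

If that reads as too terse, copying the result first with
-local max = r(max)- and then looping over 1/`max' should be
equivalent.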

4. In fact, you can do it directly without loops.
Michael might do it something like this.
(Note that Andrew already sorted by -pt-.)

------------------------------------- #4
by pt : egen time = total(obstime)
--------------------------------------
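
The same pattern would seem to cover the other two loops in Andrew's
code as well. What follows is a minimal sketch on made-up data, not
code from the thread: the values are hypothetical, the variable names
follow Andrew's post, and it assumes a Stata version in which
-egen, total()- is available.

```stata
* Toy data, hypothetical values; variable names follow the original post
clear
input pt obstime out_range bp_systolic
1 10  0 130
1 20 20 150
1  .  . 145
2  5  0 120
2  .  . 135
end
sort pt

* totals per patient; -egen, total()- treats missing as zero
by pt : egen time     = total(obstime)
by pt : egen time_out = total(out_range)
gen time_o_r = time_out / time

* share of non-missing readings at or above the threshold;
* the -if- matters because missing counts as larger than any number
local threshold = 140
by pt : egen proportion = mean(bp_systolic >= `threshold') if bp_systolic < .

list, sepby(pt)
```

Unlike the loop, this leaves -proportion- missing in any observation
with missing -bp_systolic-; as those observations were dropped
earlier in Andrew's code, the results should agree.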

5. However, this is more efficient, as
it avoids the interpretive overhead of -egen-
(-viewsource egen.ado- and -viewsource _gtotal.ado-
to see what I mean).

------------------------------------ #5
by pt : gen time = sum(obstime)
by pt : replace time = time[_N]
------------------------------------
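
A quick check on made-up data suggests that #4 and #5 agree. -sum()-
is a running sum that treats missing as zero, so the missing
-obstime- on each patient's last visit does not disturb the total
held in the last observation. The values below are hypothetical, and
-time4- and -time5- are names chosen only for this comparison.

```stata
* Toy data, hypothetical values
clear
input pt obstime
1 10
1 20
1  .
2  5
2  .
end
sort pt

* #4: egen
by pt : egen time4 = total(obstime)

* #5: running sum, then copy the last (complete) value to every observation
by pt : gen time5 = sum(obstime)
by pt : replace time5 = time5[_N]

assert time4 == time5
list, sepby(pt)
```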

Nick
n.j.cox@durham.ac.uk

Andrew O'Connor wrote:

I'm hoping someone can offer some help; I've been working on this for
some time now.
I'm running Stata 8.2 SE and have a large dataset (>90,000 rows of data
with about 12,000 unique record numbers, multiple observations for the
same individual).
I'm trying to calculate a "time out of range" for each patient (i.e. the
proportion of each patient's observation time that is predicted to be
greater than 140 assuming a linearly interpolated slope of actually
measured blood pressures, not simply the proportion of blood pressure
readings that is greater than my threshold).  I have 3 variables: MRN
(medical record number), Visit_date, bp_systolic

I've run into a problem due to the size of my data set, specifically
that I have too many levels.  Here is my code:
encode mrn, gen (pt)
sort pt visit_date
drop if bp_systolic==.
by pt: gen obstime = visit_date[_n+1]-visit_date
by pt: gen sys_diff = bp_systolic[_n+1]-bp_systolic
by pt: gen slope = sys_diff/obstime
by pt: gen predict = (140-bp_systolic)/slope if bp_systolic<140 & bp_systolic[_n+1]>=140
by pt: gen date140 = visit_date + predict
by pt: gen predict2 = floor([140-bp_systolic]/slope) if bp_systolic>=140 & bp_systolic[_n+1]<140
by pt: gen date140down = visit_date[_n-1] - predict2
by pt: gen out_range = obstime if bp_systolic>=140 & bp_systolic[_n+1]>=140
by pt: replace out_range = visit_date[_n+1] - date140 if bp_systolic<140 & bp_systolic[_n+1]>=140
by pt: replace out_range = obstime - predict2 if bp_systolic>=140 & bp_systolic[_n+1]<140
gen time=.

levels pt, local(levels)
quietly foreach l of local levels {
	sum obstime if pt==`l'
	local total=r(sum)
	replace time=`total' if pt==`l'
}
gen time_out=.
quietly foreach l of local levels {
	sum out_range if pt==`l'
	local total =r(sum)
	replace time_out=`total' if pt==`l'
}
gen time_o_r=(time_out/time)
local threshold = 140
gen proportion=.
levels pt, local(levels)
quietly foreach l of local levels {
	count if pt == `l' & bp_systolic !=.
	local total =r(N)
	count if bp_systolic >= `threshold' & bp_systolic !=. & pt == `l'
	replace proportion = r(N)/`total' if pt == `l'
}
Any suggestions for using a different set of programming statements???

