From
Daniel Escher <descher@nd.edu>

To
statalist@hsphsun2.harvard.edu

Subject
Re: st: Counting observations within groups

Date
Sat, 1 Dec 2012 08:57:10 -0500

Nick, thank you for your insights and for pointing out that it is
safer to specifically store the mean as a local rather than rely on
Stata's temporary memory of scalars. I tried your code below with the
addition of a condition about missing data, and it worked well
(roughly as fast as Austin's code):

su totprod, mean
local m = r(mean)
egen big= total(totprod>`m' & totprod<. & (sic==12110|sic==11110)), by(fips)

I had tried something similar (my Attempt 2) but without the necessary
parentheses. Those make such a difference in this case.

On Fri, Nov 30, 2012 at 10:07 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> totprod > `m'
>
> won't work unless the local macro `m' is defined. Two lines in
> Austin's code not cited here showed how to do that
>
> su totprod, mean
> loc m=r(mean)
>
> I can't test for your data, but
>
> su totprod, mean
> local m = r(mean)
> egen big= total(totprod>`m' & (sic==12110|sic==11110)), by(fips)
>
> is, I think, equivalent.
>
> Also,
>
> su totprod, mean
> egen big= total(totprod>`r(mean)' & (sic==12110|sic==11110)), by(fips)
>
> is equivalent to that.
>
> su totprod, mean
> egen big= total(totprod>r(mean) & (sic==12110|sic==11110)), by(fips)
>
> is living more dangerously as interpretation of r(mean) is postponed
> until within -egen-.
>
> The -egen- route is unlikely to be faster computationally because
> -egen- includes several lines of interpreted code;
> all the important ones and none of the unimportant ones are in
> Austin's code. However, it might be easier to work out in real time
> that this is code that should work.
>
> I attempted a survey of little methods in similar territory in
>
> SJ-11-2 dm0055 . . . . . . . . . . . . . . Speaking Stata: Compared with ...
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
> Q2/11 SJ 11(2):305--314 (no commands)
> reviews techniques for relating values to values in other
> observations
>
> The common ground is realised when you see that the argument of (in
> this case) -egen, total()- can be an _expression_ (which can be
> (much) more complicated than a variable name).
>
> On Fri, Nov 30, 2012 at 1:12 PM, Daniel Escher <descher@nd.edu> wrote:
>> Austin,
>>
>> Thank you so much! I had forgotten about using levelsof to create a
>> local of all values in a variable. In this case, your third option was
>> computationally quickest, but I'll keep the first two options in my
>> head for later situations. For some reason, totprod>`m' needed to be
>> changed to totprod>r(mean). Thus,
>>
>> su totprod, mean
>> g big=(totprod>r(mean)&totprod<.)&(sic==12110|sic==11110)
>> by fips: g sbig=sum(big)
>> by fips: replace sbig=sbig[_N]
>>
>>
>> On Thu, Nov 29, 2012 at 6:03 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> Daniel Escher <descher@nd.edu>:
>>>
>>> I sent my prior post a bit prematurely... I meant to go on to say--
>>> but one does not need a loop for this particular problem.
>>>
>>> Make a dummy, sum within county:
>>>
>>> su totprod, mean
>>> g big=(totprod>`m'&totprod<.)&(sic==12110|sic==11110)
>>> bys fips: g sbig=sum(big)
>>> by fips: replace sbig=sbig[_N]
>>>
>>> On Thu, Nov 29, 2012 at 5:48 PM, Daniel Escher <descher@nd.edu> wrote:
>>>> Hello,
>>>>
>>>> I am trying to count the number of mines in a county by production.
>>>> I.e., I'd like the number of mines in each county that are above the
>>>> overall mean of production, and the number that are below. There are
>>>> multiple mines per county, which is identified by its FIPS code.
>>>> Missing data are marked by . The data are in long format.
>>>>
>>>> Here's what I have so far:
>>>> . *bigmines = # of mines in a county above the overall mean
>>>> . *totprod = total production per mine
>>>> . *sic = type of mine
>>>>
>>>> . *ATTEMPT ONE
>>>> . sort fips
>>>> . su totprod // to get mean
>>>> . by fips: egen bigmines = count(inrange(totprod, r(mean), .) &
>>>> sic==12110 | sic==11110) // This gives me total number of mines per
>>>> FIPS code - not those that meet the criteria
>>>> . drop bigmines
>>>>
>>>> . *ATTEMPT TWO
>>>> . su totprod // to get mean
>>>> . by fips: egen bigmines = total(mshahrs > r(mean) & sic==12110 |
>>>> sic==11110) // This gives me the total number of mines per FIPS code
>>>> if any mine exceeds the mean
>>>> . drop bigmines
>>>>
>>>> . *ATTEMPT THREE
>>>> . *Then I read Nick Cox's helpful article
>>>> (http://www.stata-journal.com/sjpdf.html?articlenum=pr0029) which
>>>> clued me in to -count-:
>>>> . gen bigmines = 0
>>>> . su totprod
>>>> . count if inrange(totprod, r(mean), .) & sic==12110 | sic==11110
>>>> . replace bigmines = r(N)
>>>>
>>>> The last attempt is what I want, and it "works." However, I don't know
>>>> how to -count- and then store r(N) for each FIPS code. Using -by- does
>>>> not seem to work. This probably requires a loop like...
>>>>
>>>> forvalues j = all values of fips {
>>>> count if inrange(mshahrs, r(mean), .) & sic==12110 | sic==11110
>>>> replace bigmines_hrs = r(N)
>>>> }
>>>>
>>>> Is this close? Thank you so much for your help and time.
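
For anyone reading this thread later, here is a minimal self-contained sketch of the approach that ended up working, run on made-up toy data. The data values and the variable name isbig are assumptions for illustration only; big, sbig, totprod, sic and fips follow the names used in the thread. The parentheses matter because Stata evaluates & before |, so a & b | c is read as (a & b) | c, which is why the unparenthesized attempts counted every mine with sic==11110.

```stata
* Toy data: hypothetical values, not taken from the thread
clear
input fips sic totprod
1001 12110 100
1001 12110 900
1001 11110 .
1001 14000 950
1002 11110 800
1002 12110 700
1002 12110 50
end

* Store the overall mean in a local macro immediately,
* rather than relying on r(mean) still being intact later
summarize totprod, meanonly
local m = r(mean)

* Route 1: egen total() of an expression, by county.
* totprod < . excludes missing values, which otherwise count as very large
egen big = total(totprod > `m' & totprod < . & (sic == 12110 | sic == 11110)), by(fips)

* Route 2: indicator variable plus running sum within county
generate byte isbig = (totprod > `m' & totprod < .) & (sic == 12110 | sic == 11110)
bysort fips: generate sbig = sum(isbig)
by fips: replace sbig = sbig[_N]

* Both routes should agree in every observation
assert big == sbig
list fips sic totprod big sbig, sepby(fips)
```

With these toy values the overall mean is about 583.3, so county 1001 ends up with big = 1 (the 950 mine has a different sic and the missing value is excluded) and county 1002 with big = 2.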

