



Re: st: Counting observations within groups


From   Daniel Escher <descher@nd.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Counting observations within groups
Date   Sat, 1 Dec 2012 08:57:10 -0500

Nick, thank you for your insights and for pointing out that it is
safer to specifically store the mean as a local rather than rely on
Stata's temporary memory of scalars. I tried your code below with the
addition of a condition about missing data, and it worked well
(roughly as fast as Austin's code):

su totprod, mean
local m = r(mean)
egen big= total(totprod>`m' & totprod<. & (sic==12110|sic==11110)), by(fips)

I had tried something similar (my Attempt 2) but without the necessary
parentheses. Those make such a difference in this case.
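
For anyone reading this in the archive, here is a minimal sketch of why the parentheses matter. In Stata, & is evaluated before |, so without parentheses the production test applies only to sic==12110 and every sic==11110 mine is counted. The eight observations and the variable name loose are made up purely for illustration; they are not my data.

```stata
* Toy data, invented for illustration only
clear
input fips sic totprod
1 12110 10
1 12110 55
1 11110 .
1 99999 80
2 12110 5
2 11110 60
2 99999 90
2 12110 70
end

* Store the overall mean in a local macro right away
summarize totprod, meanonly
local m = r(mean)

* No parentheses: parsed as (A & B & sic==12110) | sic==11110,
* so every sic==11110 mine is counted, even one with missing totprod
egen loose = total(totprod > `m' & totprod < . & sic == 12110 | sic == 11110), by(fips)

* Parentheses: the production test applies to both mine types
egen big = total(totprod > `m' & totprod < . & (sic == 12110 | sic == 11110)), by(fips)

list fips sic totprod loose big, sepby(fips)
```

In this toy data the two versions disagree for fips 1, where the sic==11110 mine with missing production is counted by loose but not by big.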

On Fri, Nov 30, 2012 at 10:07 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> totprod > `m'
>
> won't work unless the local macro `m' is defined. Two lines in
> Austin's code not cited here showed how to do that
>
> su totprod, mean
> loc m=r(mean)
>
> I can't test for your data, but
>
> su totprod, mean
> local m = r(mean)
> egen big= total(totprod>`m' & (sic==12110|sic==11110)), by(fips)
>
> is, I think, equivalent.
>
> Also,
>
> su totprod, mean
> egen big= total(totprod>`r(mean)' & (sic==12110|sic==11110)), by(fips)
>
> is equivalent to that.
>
> su totprod, mean
> egen big= total(totprod>r(mean) & (sic==12110|sic==11110)), by(fips)
>
> is living more dangerously as interpretation of r(mean) is postponed
> until within -egen-.
>
> The -egen- route is unlikely to be faster computationally because
> -egen- includes several lines of interpreted code;
> all the important ones and none of the unimportant ones are in
> Austin's code. However, it might be easier to work out in real time
> that this is code that should work.
>
> I attempted a survey of little methods in similar territory in
>
> SJ-11-2 dm0055  . . . . . . . . . . . . . .  Speaking Stata: Compared with ...
>         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
>         Q2/11   SJ 11(2):305--314                                (no commands)
>         reviews techniques for relating values to values in other
>         observations
>
> The common ground is realised when you see that the argument of (in
> this case) -egen, total()-  can be an _expression_ (which can be
> (much) more complicated than a variable name).
>
> On Fri, Nov 30, 2012 at 1:12 PM, Daniel Escher <descher@nd.edu> wrote:
>> Austin,
>>
>> Thank you so much! I had forgotten about using levelsof to create a
>> local of all values in a variable. In this case, your third option was
>> computationally quickest, but I'll keep the first two options in my
>> head for later situations. For some reason, totprod>`m' needed to be
>> changed to totprod>r(mean). Thus,
>>
>> su totprod, mean
>> g big=(totprod>r(mean)&totprod<.)&(sic==12110|sic==11110)
>> by fips: g sbig=sum(big)
>> by fips: replace sbig=sbig[_N]
>>
>>
>> On Thu, Nov 29, 2012 at 6:03 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> Daniel Escher <descher@nd.edu>:
>>>
>>> I sent my prior post a bit prematurely... I meant to go on to say--
>>> but one does not need a loop for this particular problem.
>>>
>>> Make a dummy, sum within county:
>>>
>>> su totprod, mean
>>> g big=(totprod>`m'&totprod<.)&(sic==12110|sic==11110)
>>> bys fips: g sbig=sum(big)
>>> by fips: replace sbig=sbig[_N]
>>>
>>> On Thu, Nov 29, 2012 at 5:48 PM, Daniel Escher <descher@nd.edu> wrote:
>>>> Hello,
>>>>
>>>> I am trying to count the number of mines in a county by production.
>>>> I.e., I'd like the number of mines in each county that are above the
>>>> overall mean of production, and the number that are below. There are
>>>> multiple mines per county, which is identified by its FIPS code.
>>>> Missing data are marked by . The data are in long format.
>>>>
>>>> Here's what I have so far:
>>>> . *bigmines = # of mines in a county above the overall mean
>>>> . *totprod = total production per mine
>>>> . *sic = type of mine
>>>>
>>>> . *ATTEMPT ONE
>>>> . sort fips
>>>> . su totprod // to get mean
>>>> . by fips: egen bigmines = count(inrange(totprod, r(mean), .) &
>>>> sic==12110 | sic==11110)  // This gives me total number of mines per
>>>> FIPS code - not those that meet the criteria
>>>> . drop bigmines
>>>>
>>>> . *ATTEMPT TWO
>>>> . su totprod // to get mean
>>>> . by fips: egen bigmines = total(mshahrs > r(mean) & sic==12110 |
>>>> sic==11110) // This gives me the total number of mines per FIPS code
>>>> if any mine exceeds the mean
>>>> . drop bigmines
>>>>
>>>> . *ATTEMPT THREE
>>>> . *Then I read Nick Cox's helpful article
>>>> (http://www.stata-journal.com/sjpdf.html?articlenum=pr0029) which
>>>> clued me in to -count-:
>>>> . gen bigmines = 0
>>>> . su totprod
>>>> . count if inrange(totprod, r(mean), .) & sic==12110 | sic==11110
>>>> . replace bigmines = r(N)
>>>>
>>>> The last attempt is what I want, and it "works." However, I don't know
>>>> how to -count- and then store r(N) for each FIPS code. Using -by- does
>>>> not seem to work. This probably requires a loop like...
>>>>
>>>> forvalues j = all values of fips {
>>>>         count if inrange(mshahrs, r(mean), .) & sic==12110 | sic==11110
>>>>         replace bigmines_hrs = r(N)
>>>> }
>>>>
>>>> Is this close? Thank you so much for your help and time.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

