Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# RE: st: RE: creating variables using 'by' for subsets of records

 From "Nick Cox" To Subject RE: st: RE: creating variables using 'by' for subsets of records Date Thu, 25 Feb 2010 17:50:36 -0000

```I don't think this is nearly as difficult as it may seem.

Only in complicated cases is it likely to be necessary to calculate
numerator and denominator separately, and the necessity is even then
often psychological rather than logical. It can do little harm to set
out the calculation in stages, and that may provide clearer
documentation of what was done.

That said, a simple technique is to hand. With the redoubtable auto
data, the alternatives

egen prforeign = mean(foreign == 1), by(rep78)

bysort rep78 : egen prforeign = mean(foreign == 1)

give the main idea. The main pitfall with that approach is that missings
are going to be included, so that

foreign == .

is part of the complement of

foreign == 1

In the auto dataset, there are no missings for -foreign-, so that
doesn't bite, but a safer example to follow would be

egen prforeign = mean(foreign == 1) if !missing(foreign), by(rep78)

bysort rep78 : egen prforeign = mean(foreign == 1) if !missing(foreign)

The expression here

foreign == 1

stands in for any true-or-false definition of an indicator.

If percents are needed rather than proportions, then it is evident that
the results can just be multiplied by 100, and that the best way to do
that is within -egen-:

egen pcforeign = mean(100 * (foreign == 1)) if !missing(foreign),
by(rep78)

bysort rep78 : egen pcforeign = mean(100 * (foreign == 1)) if
!missing(foreign)

Nick
n.j.cox@durham.ac.uk

Martin Weiss

Several issues pose themselves here. A solution could be something like
this:

*************
clear*

inp str10 Individual region Indicator
A 1 0
B 1 1
C 2 1
D 2 1
end

encode Individual, gen(id)
compress

bys region: egen countofindicator =total(Indicator)
by region: gen ratio=countofindicator/_N
li, noo
*************

" I have attempted
to use bys region: egen y=count if Indicator==1 but receive an invalid
syntax error.  "

The -egen- function "count()" demands an argument which it is supposed
to
count. That is why you get an error. Furthermore, the "count" function
merely counts nonmissing arguments, no matter what their value may be.
The
appropriate function to sum up values is -total()-, as seen in the
example.
To get a total of all observations defined by the -by- groups, you can
also
use "_N", see [U], 13.7.2.

(And do not even get started on -sum()-, see
http://www.stata.com/statalist/archive/2009-04/msg00699.html.)

" As an aside, is there a way to specify the variable y/x without
specifying
y
and x?"

Finally, getting all of this in one fell swoop is difficult, or rather
impossible once an -egen- function is involved. Doing this step by step
helps when debugging...

John Westbury

Thanks much for the feedback.  Here is an example of what the data looks
like that I am using:

Individual region Indicator  A 1 0  B 1 1  C 2 1  D 2 1
I have encoded the regions and the ratio I am attempting to create would
be
intuitively expressed as:
by region: count of indicator==1/count of individual.

I am trying to create a variable for the numerator by region (call it y)
and
denominator by region (call it x) and then use gen ratio=y/x.
I can create a variable (x) for the denominator using; bys region: egen
x=count(Indicator).
I am having trouble creating a variable for the numerator.  I have
attempted
to use bys region: egen y=count if Indicator==1 but receive an invalid
syntax error.  If someone has a suggestion on how to specify a variable
for
a count of indicator==1 by region I would be very appreciative.

As an aside, is there a way to specify the variable y/x without
specifying y
and x?

Martin Weiss

> In the absence of example data, it is hard to give you advice. Look at
this
> calculation of regional unemployment rates:
>
>
> *******
> clear*
>
> //10 regions
> set obs 10
> gen byte region=_n
>
> //50 indiv per region
> expand 50
> bys region: gen byte id=_n
> gen byte unemployed=runiform()>.9
>
> bys region: gen number=_N
> by region: egen numofunempl=total(unemployed)
>
> gen unemprate=numofunempl/number
> *******

John Westbury

> I have records for individuals by geographic region and wish to
aggregate
> the records for individuals to records for geographic regions.  I
believe
I
> should create variables for those regions using 'by'.  Ex: by Region
gen x
> =
> argument for variable.  I am having difficulty with arguments for
variable
> x.  For example I wish to create a region variable that expresses a
ratio
> of
> count of indicator values for individuals in a region to a count of
> individuals in the region and am unsure how to code this.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```