Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
"Nick Cox" <n.j.cox@durham.ac.uk> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
RE: st: RE: creating variables using 'by' for subsets of records |

Date |
Thu, 25 Feb 2010 17:50:36 -0000 |

I don't think this is nearly as difficult as it may seem. Only in complicated cases is it likely to be necessary to calculate numerator and denominator separately, and the necessity is even then often psychological rather than logical. It can do little harm to set out the calculation in stages, and that may provide clearer documentation of what was done. That said, a simple technique is to hand. With the redoubtable auto data, the alternatives egen prforeign = mean(foreign == 1), by(rep78) bysort rep78 : egen prforeign = mean(foreign == 1) give the main idea. The main pitfall with that approach is that missings are going to be included, so that foreign == . is part of the complement of foreign == 1 In the auto dataset, there are no missings for -foreign-, so that doesn't bite, but a safer example to follow would be egen prforeign = mean(foreign == 1) if !missing(foreign), by(rep78) bysort rep78 : egen prforeign = mean(foreign == 1) if !missing(foreign) The expression here foreign == 1 stands in for any true-or-false definition of an indicator. If percents are needed rather than proportions, then it is evident that the results can just be multiplied by 100, and that the best way to do that is within -egen-: egen pcforeign = mean(100 * (foreign == 1)) if !missing(foreign), by(rep78) bysort rep78 : egen pcforeign = mean(100 * (foreign == 1)) if !missing(foreign) Nick n.j.cox@durham.ac.uk Martin Weiss Several issues pose themselves here. A solution could be something like this: ************* clear* inp str10 Individual region Indicator A 1 0 B 1 1 C 2 1 D 2 1 end encode Individual, gen(id) compress bys region: egen countofindicator =total(Indicator) by region: gen ratio=countofindicator/_N li, noo ************* " I have attempted to use bys region: egen y=count if Indicator==1 but receive an invalid syntax error. " The -egen- function "count()" demands an argument which it is supposed to count. That is why you get an error. Furthermore, the "count" function merely counts nonmissing arguments, no matter what their value may be. The appropriate function to sum up values is -total()-, as seen in the example. To get a total of all observations defined by the -by- groups, you can also use "_N", see [U], 13.7.2. (And do not even get started on -sum()-, see http://www.stata.com/statalist/archive/2009-04/msg00699.html.) " As an aside, is there a way to specify the variable y/x without specifying y and x?" Finally, getting all of this in one fell swoop is difficult, or rather impossible once an -egen- function is involved. Doing this step by step helps when debugging... John Westbury Thanks much for the feedback. Here is an example of what the data looks like that I am using: Individual region Indicator A 1 0 B 1 1 C 2 1 D 2 1 I have encoded the regions and the ratio I am attempting to create would be intuitively expressed as: by region: count of indicator==1/count of individual. I am trying to create a variable for the numerator by region (call it y) and denominator by region (call it x) and then use gen ratio=y/x. I can create a variable (x) for the denominator using; bys region: egen x=count(Indicator). I am having trouble creating a variable for the numerator. I have attempted to use bys region: egen y=count if Indicator==1 but receive an invalid syntax error. If someone has a suggestion on how to specify a variable for a count of indicator==1 by region I would be very appreciative. As an aside, is there a way to specify the variable y/x without specifying y and x? Martin Weiss > In the absence of example data, it is hard to give you advice. Look at this > calculation of regional unemployment rates: > > > ******* > clear* > > //10 regions > set obs 10 > gen byte region=_n > > //50 indiv per region > expand 50 > bys region: gen byte id=_n > gen byte unemployed=runiform()>.9 > > bys region: gen number=_N > by region: egen numofunempl=total(unemployed) > > gen unemprate=numofunempl/number > ******* John Westbury > I have records for individuals by geographic region and wish to aggregate > the records for individuals to records for geographic regions. I believe I > should create variables for those regions using 'by'. Ex: by Region gen x > = > argument for variable. I am having difficulty with arguments for variable > x. For example I wish to create a region variable that expresses a ratio > of > count of indicator values for individuals in a region to a count of > individuals in the region and am unsure how to code this. * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: creating variables using 'by' for subsets of records***From:*John Westbury <jrwestbury@gmail.com>

**Re: st: RE: creating variables using 'by' for subsets of records***From:*John Westbury <jrwestbury@gmail.com>

**AW: st: RE: creating variables using 'by' for subsets of records***From:*"Martin Weiss" <martin.weiss1@gmx.de>

- Prev by Date:
**Re: st: AW: create a dataset or vector of variable names** - Next by Date:
**RE: st: AW: create a dataset or vector of variable names** - Previous by thread:
**AW: st: RE: creating variables using 'by' for subsets of records** - Next by thread:
**Re: st: interactions for xtmelogit models** - Index(es):