# Re: st: RE: Saving percentage distribution

From: zt22@cornell.edu
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: Saving percentage distribution
Date: Tue, 3 Dec 2002 14:14:57 -0500 (EST)

```
Nick,

Sorry I did not describe the data. The two vars are part of a huge
dataset that has more than 100,000 observations. What I really want to do
is to use the percentages as weights to adjust for regression
coefficients. That is, I ran a regression on logincome with about 70
independent vars, 52 of which are dummies for industry. I save the
coefficients for these dummies as b1-b52 and then obtain the percentage
for each industry as p1-p52. The final product I want is the standard
deviation of the industry effects calculated by:
for i = 1/52 (pseudocode):
mubar    = sum over i of b`i' * p`i'
variance = sum over i of p`i' * (b`i' - mubar)^2
sd       = sqrt(variance)

I can get p`i' by counting the N for the whole sample and then counting
N`i' for each industry so that p`i'=N`i'/N. But this takes a lot of time
because I need to generate 52 dummy variables. I am wondering if there is
a faster way of doing this. Thanks very much.

Best,
Zun

On Tue, 3 Dec 2002, Nick Cox wrote:

> Zun
> >
> > I have two vars ind (52 categories) and occ (7 categories), and I want
> > the percentage distribution of ind for each category of occ. Note that
> > not every ind category has cases. For instance:
> >
> > Occ=1
> > ind     pct
> > 1       .0309522
> > 2       .0334331
> > 3       0
> > 4       .0356777
> > 5       .3402772
> > 6       .0294558
> > .       .
> > .       .
> > 52      .3151532
> >
> > Occ=2
> > ind     pct
> > 1       .0036623
> > 2       .0006301
> > 3       0
> > 4       .0064976
> > 5       0
> > 6       .0455619
> > .       .
> > .       .
> > 52      .0953769
> >
> > As shown above, ind=3 appears in neither occ=1 nor occ=2, while ind=5
> > appears in occ=1 but not in occ=2.
> >
> > My questions are:
> >
> > First, if I use tabulate to get the percentage distribution of any
> > categorical variable, how can I save the percentages in a new dataset
> > that looks like one of the tables above?
> >
> > Second, in the specific example above, is there a way I can create a
> > new dataset that looks like this:
> >
> > ind     pctocc1         pctocc2
> > 1       .0309522        .0036623
> > 2       .0334331        .0006301
> > 3       0               0
> > 4       .0356777        .0064976
> > 5       .3402772        0
> > 6       .0294558        .0455619
> > .       .               .
> > .       .               .
> > 52      .3151532        .0953769
> >
>
> I guess that you have at most 52 * 7 observations.
> Forget -tabulate-: a direct calculation is better.
>
> Typing
>
> . findit percent
>
> does point to lots of things, but one pertinent command is -egen-.
>
> . bysort occ : egen pctocc = pc(ind)
>
> followed by a -reshape- may help. You may need
> to -replace- any missings by 0.
>
>
> Nick
> n.j.cox@durham.ac.uk
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

```
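
The calculation Zun describes can be sketched in Stata roughly as follows. This is only an illustration under assumptions that are not in the original post: the toy data, the coefficient values, and the names `p`, `b`, `pb`, `pdev2` and `total` are all hypothetical. It obtains the industry shares from group sizes with -bysort- rather than from 52 dummy variables, then forms the weighted mean and the weighted standard deviation.

```stata
* Toy data (hypothetical): 10 observations in 3 industries
clear
input ind
1
1
1
1
1
2
2
2
3
3
end

* Share of each industry without creating dummies:
* under -by-, _N is the group size; `total' holds the overall N
local total = _N
bysort ind: generate double p = _N / `total'

* Keep one row per industry
by ind: keep if _n == 1

* Hypothetical coefficients; in practice these would come from the regression
generate double b = .
replace b = 0.10 if ind == 1
replace b = 0.30 if ind == 2
replace b = -0.20 if ind == 3

* Weighted mean: mubar = sum over i of p_i * b_i
generate double pb = p * b
quietly summarize pb
scalar mubar = r(sum)

* Weighted variance and standard deviation
generate double pdev2 = p * (b - mubar)^2
quietly summarize pdev2
scalar sd = sqrt(r(sum))

display "mubar = " mubar ", sd = " sd
```

With these toy values the shares are .5, .3 and .2, so mubar works out to .1 and sd to sqrt(.03), about .173. In current versions of Stata, -contract- with its percent() option may be another route to one row per category with its percentage.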