


Re: st: differences of mean

From   Nick Cox
Subject   Re: st: differences of mean
Date   Tue, 27 Mar 2012 13:18:37 +0100

As others have indicated, there are _much_ better regression-based
approaches to what appears to be the underlying research question
here. There is also a separate issue of whether it is better to work
with wage or its logarithm.

In contrast, here I focus on the pure Stata issue of generating
variables that hold conditional means, as these are needed in other
circumstances (e.g. for graphics).

Consider Chiara's code:

bys occupation: egen wage=mean(lwage)
ge wage1f=wage if occupation==1 & fem==1
ge wage1m=wage if occupation==1 & fem==0

She wants the mean for each occupation and each gender. But the first
statement mixes the genders together. For that reason the next two
statements cannot identify separate means for different genders. The
results will be the same for both genders and a given occupation.
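A minimal sketch of the problem on made-up data (the three observations below are assumptions for illustration, not Chiara's data). The pooled mean lands in every observation of the occupation, -wage1f- and -wage1m- are then non-missing in disjoint sets of observations, and so their difference is missing everywhere:

```stata
clear
input occupation fem lwage
1 0 2
1 0 3
1 1 1
end

* the mean is taken over men and women together: 2 in every observation
bysort occupation : egen wage = mean(lwage)

* same value, copied to women only and to men only
generate wage1f = wage if occupation == 1 & fem == 1
generate wage1m = wage if occupation == 1 & fem == 0

* no observation has both variables non-missing
generate diff1 = wage1m - wage1f if occupation == 1

list
```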

This code would do what Chiara seems to want. I switch to a generic
response -y-:

bysort occupation : egen y_f = mean(y / (fem == 1))
bysort occupation : egen y_m = mean(y / (fem == 0))

Note that dividing by 1 gives the numerator and dividing by 0 gives
missing. Missings are ignored by -egen- in this case.
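As a sketch only, here is the device on made-up data, carried one step further to the difference Chiara asked about. The data values and the variable name -diff- are assumptions for illustration:

```stata
clear
input occupation fem y
1 0 10
1 0 12
1 1 8
1 1 9
2 0 20
2 1 15
2 1 17
end

* (fem == 1) is 1 for women and 0 for men, so y / (fem == 1)
* is y for women and missing for men; mean() ignores the missings
bysort occupation : egen y_f = mean(y / (fem == 1))
bysort occupation : egen y_m = mean(y / (fem == 0))

* both means are held in every observation of the occupation,
* so the difference is defined for men and women alike
generate diff = y_m - y_f

list, sepby(occupation)
```

Dropping the -bysort occupation :- prefix from both -egen- statements should give the overall means, and hence the total difference, in the same way.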

See also

Cox, N.J. 2011. Compared with .... Stata Journal 11(2): 305-314

Abstract.  Many problems in data management center on relating values
to values in other observations, either within a dataset as a whole or
within groups such as panels. This column reviews some basic Stata
techniques helpful for such tasks, including the use of subscripts,
summarize, by:, sum(), cond(), and egen. Several techniques exploit
the fact that logical expressions yield 1 when true and 0 when false.
Dividing by zero to yield missings is revealed as a surprisingly
valuable device.

On Tue, Mar 27, 2012 at 12:41 PM, Chiara Mussida wrote:

> I have to calculate the difference of mean log wages of men and women.
> My dataset contains the variable lwage which is the log of wages. I
> tried to generate the mean wage (also by occupation, for a more
> detailed difference):
> bys occupation: egen wage=mean(lwage)
> ge wage1f=wage if occupation==1 & fem==1
> ge wage1m=wage if occupation==1 & fem==0
> ge diff1=wage1m - wage1f if occupation==1
> but this gave me a variable diff1 with no observations, since the mean
> lwage for men is missing when the mean lwage for women is not, and
> vice versa. Again, if I replace the missing values for men and women
> with 0, this gives me false results (I know there is a difference
> between missing and 0).
> How should I get my variable diff= mean(lwage men) - mean(lwage
> women)? Total difference and/or difference by occupation.
