
st: RE: Extensions to: Creating variables recording properties of the other members of a group


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Extensions to: Creating variables recording properties of the other members of a group
Date   Thu, 29 Aug 2002 14:28:33 +0100

Guillermo Cruces

[ ... ]

> In my example, I have a household survey where I don't have
> direct information
> about the number of kids of each individual, but I have
> something like this:
> hhid and member are just the household id and number of
> member. Variables
> fatherm and motherm tell you the number of the member of
> the father and the
> mother, if in the household:

[ ... ]

> I want to create the variable ownkids that gives me the
> number of own kids
> living in the house:

[ ... ]

I replied to Guillermo's posting with a proffered
solution, but I didn't answer one of his questions.

> My brute-force solution, which makes a lot of unnecessary
> comparisons and takes
> very long (because I generate and drop many variables) is
> of the form: with
> maxmem being the number of members of each household (group
> i, max is the number
> of groups),
> forvalues i = 1/`max' {
>      qui sum member if group==`i'
>      local maxmem=r(max)
>      forvalues j = 1/`maxmem' {
>      di "-----------Household number `i', number of members: `maxmem'"
>      forvalues k = 1/`maxmem' {
>           di "Household `i', member `j', comparing with `k'"
>           qui gen a=motherm==`j' if member==`k'&group==`i'
>           qui egen b=max(a)
>           qui replace mkids=mkids+b if member==`j'&group==`i'
>           drop a b
>           qui gen a=fatherm==`j' if member==`k'&group==`i'
>           qui egen b=max(a)
>           qui replace fkids=fkids+b if member==`j'&group==`i'
>           drop a b
>           }
>      }
> }
>
> This creates two variables, mkids and fkids, which are the
> number of kids for
> mothers and fathers. For each member of the household, I
> compare if . The egen,
> replace, drop, takes very long, and even longer if the
> dataset in memory is
> large (I had to partition the dataset in 25 parts to make
> this run faster).
> The main problem (the main awkwardness in this program) is
> that I gen, egen,
> etc. because I could not just create a scalar that reflects
> the value of a
> variable for one precise observation, something of the form
> (which of course
> doesn't work):
> local a=mother==`j'    if member==`k'&group==`i' (meaning:
> mother etc. should
> refer to the observation: member==`k'&group==`i')
> I couldn't use something like motherm[_...] because I was not
> using by: ... .
> What I would like to know is whether there are more efficient ways
> of doing this (I'm
> sure there are!).

As indicated separately, this code is a triple
loop which can be reduced to at most one loop.
For the details, see my earlier posting.
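
For readers without that posting to hand, here is a minimal sketch of
one single-loop approach. It is not necessarily the code from the
earlier posting. It assumes the variables hhid, member, fatherm and
motherm described in the question, with fatherm and motherm missing
when the parent is not in the household. The names maxmem and kids are
made up for the illustration.

```stata
* Sketch only: assumes hhid, member, fatherm, motherm as in the question.
* maxmem and kids are hypothetical names used for illustration.
generate ownkids = 0
summarize member, meanonly
local maxmem = r(max)
forvalues j = 1/`maxmem' {
    * running count, within household, of members naming `j' as a parent
    bysort hhid: generate kids = sum(fatherm == `j' | motherm == `j')
    * the last value of the running count is the household total
    by hhid: replace ownkids = kids[_N] if member == `j'
    drop kids
}
```

The loop runs over member numbers only, so its length is the size of
the largest household, not the number of households times the square
of that size.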

But the steps

. egen b = max(a)
...
. drop b

could have been cut in a way that is of much wider
interest and applicability.

Guillermo wants just one number, the maximum. A good way to
get it is, in general,

. summarize a, meanonly

followed by

. scalar b = r(max)

or

. local b = r(max)

or just by using r(max) or `r(max)' directly after
the -summarize-:

. qui replace fkids=fkids + r(max) if member==`j'&group==`i'

If you try this out for yourself, say with the auto
data

. su mpg, meanonly

you will see nothing! The point, however, is what -summarize-
leaves in its own wake. Type

. ret li

and you will see results which can be picked up
for subsequent use. Note in particular that

. su mpg, meanonly

is faster than

. su mpg

because the second also calculates the sd and the
variance. If you don't need either, you should
use the speedier command.
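
The sequence above can be tried as a short do-file. This is a sketch
only: -sysuse- assumes a Stata version that provides it (otherwise load
the auto data however your installation allows), and the name top is
arbitrary.

```stata
* Sketch only: -sysuse- and the name top are assumptions.
sysuse auto, clear
* displays nothing, but leaves results behind
summarize mpg, meanonly
* list what -summarize- left in r()
return list
* use a returned result directly
display r(max)
* or copy it to a local macro
local top = r(max)
display `top'
* or copy it to a scalar
scalar top = r(max)
display scalar(top)
```

Pick up r(max) before running another r-class command, as each such
command replaces the returned results of the one before.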

A separate point is that -egen- is an ado which
calls another ado, and so there is an overhead
for Stata which is obliged to interpret a few
dozen command lines. Done once, that is less
than a blink, but done repeatedly, it doesn't
help any process which is already too slow.
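
One rough way to see the overhead for yourself is to time both routes
with -timer-. This is a sketch only: the timer numbers 1 and 2, the
1000 repetitions and the use of the auto data are arbitrary choices,
and the timings will depend on your machine and Stata version.

```stata
* Sketch only: timer numbers, repetition count and dataset are arbitrary.
sysuse auto, clear
timer clear

* route 1: -egen- into a variable, then drop it
timer on 1
forvalues r = 1/1000 {
    quietly egen b = max(mpg)
    drop b
}
timer off 1

* route 2: -summarize, meanonly- and a local macro
timer on 2
forvalues r = 1/1000 {
    summarize mpg, meanonly
    local b = r(max)
}
timer off 2

timer list
```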

Some of these points were mentioned
in the recently posted -stylerules-
package on SSC:

Use -summarize, meanonly- for speed when its returned results are
sufficient.

Avoid -egen- within programs: it is usually slower than a direct
attack.

Never use a variable to hold a constant: a macro or a scalar is all
that is needed.
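
As a small illustration of these rules taken together, here is a
sketch that computes deviations from the mean without -egen- and
without a variable holding the mean. The auto data and the variable
name mpg_dev are assumptions made for the example.

```stata
* Sketch only: dataset and the name mpg_dev are illustrative.
sysuse auto, clear

* Not this: a whole variable holding one number in every observation
*   egen mean_mpg = mean(mpg)
*   generate mpg_dev = mpg - mean_mpg

* Instead: the mean is a single number, so use the returned result
summarize mpg, meanonly
generate mpg_dev = mpg - r(mean)
```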

Nick
n.j.cox@durham.ac.uk

P.S. On the last rule, I just found an exception. For
a graphical purpose, I need a variable which is a constant.
The variable defines a horizontal line, on which I show
the information from another variable, something like this

. gen bar = 0

. gra foo bar bazz, sy(o[anothervariable])

That's the trouble with style "rules": style is
a subject on which there are exceptions to every
rule you can think of, even this one.
