Stata The Stata listserver

st: RE: Extensions to: Creating variables recording properties of the other members of a group


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Extensions to: Creating variables recording properties of the other members of a group
Date   Thu, 29 Aug 2002 12:50:09 +0100

Guillermo Cruces
>
> When working with twin household/individual datasets, this
> is one of the most
> useful FAQs:
> http://www.stata.com/support/faqs/data/members.html
> However, there are a few issues I couldn't solve with the
> information included
> there, or not efficiently at least. I would like to solve
> my problem and, if
> worthwhile,  write an extension of the FAQ. The problem
> refers to the fact that
> sometimes you may need to record for one individual the
> properties not of the
> whole group, but of another member of the group in particular.
> In my example, I have a household survey where I don't have
> direct information
> about the number of kids of each individual, but I have
> something like this:
> hhid and member are just the household id and number of
> member. Variables
> fatherm and motherm tell you the number of the member of
> the father and the
> mother, if in the household:
> hhid member fatherm  motherm
> 1    1    -    -
> 1    2    -    -
> 1    3    1    2
> 1    4    1    2
> 1    5    1    2
>
> 2    1    -    -
> 2    2    -    1
> 2    3    -    2
> ...
> Family one is a couple with three kids. Family two is a
> grandma, the daughter,
> and a grandchild.

It seems that member 3 of family 2 could be
a son, on this information.

> I want to create the variable ownkids that gives me the
> number of own kids
> living in the house:
> hhid member ownkids
> 1    1    3
> 1    2    3
> 1    3    0
> 1    4    0
> 1    5    0
>
> 2    1    1
> 2    2    1
> 2    3    0
>
> My brute-force solution, which makes a lot of unnecessary
> comparisons and takes
> very long (because I generate and drop many variables), is
> of the form below, with
> maxmem being the number of members of each household (group
> i; max is the number
> of groups):
> forvalues i = 1/`max' {
>      qui sum member if group==`i'
>      local maxmem=r(max)
>      forvalues j = 1/`maxmem' {
>      di "-----------Household number `i', number of members: `maxmem'"
>      forvalues k = 1/`maxmem' {
>           di "Household `i', member `j', comparing with `k'"
>           qui gen a=motherm==`j' if member==`k'&group==`i'
>           qui egen b=max(a)
>           qui replace mkids=mkids+b if member==`j'&group==`i'
>           drop a b
>           qui gen a=fatherm==`j' if member==`k'&group==`i'
>           qui egen b=max(a)
>           qui replace fkids=fkids+b if member==`j'&group==`i'
>           drop a b
>           }
>      }
> }
>
> This creates two variables, mkids and fkids, which are the
> number of kids for
> mothers and fathers. For each member of the household, I
> compare if . The egen,
> replace, drop, takes very long, and even longer if the
> dataset in memory is
> large (I had to partition the dataset in 25 parts to make
> this run faster).
> The main problem (the main awkwardness in this program) is
> that I gen, egen,
> etc. because I could not just create a scalar that reflects
> the value of a
> variable for one precise observation, something of the form
> (which of course
> doesn't work):
> local a=mother==`j'    if member==`k'&group==`i' (meaning:
> mother etc. should
> refer to the observation: member==`k'&group==`i')
> I couldn't use something like motherm[_...] because I was not
> using by: ... .
> What I would like to know is whether there are more efficient ways
> of doing this (I'm
> sure there are!).

I'm pleased that this FAQ is useful.

Marcela Perticara posted one solution. Here is another.
It is rather heavy on Stataish techniques, so the
explanation will be much longer than the code.

I will take the "-" to be numeric missings.

. l

           hhid      member     fatherm     motherm
  1.          1           1           .           .
  2.          1           2           .           .
  3.          1           3           1           2
  4.          1           4           1           2
  5.          1           5           1           2
  6.          2           1           .           .
  7.          2           2           .           1
  8.          2           3           .           2

We first find the number of kids belonging to each father

. bysort hhid fatherm : gen fkids = _N if fatherm < .
(5 missing values generated)

and similarly to each mother

. bysort hhid motherm : gen mkids = _N if motherm < .
(3 missing values generated)
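
The two counting steps can be reproduced on the example data as follows. This is a minimal sketch: the -input- block simply retypes the listing above, and the final -sort- and -list- are added only to make the result easy to inspect.

```stata
clear
input hhid member fatherm motherm
1 1 . .
1 2 . .
1 3 1 2
1 4 1 2
1 5 1 2
2 1 . .
2 2 . 1
2 3 . 2
end

* under -by:-, _N is the number of observations in each group,
* here each combination of household and father (or mother)
bysort hhid fatherm : gen fkids = _N if fatherm < .
bysort hhid motherm : gen mkids = _N if motherm < .

* the counts sit on the children's observations, not yet on the parents'
sort hhid member
list, sepby(hhid)
```

Note that at this stage each count is attached to the children who share a parent; moving it to the parent's own observation is what the rest of the code does.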

Now we initialise the variable to be produced

. gen ownkids = 0

I will loop over the values of -member- within
each family. We can see in the example that these
range from 1 to 5, but more generally we can pick up the maximum
from -summarize-:

. su member, meanonly
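
As a small illustration of what -summarize, meanonly- leaves behind, here is a sketch on the example's -hhid- and -member- values. The local macro name maxmem is borrowed from Guillermo's code and is used here only for illustration.

```stata
clear
input hhid member
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
end

* -meanonly- suppresses the output but still leaves r(max) behind
summarize member, meanonly

* copy r(max) into a local macro at once: later r-class
* commands would overwrite the saved results
local maxmem = r(max)
display "largest member number: `maxmem'"

forval i = 1/`maxmem' {
	display "pass for member `i'"
}
```

In the loop that follows, `r(max)' is substituted once, when the -forval- line is read, so commands inside the loop cannot disturb the upper limit.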

Now my main loop, which I will give first and then unpack:

forval i = 1 / `r(max)' { /* mailer bug protection */
	gen ischild = (fatherm == `i') | (motherm == `i')
	gen isnotmember = member != `i'

	* next four lines are all one command

	qui bysort hhid (ischild isnotmember) :
	replace ownkids =
	cond(motherm[_N] == `i', mkids[_N], fkids[_N])
	if _n == 1 & ischild[_N]

	drop ischild isnotmember
}

As we go round the -forval- loop the local macro i is varied
from 1 to the maximum observed -member-.

So we start the loop with member 1.

Members of the family are children of this member
if this member is their father or this member is their
mother. Stata substitutes 1 for `i':

	gen ischild = (fatherm == 1) | (motherm == 1)

This indicator variable will be 0 (is not a child of 1) or
1 (is a child of 1). -gen byte ischild ...- would be
better technique.

Within each household, we are going to sort
on this variable, so that all the children of
1 come at the end of each household. Then we
can pick up the number of children from the
other variables in the last observation,
subject to conditions to be mentioned in a moment.

If the children of member 1 (if any) go at the end,
then member 1 must go at the beginning of the
household group to allow things to be done
systematically. For this we need an inverted
indicator variable

	gen isnotmember = member != 1

which will be 0 (this person is member 1, so the
condition that -member- is not equal to 1 is false)
or 1 (this person is not member 1, so the condition
is true).

	qui bysort hhid (ischild isnotmember) :
	replace ownkids =
	cond(motherm[_N] == `i', mkids[_N], fkids[_N])
	if _n == 1 & ischild[_N]

This is a lot in one statement, and is best taken in pieces:

1. qui bysort hhid (ischild isnotmember):

	We are going to do a -replace- separately by
	households. Within each household, we sort first
	so that the children of this member go to the
	end of the household (whenever there are any)
	and this member goes to the beginning. This
	could equally be ... (isnotmember ischild)
	as a member of a household cannot be his
	or her own child. As always, sorting puts
	lowest first, so all values of 0 come before
	all values of 1 for indicator variables.
	Also, we do all this -quietly-, although
	that's not essential.

2. replace ownkids = ... if _n == 1 & ischild[_N]

	We are going to replace -ownkids- but only in
	the first observation in each household --
	which after our sort contains the -member- we
	are focusing on -- and only if the last
	person in the household is a child of this
	-member-. What is crucial here is that under
	-by <varlist>:- _n and _N are interpreted within
	each group defined by the <varlist>. So
      -if _n == 1-
	selects the first member in each household
	and -ischild[_N]- is for the last member of
	each household. (-ischild[_N]- is a shortcut
	for -ischild[_N] == 1- as they always
	evaluate to the same result.)

3. What are we going to replace -ownkids- with?

	The condition -ischild[_N]- ensures that
	we will only -replace- values when the last
	observation in each household is for a child
	of the person in the first observation. If
	that person is a mother, we use the value
	for -mkids-; if not, we use the value for
	-fkids-:

	cond(motherm[_N] == `i', mkids[_N], fkids[_N])
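
The sort order and the meaning of _n and _N under -by:- can be inspected directly for member 1. In this sketch the variables pos, size and lastischild are hypothetical helpers introduced only for display; they are not part of the solution.

```stata
clear
input hhid member fatherm motherm
1 1 . .
1 2 . .
1 3 1 2
1 4 1 2
1 5 1 2
2 1 . .
2 2 . 1
2 3 . 2
end

local i = 1
gen byte ischild     = (fatherm == `i') | (motherm == `i')
gen byte isnotmember = member != `i'

* within each household: member `i' first, his or her children last
bysort hhid (ischild isnotmember) : gen pos = _n
by hhid : gen size = _N
by hhid : gen byte lastischild = ischild[_N]

list hhid member ischild isnotmember pos size lastischild, sepby(hhid)
```

The observation with pos equal to 1 is member 1 in each household, and lastischild shows what -ischild[_N]- evaluates to for every observation of that household.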

Finally in the loop we -drop- the indicators:

	drop ischild isnotmember

Another way to work with indicators would be to initialise the
indicators outside the -forval- and to -replace- them
each time round the loop.
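
A sketch of that alternative, on the example data, might look like this. It assumes a Stata version that accepts /// for line continuation in do-files, and it also stores the indicators as bytes. The logic is otherwise the same as the loop above.

```stata
clear
input hhid member fatherm motherm
1 1 . .
1 2 . .
1 3 1 2
1 4 1 2
1 5 1 2
2 1 . .
2 2 . 1
2 3 . 2
end

bysort hhid fatherm : gen fkids = _N if fatherm < .
bysort hhid motherm : gen mkids = _N if motherm < .
gen ownkids = 0

* indicators created once, then overwritten on each pass
gen byte ischild = 0
gen byte isnotmember = 0

su member, meanonly
forval i = 1/`r(max)' {
	qui replace ischild = (fatherm == `i') | (motherm == `i')
	qui replace isnotmember = member != `i'
	qui bysort hhid (ischild isnotmember) :             ///
		replace ownkids =                                ///
		cond(motherm[_N] == `i', mkids[_N], fkids[_N])   ///
		if _n == 1 & ischild[_N]
}

drop ischild isnotmember
sort hhid member
list hhid member ownkids, sepby(hhid)
```

This saves a -gen- and a -drop- on every pass; whether that matters in practice depends on the size of the dataset.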

We went through the operations for member 1; -forval- repeats
them for the other members.

. sort hhid member

. l

           hhid      member     fatherm     motherm      fkids      mkids    ownkids
  1.          1           1           .           .          .          .          3
  2.          1           2           .           .          .          .          3
  3.          1           3           1           2          3          3          0
  4.          1           4           1           2          3          3          0
  5.          1           5           1           2          3          3          0
  6.          2           1           .           .          .          .          1
  7.          2           2           .           1          .          1          1
  8.          2           3           .           2          .          1          0

I'll summarize in terms of Stata techniques.

* We are better off doing things -by hhid:- than
looping over households (and even better off than
looping over observations). As Guillermo observes,
that can be very slow. With -by:- one command
works on every household at once.

* I couldn't see a way of doing this without
a loop over the members of the household.
Can anyone?

* A loop often means initialising a variable outside
the loop and then doing some -replace-s
each time round the loop.

* What is generic for this kind of problem is
the need to look at other observations within
the same household. That is often best done
systematically by sorting so the observations
with the needed values are at the end of
each household, or at the beginning of
each household, or both. Then subscripting
with [_N] or [1] picks up the values we want.

* Indicator variables (direct or inverted)
help in getting the right sort order.
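
On the question of avoiding the loop over members, one possible route is to count children per parent and then match those counts back to the parents by member number. The sketch below is an illustration under assumptions, not part of the solution above: it relies on -contract- and on the -merge 1:1- syntax, which needs Stata 11 or later, and the tempfile names are arbitrary. It is worked through only for the example data.

```stata
clear
input hhid member fatherm motherm
1 1 . .
1 2 . .
1 3 1 2
1 4 1 2
1 5 1 2
2 1 . .
2 2 . 1
2 3 . 2
end

tempfile people fathers mothers
save `people'

* one observation per (household, father), with the number of children,
* keyed by the father's own member number
drop if missing(fatherm)
contract hhid fatherm, freq(fkids)
rename fatherm member
save `fathers'

* the same for mothers
use `people', clear
drop if missing(motherm)
contract hhid motherm, freq(mkids)
rename motherm member
save `mothers'

* attach the counts to the parents' own observations
use `people', clear
merge 1:1 hhid member using `fathers', keep(master match) nogenerate
merge 1:1 hhid member using `mothers', keep(master match) nogenerate

gen ownkids = cond(missing(fkids), 0, fkids) + cond(missing(mkids), 0, mkids)
sort hhid member
list hhid member ownkids, sepby(hhid)
```

Unlike the loop, this leaves -fkids- and -mkids- on the parents' observations rather than the children's, so the intermediate variables are not comparable between the two approaches even though -ownkids- agrees on the example.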

I haven't tested on anything other than the
example given.

Here is the code in one chunk

bysort hhid fatherm : gen fkids = _N if fatherm < .
bysort hhid motherm : gen mkids = _N if motherm < .
gen ownkids = 0
su member, meanonly

forval i = 1 / `r(max)' { /* mailer bug protection */
	gen ischild = (fatherm == `i') | (motherm == `i')
	gen isnotmember = member != `i'
	#delimit ;
	bysort hhid (ischild isnotmember) :
	replace ownkids =
	cond(motherm[_N] == `i', mkids[_N], fkids[_N])
	if _n == 1 & ischild[_N] ;
	#delimit cr
	drop ischild isnotmember
}

Nick
n.j.cox@durham.ac.uk



