
Re: st: bsample

From   Jeff Pitblado, Stata Corp.
Subject   Re: st: bsample
Date   Wed, 25 Jun 2003 12:21:45 -0500

"uctpmtd" asks about bootstrapping -clogit- results, using the
-cluster()- option of -bootstrap-:

> I am using Stata 7.
> Given that -clogit- doesn't have the option of clustered standard errors, I
> performed bootstrap to correct them.
> This is the code:
> #delimit;
> set more 1;
> set matsize 800;
> set seed 1;
> bs "clogit choice private public time cost distpri distpub incpri incpub, 
>  group(id)" "_b[time] _b[cost] _b[distpri] _b[distpub] _b[incpri] _b[incpub]", 
>  cluster(area) reps(0) saving(bsclog) replace;
> I got my results, but it took Stata 2 hours to compute the se with 0
> replications, and 6 hours with 200 replications. -clogit- on the same data
> (456399 obs when arranged in the long format) takes 1mn to run and I'm
> running it on the University network. When I included the controls, it took
> Stata 1 month to compute the se, for a model it takes 13mn to run with
> clogit!!!!!
> I went at that point to try and check what was making it so slow, and decided to
> draw the random sample manually, then do -clogit-. 
> This is the code for 1 draw only: (I'm planning to do a loop for the number of
> replications required once I solve the problem below)
> set seed 1
> set matsize 800
> set more 1
> bsample, cluster(area)
> clogit choice private public time cost distpri distpub incpri incpub, group(id)
> I got mixed results in the sense that the speed at which I obtained the
> results was as expected, but -bsample- is mixing up the data as I should have
> 1:2 matching (McFadden choice model), and I get 4:8.
> So summing-up, my problem with -bsample- is how to incorporate the id so I
> could have the appropriate matching.

Bootstrapping -clogit- (in the absence of clusters)

Let's begin by discussing how to bootstrap results from -clogit-; we'll talk
about -clogit- with clustered groups later.

The -clogit- command requires grouped data.  Thus, when bootstrapping the
results from -clogit-, you need to sample the groups (each group as a whole)
instead of the observations.  That is, each group is itself a cluster of
information, so use the -cluster()- option of -bootstrap- to sample the
groups.

It is usually the case that we need to specify the "cluster" variable in the
estimation command.  For -clogit-, we identify this variable in the -group()-
option.  Remember, this variable identifies the groups we are sampling with
replacement, thus each group that is sampled more than once must have a unique
identifier.  That is, if the group with "id==1" is sampled twice, the repeat
group must have a different identifying value than the original.  This is
accomplished using the -idcluster()- option.

Here we bootstrap the results from the first example in [R] clogit.

***** BEGIN:
version 7
gen myid = id
bs "clogit y x1 x2, group(myid)" "_b[x1] _b[x2]", cluster(id) idclust(myid) /*
	*/ dot
***** END:

Notice that -bs- will produce cluster samples using the -id- variable, but
will call -clogit- using the -myid- variable to identify the groups.  -myid-
contains unique values for each sampled group.
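
For a single draw done by hand with -bsample-, as in the original question,
the same idea can be sketched as follows.  This is only a sketch: it assumes a
release of Stata in which -bsample- accepts the -idcluster()- option (whether
Stata 7 does is not confirmed here), and -newid- is a hypothetical variable
name chosen for illustration.

***** BEGIN:
* one bootstrap draw of whole groups
* (assumes -bsample- supports the idcluster() option)
set seed 1
preserve

* resample the groups in -id-; -newid- (hypothetical name) gives every
* sampled copy of a group its own identifying value
bsample, cluster(id) idcluster(newid)

* fit on the resampled data, grouping on the relabeled copies
clogit y x1 x2, group(newid)
restore
***** END:

Because -newid- differs between copies of the same original group, a group
drawn twice enters -clogit- as two separate matched sets rather than as one
set of twice the size.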

Clustered -clogit-

"uctpmtd" has a slightly more complicated situation.  There are clusters of
groups, so we need to sample the clusters with replacement, but still uniquely
identify the sampled groups.  The -bs- command cannot handle this without a
little help from the user.

If -bs- were to supply me with the -group()- and -idcluster()- variables, I
could generate a new group variable that uniquely identified the sampled
groups (across the clusters), then run the -clogit- command with the new group
variable.  The following details how I accomplished this.

Using the data from the above example, I artificially create a cluster
variable -clust- that assigns the groups, in rotation, to 5 clusters.

***** BEGIN:
version 7
set seed 1234

* generate a cluster variable
sort id
by id: gen clust = _n==1
replace clust = 1+mod(sum(clust),5)
***** END:

In order to ensure that -clogit- gets uniquely identified groups, while
sampling the clusters with replacement, I wrote a short program and placed it
in an ado-file: myclogit.ado (listed below).

-myclogit- is a wrapper for -clogit-.  Its purpose is to generate a new
group variable from the original group variable and the -idcluster()-
variable.  The variable list and any other options are passed through to
-clogit- unchanged.

***** BEGIN: myclogit.ado
program define myclogit
	version 7
	syntax varlist , group(varname) idcluster(varname) [ * ]

	/* preserve original order within -group()- */
	tempvar newgroup order
	gen `order' = _n

	/* generate a new group id variable */
	sort `idcluster' `group' `order'
	by `idcluster' `group' : gen `newgroup' = _n==1
	replace `newgroup' = sum(`newgroup')

	clogit `varlist' , group(`newgroup') `options'
end
***** END:   myclogit.ado

With -myclogit- I can now use -bs- to bootstrap the standard errors of the
coefficients, while accounting for clustering of groups.

***** BEGIN:
gen myclust = clust
bs "myclogit y x1 x2, group(id) idcluster(myclust)" "_b[x1] _b[x2]", /*
	*/ cluster(clust) idclust(myclust) dot
***** END:


Remember that when a group has multiple positive outcomes, -clogit- must
account for all possible combinations of those outcomes within the group.  In
"uctpmtd"'s first attempt to bootstrap results from -clogit-, every copy of a
group that was sampled more than once kept the same -id-, so the copies were
pooled into one larger group (as in the 4:8 matching seen with -bsample-), and
-clogit- had that much more work to do.  "uctpmtd" should not experience this
if -myclogit- is used as described above.
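
The hand-written loop mentioned in the original question could be sketched
along the following lines for the clustered case.  This is an illustration
under stated assumptions, not a tested replacement for -bs-: it assumes a
release of Stata in which -bsample- accepts -idcluster()-, it uses -egen- to
build the combined group identifier in place of -myclogit-, and the names
-newclust-, -newgroup-, -b_x1-, -b_x2-, and the results file -bsclog- are
chosen only for illustration.

***** BEGIN:
* manual bootstrap: resample clusters, relabel groups, collect coefficients
* (assumes -bsample- supports idcluster(); all new names are hypothetical)
set seed 1234
tempname sim
postfile `sim' b_x1 b_x2 using bsclog, replace

forvalues r = 1/200 {
	preserve

	* sample whole clusters; each sampled copy gets its own -newclust- value
	bsample, cluster(clust) idcluster(newclust)

	* a group is unique only within a sampled copy of its cluster
	egen newgroup = group(newclust id)

	quietly clogit y x1 x2, group(newgroup)
	post `sim' (_b[x1]) (_b[x2])

	restore
}
postclose `sim'

* note: this replaces the data in memory with the saved replicates;
* their standard deviations are the bootstrap standard errors
use bsclog, clear
summarize b_x1 b_x2
***** END:

Each pass restores the original data before the next draw, so every replicate
is sampled from the full set of clusters.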
