Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

RE: Re: st: Correcting for self selection

Subject   RE: Re: st: Correcting for self selection
Date   Sat, 29 Jan 2011 19:21:45 +0100

Thank you so much Maarten and Clyde for your input and help.

The panel data structure is unbalanced. Some, but not all, organizations enter the contest multiple times.
Each record represents one entry into the competition. Some organizations may enter more than one project in a given month.
Each month is a new competition, and one winner is awarded each month.
The variable win means "wins this time". The distribution of wins is very skewed, with some organizations winning far more often than most others.

The main explanatory variables are calculated from organizational characteristics (number of contest wins and network centrality) in the previous year.
I lag these characteristics by one year in an attempt to reduce reverse causality problems.
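
One possible way to build such one-year lags on entry-level data is sketched below. It assumes hypothetical variables org_id, year, wins_year, and centrality (none of these names appear in the original post), and that the two characteristics are constant within an organization-year.

```stata
* Sketch only: org_id, year, wins_year, centrality are hypothetical names.
* Build an organization-year file, shift it forward one year, merge it back.
preserve
keep org_id year wins_year centrality
duplicates drop org_id year, force      // one row per organization-year
replace year = year + 1                 // year t values now sit on year t+1
rename wins_year lag_wins
rename centrality lag_centrality
tempfile lagged
save `lagged'
restore

* Entries with no prior-year record keep missing lagged values
merge m:1 org_id year using `lagged', keep(master match) nogenerate
```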

I have calculated a variable, "no_entries", that counts the number of entries per organization per month.
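
A minimal sketch of how such a count and its log (used as the offset in the models in this message) could be computed, assuming one row per entry and hypothetical identifier variables org_id and month:

```stata
* Sketch only: assumes one row per entry; org_id and month are assumed names.
* _N within each org_id-month group is the number of entries in that group.
bysort org_id month: generate no_entries = _N

* Log of the count, for use in offset()
generate ln_no_entries = ln(no_entries)
```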

Since I have data on the entire population for the contest, and the observations are neither random nor independent, I am contemplating a population-averaged model (GEE).

The model as of now looks like this:

xtlogit depvar indep vars, i(org_id) offset(ln_no_entries) pa vce(robust)

or alternatively:

xtgee depvar indep vars, i(org_id) offset(ln_no_entries) link(logit) family(binomial) vce(robust)

The endogeneity mentioned might be present, e.g. organizations that win may become more central over time.

Thanks again. Any and all input is very much appreciated.

All the best,


-----Forwarded by Erik Aadland/people/BISTIFT on 01/29/2011 06:57PM -----

From: "Clyde Schechter"
Date: 01/29/2011 06:31PM
Subject: Re: st: Correcting for self selection

Erik does not provide the details of his modeling, but I'm inferring from
what he wrote that he is trying to do something like:

xtlogit win indvars

and is concerned that the organizations with the highest scores on the
indvars tend to be those who participate in the contests most often.

If the outcome variable, win, means "ever wins" and there is just one
record per competitor summarizing its total participation history, then
there is a need to enter the number of attempts into the model.  -xtlogit-
doesn't offer a simple way to do that.  If it were not panel data, -glm-
with -link(logit) family(binomial n_attempts)- options would do it. 
Closest to that for panel data would be -xtgee- with the same
options--though it uses a population averaged estimator, which may not
handle things exactly as Erik wants.
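
As a rough illustration of the two calls described above, assuming hypothetical variables n_wins (number of wins), n_attempts (number of entries), and x1 x2 standing in for the indvars (none of these names come from the original post):

```stata
* Sketch only: n_wins, n_attempts, x1, x2, org_id are hypothetical names.

* Non-panel case: one record per competitor, n_wins successes out of
* n_attempts trials, binomial family with a logit link
glm n_wins x1 x2, family(binomial n_attempts) link(logit)

* Panel case: one record per organization per period, same family and link,
* population-averaged estimator with robust standard errors
xtgee n_wins x1 x2, i(org_id) family(binomial n_attempts) link(logit) vce(robust)
```

With a binomial denominator the dependent variable is the count of successes, so an "ever wins" 0/1 indicator would have to be replaced by the number of wins for this specification to apply.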

But I'm wondering in what sense this is panel data.  If it is really panel
data, I'm expecting that there are multiple records per competitor, each
representing one entry into the competition (with the variable win meaning
"wins this time").  In that case, this aspect of frequency of participation
is automatically accounted for without any special treatment.

A different issue altogether is the possibility that participation in the
competitions is itself affected by the values of depvars, a kind of
endogeneity.  There is no truly simple fix for that problem and one might
need to resort to something like structural equations modeling of the
entire system, or other complicated approaches.  I don't know enough to
really be more specific about this.

Clyde Schechter, MA MD
Associate Professor of Family & Social Medicine
