Re: st: GMM minimization of regional errors imputed from hhd level model

From   Vladimír Hlásny <>
Subject   Re: st: GMM minimization of regional errors imputed from hhd level model
Date   Mon, 1 Jul 2013 13:22:02 +0900

(Also accessible on my page .
Unfortunately I don't know of a similar public dataset.)

* Data preparation: one row per responding household, sorted within region
use data, clear
sort region hhcode
by region: gen oneiffirst=_n
by region: egen surveyedhh_psu=max(oneiffirst)
gen sampleweight = response/surveyedhh_psu
* Flag the first household in each region; only its residual is kept
replace oneiffirst=0 if oneiffirst~=1
gen double Winverse_sqrt = sqrt(weight)/sqrt(population)

* Function evaluator for -gmm-: region-level residual from a hhd-level logit
program gmm_nonresp
    version 12
    syntax varlist if, at(name) mylhs(varlist) myrhs(varlist) myidvar(varlist)
    quietly {
        tempvar explinspec pophat
        * Linear index: constant plus coefficients on the myrhs() variables
        gen double `explinspec' = `at'[1,1] `if'
        local j=2
        foreach var of varlist `myrhs' {
            replace `explinspec' = `explinspec' + `var'*`at'[1,`j'] `if'
            local j = `j' + 1
        }
        replace `explinspec' = exp(`explinspec')
        * Predicted regional population: sum of inverse response probabilities
        egen double `pophat' = sum(sampleweight*(1+`explinspec')/`explinspec') ///
            `if', by(`myidvar')
        replace `varlist' = (`mylhs' - `pophat')*Winverse_sqrt*oneiffirst `if'
    }
end

gmm gmm_nonresp, mylhs(population) myrhs(logincome) myidvar(region) ///
    nequations(1) parameters(theta1 theta2) instruments(logincome) ///
    from(theta1 10 theta2 -1)

(It's possible that I should declare my instruments differently in the
GMM command. But that itself will not solve my bigger problem.)
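
A minimal sketch of the optimize() route suggested in the quoted replies
below, offered as an illustration rather than a tested solution. It
minimizes region-level squared residuals directly, so no household-level
residuals need to be zeroed out. Assumptions: the data are sorted by region
as above; nonresp_obj is a hypothetical evaluator name; each squared
residual is scaled by 1/population purely as a placeholder for the
Winverse_sqrt weighting; and the starting values are copied from from()
above.

```stata
mata:
// d0 evaluator (hypothetical name): returns in v the sum over regions of
// squared (actual - predicted) population, scaled by 1/population.
void nonresp_obj(todo, theta, x, w, pop, info, v, g, H)
{
    real scalar    j, r
    real colvector e, mass

    e    = exp(theta[1] :+ theta[2] :* x)   // household-level exp(index)
    mass = w :* (1 :+ e) :/ e               // weight times 1/P_i for a logit P_i
    v    = 0
    for (j = 1; j <= rows(info); j++) {
        // population is constant within region, so take its first row
        r = pop[info[j, 1]] - sum(panelsubmatrix(mass, j, info))
        v = v + r^2 / pop[info[j, 1]]
    }
}

// Household-level data; must already be sorted by region
x    = st_data(., "logincome")
w    = st_data(., "sampleweight")
pop  = st_data(., "population")
info = panelsetup(st_data(., "region"), 1)  // first and last row of each region

S = optimize_init()
optimize_init_evaluator(S, &nonresp_obj())
optimize_init_evaluatortype(S, "d0")
optimize_init_which(S, "min")
optimize_init_params(S, (10, -1))           // same starting values as from()
optimize_init_argument(S, 1, x)
optimize_init_argument(S, 2, w)
optimize_init_argument(S, 3, pop)
optimize_init_argument(S, 4, info)
theta = optimize(S)
theta
end
```

Because the objective is built from one term per region, the number of
region-level observations is explicit here, whereas the -gmm- version
above relies on the oneiffirst indicator to achieve the same effect.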

On Mon, Jul 1, 2013 at 12:37 PM, Austin Nichols <> wrote:
> Vladimír Hlásny <>:
> I can't see that in your code:
>   , myrhs(x1) instruments(x1)
> and myrhs gets multiplied by theta2, so it must be at the individual level.
> Perhaps you should follow the usual advice, and illustrate your
> problem using a publicly available dataset.
> On Sun, Jun 30, 2013 at 11:18 PM, Vladimír Hlásny <> wrote:
>> Dear Austin:
>> Thanks for the link to optimize(). I will check whether that could
>> solve my 'region-level minimization' vs. 'household-level model'
>> problem.
>> Regarding your point:
>> What you call 'x1' is a function of all incomes in a region, not
>> income of a single household.
>> Vladimir
>> On Mon, Jul 1, 2013 at 11:10 AM, Austin Nichols <> wrote:
>>> Vladimír Hlásny <>,
>>> If you're not familiar with optimize(), start with the help file. Or
>>> just follow the link I sent.
>>> You don't seem to take my point about your trick; if you put all the
>>> weight of optimization on one residual per group, and -gmm- is trying
>>> to make that one residual orthogonal to an instrument x1=income, but
>>> you (unluckily) have x1=0 in each of those cases, then how could -gmm-
>>> possibly improve on residual times zero, equals zero? An unlucky case,
>>> but possible, given your syntax, I think.
>>> On Sun, Jun 30, 2013 at 10:02 PM, Vladimír Hlásny <> wrote:
>>>> Dear Austin:
>>>> I am computing the "one-per-region residuals" as the difference
>>>> between regional actual population and predicted population (sum of
>>>> household-inverse-probabilities). So my trick doesn't depend on luck -
>>>> the residuals contain information on all households within a region.
>>>> In the code that I pasted in my original email, notice the summation
>>>> across households:
>>>> egen double `pophat' = sum( (1+exp(b0+income*b1)) / exp(b0+income*b1))
>>>> `if', by(`region')
>>>> replace residual = (pop - `pophat') * oneiffirst
>>>> The 'oneiffirst' is a binary indicator for one residual per region, my
>>>> trick. By using that, I ensure that only one region-level residual is
>>>> considered per region. Instead, I would have liked to use an 'if'
>>>> statement (such as 'if oneiffirst'), so that Stata would know that
>>>> there are only 2500 (region-level) observations. But Stata doesn't
>>>> allow it. Is there another way to essentially restrict the sample
>>>> inside of the function evaluator program - the sample in which the
>>>> moments are evaluated - after GMM is called in a hhd-level dataset?
>>>> I am not familiar with 'optimize()'. Will that let me declare samples
>>>> so that I estimate a region-level regression in which moments are
>>>> computed from a hhd-level equation?
>>>> Thank you.
>>>> Vladimir
>>>> On Mon, Jul 1, 2013 at 1:17 AM, Austin Nichols <> wrote:
>>>>> Vladimír Hlásny <>:
>>>>> My question is: why try to trick -gmm- into doing an optimization it's
>>>>> not designed for? You are trying to make the first residual within
>>>>> group orthogonal to income; what if you got unlucky and the first case
>>>>> in each group had zero income--hard to see how you could improve the
>>>>> objective function, right?
>>>>> Instead start with Mata's optimize() which can be used to roll your
>>>>> own GMM and much else besides: see e.g.
>>>>> On Sat, Jun 29, 2013 at 10:10 PM, Vladimír Hlásny <> wrote:
>>>>>> Dear Austin:
>>>>>> The model is definitely identified. Matlab runs the model well,
>>>>>> because I can use household-level and region-level variables
>>>>>> simultaneously. My trick in Stata also works, except that it produces
>>>>>> imprecise results and occasionally fails to converge. (My current
>>>>>> trick is to make Stata think that the model is at the household level,
>>>>>> and manually setting all-but-one-per-region hhd-level residuals to
>>>>>> zero.)
>>>>>> Incomes of the responding households are my instrument.
>>>>>> Essentially, because each region has a different survey-response-rate
>>>>>> and different distribution of incomes of responding households, GMM
>>>>>> estimates the relationship between households' response-probability
>>>>>> and their income (subject to assumptions on representativeness of
>>>>>> responding households).
>>>>>> In sum:
>>>>>> I need Stata to use region-level and household-level variables (or
>>>>>> matrices) simultaneously. Specifically, Stata must minimize
>>>>>> region-level residuals computed from a household-level logistic
>>>>>> equation. E.g., if I feed household-level data into the GMM
>>>>>> function-evaluator program, can I instruct the GMM to use only one
>>>>>> residual per region?
>>>>>> Vladimir
>>>>>> On Sat, Jun 29, 2013 at 10:27 PM, Austin Nichols
>>>>>> <> wrote:
>>>>>>> Vladimír Hlásny <>:
>>>>>>> I have not read the ref.  But you do not really have instruments. That
>>>>>>> is, you are not setting E(Ze) to zero with e a residual from some
>>>>>>> equation and Z your instrument; you do not have moments of that type.
>>>>>>> Seems you should start with optimize() instead of -gmm-, as you are
>>>>>>> just minimizing the sum of squared deviations from targets at the
>>>>>>> region level. Or am I still misunderstanding this exercise?
>>>>>>> On Fri, Jun 28, 2013 at 10:08 PM, Vladimír Hlásny <> wrote:
>>>>>>>> Thanks for responding, Austin.
>>>>>>>> The full reference is: Korinek, Mistiaen and Ravallion (2007), An
>>>>>>>> econometric method of correcting for unit nonresponse bias in surveys,
>>>>>>>> J. of Econometrics 136.
>>>>>>>> My sample includes 12000 responding households. I know their income,
>>>>>>>> and which of 2500 regions they come from. In addition, for each
>>>>>>>> region, I know the number of non-responding households. I find the
>>>>>>>> coefficient on income by fitting estimated regional population to
>>>>>>>> actual population:
>>>>>>>> P_i = logit f(income_i,theta)
>>>>>>>> actual_j = responding_j + nonresponding_j
>>>>>>>> theta = argmin {sum(1/P_i) - actual_j}
>>>>>>>> Response probability may not be monotonic in income. The logit may be
>>>>>>>> a non-monotonic function of income.
>>>>>>>> Thanks for any thoughts on how to estimate this in Stata, or how to
>>>>>>>> make my 'trick' (setting 12000-2500 hhd-level residuals manually to
>>>>>>>> zero) work better.
>>>>>>>> Vladimir
>>>>>>>> On Sat, Jun 29, 2013 at 1:49 AM, Austin Nichols <> wrote:
>>>>>>>>> Vladimír Hlásny <>:
>>>>>>>>> As the FAQ hints, if you don't provide full references, don't expect
>>>>>>>>> good answers.
>>>>>>>>> I don't understand your description--how are you running a logit of
>>>>>>>>> response on income, when you only have income for responders?  Can you
>>>>>>>>> give a sense of what the data looks like?
>>>>>>>>> On another topic, why would anyone expect response probability to be
>>>>>>>>> monotonic in income?
>>>>>>>>> On Fri, Jun 28, 2013 at 10:05 AM, Vladimír Hlásny <> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> I am using a method by Korinek, Mistiaen and Ravallion (2007) to
>>>>>>>>>> correct for unit-nonresponse bias. That involves estimating
>>>>>>>>>> response-probability for each household,  inferring regional
>>>>>>>>>> population from these probabilities, and fitting against actual
>>>>>>>>>> regional populations. I must use household-level data and region-level
>>>>>>>>>> data simultaneously, because coefficients in the household-level model
>>>>>>>>>> are adjusted based on fit of the regional-level populations.
>>>>>>>>>> I used a trick - manually resetting to zero the residuals of all but
>>>>>>>>>> one household per region - but this trick doesn't produce perfect
>>>>>>>>>> results. Please find the details, remaining problems, as well as the
>>>>>>>>>> Stata code described below. Any thoughts on this?
>>>>>>>>>> Thank you for any suggestions!
>>>>>>>>>> Vladimir Hlasny
>>>>>>>>>> Ewha Womans University
>>>>>>>>>> Seoul, Korea
>>>>>>>>>> Details:
>>>>>>>>>> I am estimating households' probability to respond to a survey as a
>>>>>>>>>> function of their income. For each responding household (12000), I
>>>>>>>>>> have data on income. Also, at the level of region (3000), I know the
>>>>>>>>>> number of responding and non-responding households.
>>>>>>>>>> I declare a logit equation of response-probability as a function of
>>>>>>>>>> income, to estimate it for all responding households.
>>>>>>>>>> The identification is provided by fitting of population in each
>>>>>>>>>> region. For each responding household, I estimate their true mass as
>>>>>>>>>> the inverse of their response probability. Then I sum the inverse
>>>>>>>>>> response-probabilities for all households in a region, and fit the
>>>>>>>>>> sum against the true population.
>>>>>>>>>> Stata problem:
>>>>>>>>>> I am estimating GMM at the regional level. But, to obtain the
>>>>>>>>>> population estimate in each region, I calculate response-probabilities
>>>>>>>>>> at the household level and sum their inverses within each region.
>>>>>>>>>> This region-level fitting and response-probability estimation occur
>>>>>>>>>> simultaneously/iteratively -- as logit-coefficients are adjusted to
>>>>>>>>>> minimize region-level residuals, households' response-probabilities
>>>>>>>>>> change.
