Re: st: GMM minimization of regional errors imputed from hhd level model

From   Vladimír Hlásny <>
Subject   Re: st: GMM minimization of regional errors imputed from hhd level model
Date   Mon, 1 Jul 2013 23:51:24 +0900

I am not sure whether my message was sent in full. Judging by the
Statalist archive, the first line may have been cut.
I've put my dataset, data.dta, on my page.
As I described before, it is household-level data that also includes
the region-level number of respondents and total population.

On Mon, Jul 1, 2013 at 1:22 PM, Vladimír Hlásny <> wrote:

> (Also accessible on my page.
> Unfortunately I don't know of a similar public dataset.)
> use data, clear
> sort region hhcode
> * Number the households within each region; the maximum is the number in the data
> by region: gen oneiffirst=_n
> by region: egen surveyedhh_psu=max(oneiffirst)
> gen sampleweight = response/surveyedhh_psu
> * Flag only the first household in each region (1 = first, 0 = otherwise)
> replace oneiffirst=0 if oneiffirst~=1
> gen double Winverse_sqrt = sqrt(weight)/sqrt(population)
> program gmm_nonresp
>     version 12
>     syntax varlist if, at(name) mylhs(varlist) myrhs(varlist) myidvar(varlist)
>     quietly {
>         tempvar explinspec pophat
>         * Linear index: constant plus coefficient times each right-hand-side variable
>         gen double `explinspec' = `at'[1,1] `if'
>         local j=2
>         foreach var of varlist `myrhs' {
>             replace `explinspec' = `explinspec' + `var'*`at'[1,`j'] `if'
>             local j = `j' + 1
>         }
>         replace `explinspec' = exp(`explinspec')
>         * Predicted regional population: weighted sum of inverse response probabilities
>         egen double `pophat' = sum(sampleweight*(1+`explinspec')/`explinspec') ///
>             `if', by(`myidvar')
>         * The residual is nonzero only for the first household in each region
>         replace `varlist' = (`mylhs' - `pophat')*Winverse_sqrt*oneiffirst `if'
>     }
> end
> gmm gmm_nonresp, mylhs(population) myrhs(logincome) myidvar(region) ///
>     nequations(1) parameters(theta1 theta2) instruments(logincome) ///
>     from(theta1 10 theta2 -1)
> --
> (It's possible that I should declare my instruments differently in the
> GMM command. But that itself will not solve my bigger problem.)
> Vladimir
> On Mon, Jul 1, 2013 at 12:37 PM, Austin Nichols <> wrote:
>> Vladimír Hlásny <>:
>> I can't see that in your code:
>>   , myrhs(x1) instruments(x1)
>> and myrhs gets multiplied by theta2, so it must be at the individual level.
>> Perhaps you should follow the usual advice, and illustrate your
>> problem using a publicly available dataset.
>> On Sun, Jun 30, 2013 at 11:18 PM, Vladimír Hlásny <> wrote:
>>> Dear Austin:
>>> Thanks for the link to optimize(). I will check whether that could
>>> solve my 'region-level minimization' vs. 'household-level model'
>>> problem.
>>> Regarding your point:
>>> What you call 'x1' is a function of all incomes in a region, not
>>> income of a single household.
>>> Vladimir
>>> On Mon, Jul 1, 2013 at 11:10 AM, Austin Nichols <> wrote:
>>>> Vladimír Hlásny <>,
>>>> If you're not familiar with optimize(), start with the help file. Or
>>>> just follow the link I sent.
>>>> You don't seem to take my point about your trick; if you put all the
>>>> weight of optimization on one residual per group, and -gmm- is trying
>>>> to make that one residual orthogonal to an instrument x1=income, but
>>>> you (unluckily) have x1=0 in each of those cases, then how could -gmm-
>>>> possibly improve on residual times zero, equals zero? An unlucky case,
>>>> but possible, given your syntax, I think.
>>>> On Sun, Jun 30, 2013 at 10:02 PM, Vladimír Hlásny <> wrote:
>>>>> Dear Austin:
>>>>> I am computing the "one-per-region residuals" as the difference
>>>>> between regional actual population and predicted population (sum of
>>>>> household-inverse-probabilities). So my trick doesn't depend on luck -
>>>>> the residuals contain information on all households within a region.
>>>>> In the code that I pasted in my original email, notice the summation
>>>>> across households:
>>>>> egen double `pophat' = sum( (1+exp(b0+income*b1)) / exp(b0+income*b1)) ///
>>>>>     `if', by(`region')
>>>>> replace residual = (pop - `pophat') * oneiffirst
>>>>> The 'oneiffirst' is a binary indicator for one residual per region, my
>>>>> trick. By using that, I ensure that only one region-level residual is
>>>>> considered per region. Instead, I would have liked to use an 'if'
>>>>> statement (such as 'if oneiffirst'), so that Stata would know that
>>>>> there are only 2500 (region-level) observations. But Stata doesn't
>>>>> allow it. Is there another way to essentially restrict the sample
>>>>> inside of the function evaluator program - the sample in which the
>>>>> moments are evaluated - after GMM is called in a hhd-level dataset?
>>>>> I am not familiar with 'optimize()'. Will that let me declare samples
>>>>> so that I estimate a region-level regression in which moments are
>>>>> computed from a hhd-level equation?
>>>>> Thank you.
>>>>> Vladimir
>>>>> On Mon, Jul 1, 2013 at 1:17 AM, Austin Nichols <> wrote:
>>>>>> Vladimír Hlásny <>:
>>>>>> My question is: why try to trick -gmm- into doing an optimization it's
>>>>>> not designed for? You are trying to make the first residual within
>>>>>> group orthogonal to income; what if you got unlucky and the first case
>>>>>> in each group had zero income--hard to see how you could improve the
>>>>>> objective function, right?
>>>>>> Instead start with Mata's optimize() which can be used to roll your
>>>>>> own GMM and much else besides: see e.g.
>>>>>> On Sat, Jun 29, 2013 at 10:10 PM, Vladimír Hlásny <> wrote:
>>>>>>> Dear Austin:
>>>>>>> The model is definitely identified. Matlab runs the model well,
>>>>>>> because I can use household-level and region-level variables
>>>>>>> simultaneously. My trick in Stata also works, except that it produces
>>>>>>> imprecise results and occasionally fails to converge. (My current
>>>>>>> trick is to make Stata think that the model is at the household level,
>>>>>>> and manually set all but one hhd-level residual per region to
>>>>>>> zero.)
>>>>>>> Incomes of the responding households are my instrument.
>>>>>>> Essentially, because each region has a different survey-response-rate
>>>>>>> and different distribution of incomes of responding households, GMM
>>>>>>> estimates the relationship between households' response-probability
>>>>>>> and their income (subject to assumptions on representativeness of
>>>>>>> responding households).
>>>>>>> In sum:
>>>>>>> I need Stata to use region-level and household-level variables (or
>>>>>>> matrices) simultaneously. Specifically, Stata must minimize
>>>>>>> region-level residuals computed from a household-level logistic
>>>>>>> equation. E.g., if I feed household-level data into the GMM
>>>>>>> function-evaluator program, can I instruct the GMM to use only one
>>>>>>> residual per region?
>>>>>>> Vladimir
>>>>>>> On Sat, Jun 29, 2013 at 10:27 PM, Austin Nichols <> wrote:
>>>>>>>> Vladimír Hlásny <>:
>>>>>>>> I have not read the ref.  But you do not really have instruments. That
>>>>>>>> is, you are not setting E(Ze) to zero with e a residual from some
>>>>>>>> equation and Z your instrument; you do not have moments of that type.
>>>>>>>> Seems you should start with optimize() instead of -gmm-, as you are
>>>>>>>> just minimizing the sum of squared deviations from targets at the
>>>>>>>> region level. Or am I still misunderstanding this exercise?
>>>>>>>> On Fri, Jun 28, 2013 at 10:08 PM, Vladimír Hlásny <> wrote:
>>>>>>>>> Thanks for responding, Austin.
>>>>>>>>> The full reference is: Korinek, Mistiaen and Ravallion (2007), An
>>>>>>>>> econometric method of correcting for unit nonresponse bias in surveys,
>>>>>>>>> J. of Econometrics 136.
>>>>>>>>> My sample includes 12000 responding households. I know their income,
>>>>>>>>> and which of 2500 regions they come from. In addition, for each
>>>>>>>>> region, I know the number of non-responding households. I find the
>>>>>>>>> coefficient on income by fitting estimated regional population to
>>>>>>>>> actual population:
>>>>>>>>> P_i = logit f(income_i,theta)
>>>>>>>>> actual_j = responding_j + nonresponding_j
>>>>>>>>> theta = argmin {sum(1/P_i) - actual_j}
>>>>>>>>> Response probability may not be monotonic in income. The logit may be
>>>>>>>>> a non-monotonic function of income.
>>>>>>>>> Thanks for any thoughts on how to estimate this in Stata, or how to
>>>>>>>>> make my 'trick' (setting 12000-2500 hhd-level residuals manually to
>>>>>>>>> zero) work better.
>>>>>>>>> Vladimir
>>>>>>>>> On Sat, Jun 29, 2013 at 1:49 AM, Austin Nichols <> wrote:
>>>>>>>>>> Vladimír Hlásny <>:
>>>>>>>>>> As the FAQ hints, if you don't provide full references, don't expect
>>>>>>>>>> good answers.
>>>>>>>>>> I don't understand your description--how are you running a logit of
>>>>>>>>>> response on income, when you only have income for responders?  Can you
>>>>>>>>>> give a sense of what the data looks like?
>>>>>>>>>> On another topic, why would anyone expect response probability to be
>>>>>>>>>> monotonic in income?
>>>>>>>>>> On Fri, Jun 28, 2013 at 10:05 AM, Vladimír Hlásny <> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> I am using a method by Korinek, Mistiaen and Ravallion (2007) to
>>>>>>>>>>> correct for unit-nonresponse bias. That involves estimating
>>>>>>>>>>> response-probability for each household,  inferring regional
>>>>>>>>>>> population from these probabilities, and fitting against actual
>>>>>>>>>>> regional populations. I must use household-level data and region-level
>>>>>>>>>>> data simultaneously, because coefficients in the household-level model
>>>>>>>>>>> are adjusted based on fit of the regional-level populations.
>>>>>>>>>>> I used a trick - manually resetting to zero the residuals of all but
>>>>>>>>>>> one household per region - but this trick doesn't produce perfect
>>>>>>>>>>> results. Please find the details, remaining problems, as well as the
>>>>>>>>>>> Stata code described below. Any thoughts on this?
>>>>>>>>>>> Thank you for any suggestions!
>>>>>>>>>>> Vladimir Hlasny
>>>>>>>>>>> Ewha Womans University
>>>>>>>>>>> Seoul, Korea
>>>>>>>>>>> Details:
>>>>>>>>>>> I am estimating households' probability to respond to a survey as a
>>>>>>>>>>> function of their income. For each responding household (12000), I
>>>>>>>>>>> have data on income. Also, at the level of region (3000), I know the
>>>>>>>>>>> number of responding and non-responding households.
>>>>>>>>>>> I declare a logit equation of response-probability as a function of
>>>>>>>>>>> income, to estimate it for all responding households.
>>>>>>>>>>> The identification is provided by fitting of population in each
>>>>>>>>>>> region. For each responding household, I estimate their true mass as
>>>>>>>>>>> the inverse of their response probability. Then I sum these
>>>>>>>>>>> inverse probabilities over all households in a region, and fit the
>>>>>>>>>>> sum against the true population.
>>>>>>>>>>> Stata problem:
>>>>>>>>>>> I am estimating GMM at the regional level. But, to obtain the
>>>>>>>>>>> population estimate in each region, I calculate response-probabilities
>>>>>>>>>>> at the household level and sum their inverses within each region.
>>>>>>>>>>> This region-level fitting and response-probability estimation occur
>>>>>>>>>>> simultaneously/iteratively -- as logit-coefficients are adjusted to
>>>>>>>>>>> minimize region-level residuals, households' response-probabilities
>>>>>>>>>>> change.
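
The region-level objective discussed in this thread can be sketched as follows. This is only a minimal illustration, written in Python rather than Stata to make the arithmetic explicit. It assumes a two-parameter logistic response model and unweighted households; the function names, the (region, log income) tuple layout, and the toy data are all hypothetical, and the sampleweight and Winverse_sqrt scaling from the posted code are left out.

```python
import math
from collections import defaultdict


def regional_residuals(theta, households, population):
    """Return {region: actual population - predicted population}.

    theta      -- (b0, b1), coefficients of the logistic response model
    households -- iterable of (region, log_income), one per responding household
    population -- {region: actual number of households in the region}
    """
    b0, b1 = theta
    pop_hat = defaultdict(float)
    for region, log_income in households:
        xb = b0 + b1 * log_income
        # Logistic response probability P = exp(xb) / (1 + exp(xb)),
        # so the household's estimated mass is 1/P = 1 + exp(-xb).
        pop_hat[region] += 1.0 + math.exp(-xb)
    # One residual per region, built from every household in that region.
    return {r: population[r] - pop_hat[r] for r in population}


def objective(theta, households, population):
    """Sum of squared region-level residuals (unweighted; an assumption)."""
    residuals = regional_residuals(theta, households, population)
    return sum(e * e for e in residuals.values())


# Hypothetical toy data: three responding households in two regions.
toy_households = [("A", 0.0), ("A", 0.0), ("B", 1.0)]
toy_population = {"A": 4.0, "B": 3.0}
print(objective((0.0, 0.0), toy_households, toy_population))
```

A general-purpose minimizer (for example Mata's optimize(), as suggested above) would then be pointed at a function like objective. The number of residuals equals the number of regions, however many households each region contains, so no household-level residual has to be zeroed out by hand.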