
# Re: st: Multivariate kernel regression

| From | To | Subject | Date |
| --- | --- | --- | --- |
| Josh Hyman | statalist@hsphsun2.harvard.edu | Re: st: Multivariate kernel regression | Mon, 29 Oct 2012 17:29:54 -0400 |

```
Hi Austin (and everyone),

Based on your advice, I decided to write my own code rather than use
that .ado file. One point I'm still a little confused about is what to
do if I have dichotomous covariates. Most of the kernels, K(u), where
u = (x_i - x_0), seem to restrict to observations where abs(u) < 1,
meaning that for dichotomous covariates the local regressions will be
estimated only from observations with x_i = x_0. Doesn't that seem
problematic?

I wrote up some very simple code (below) to try creating a grid of
values in my data, running order-zero local polynomial regressions
(as a start; I can add x's in deviation terms, their interactions, etc.
later), and filling in the y-hats. This is really basic, and I'm not
even including a bandwidth parameter yet. But I am running into a
problem because 3 of my 4 covariates are dichotomous, so I'm basically
just getting a weighted average of Y's among the observations where
x_i = x_0, as opposed to using observations near x_0.

Do you have any thoughts on this, or do you know if conceptually it is
OK to have dichotomous covariates in a multivariate kernel regression?
Thanks a lot as usual, and sorry for the long lag between my responses
as I try to work through this.

#delimit ;

*CREATE GROUP VARIABLE;
egen group = group(lunchi white female newsch);

*CREATE YHAT VARIABLE THAT IS BLANK FOR NOW BUT WILL BE FILLED IN
THROUGHOUT THE BELOW LOOP;
gen yhat = .;

*LOOP THROUGH EACH COMBINATION OF X's SEEN IN DATA - 487 is max of group var;
forvalues g = 1/487 {;

*FOR GROUP VALUE CREATE MACRO WITH VALUE OF EACH COVARIATE;
foreach x in lunchi white female newsch {;
sum `x' if group==`g';
global `x'_`g' = r(mean);
};

*CREATE WEIGHT USING EPANECHNIKOV KERNEL- have to add the bandwidths;
gen weight = ( (3/4)*(1-(lunchi - ${lunchi_`g'})^2) )
           * ( (3/4)*(1-(white  - ${white_`g'})^2) )
           * ( (3/4)*(1-(female - ${female_`g'})^2) )
           * ( (3/4)*(1-(newsch - ${newsch_`g'})^2) );

*RUN WEIGHTED REGRESSION;
reg college_yn_1 [pw=weight]
    if abs(lunchi - ${lunchi_`g'})<1 & abs(white  - ${white_`g'})<1
     & abs(female - ${female_`g'})<1 & abs(newsch - ${newsch_`g'})<1;

*CREATE PREDICTED VALUE AT THIS COMBINATION OF X's;
predict temp if e(sample), xb;
replace yhat = temp if group==`g';

*DROP WEIGHT VARIABLE AND OTHER TEMP VARIABLE CREATED DURING LOOP;
drop weight temp;
};

On Fri, Oct 19, 2012 at 6:03 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>
> Josh Hyman <hyman.josh@gmail.com>:
> Just looked at
> http://faculty.wcas.northwestern.edu/~cfm754/bounds_stata.pdf
> briefly, but it does not seem to have been written by someone well
> versed in Stata programming.
>
> More substantively:
> It seems to compute the univariate ROT bandwidth for kernel density
> estimates (see p.892 of the Stata manual entry on -kdensity- for the
> same formula), not conditional mean (polynomial order zero) estimates
> (see p.1009 of the Stata manual entry on -lpoly- for the very
> different formula), in each dimension completely separately, which
> seems like a terrible idea.  You would be better off computing
> Mahalanobis distance and using a conic kernel, then doing some kind of
> cross validation to get a good bandwidth.  Plus, that kernreg.ado just
> computes zero-order polynomial regressions, so you are much better off
> writing your own program that estimates a linear surface (hyperplane)
> at each point.
>
>
> On Fri, Oct 19, 2012 at 12:20 PM, Josh Hyman <hyman.josh@gmail.com> wrote:
> > Thank you so much Austin and Shan.
> >
> > Shan - I very much appreciate your pointing out the .ado files on
> > will be a great place for me to start, and seem to be very
> > similar to what Austin recommended I try starting with. Ideally I
> > would like to use slightly more than 4 covariates, but this is
> > terrific for now, and I will see if I can augment the code to accept a
> > few more.
> >
> > Austin - Thanks a lot for your suggestions. I met with John DiNardo
> > multivariate kernel regression. I sent him an email yesterday to see
> > if he will discuss this with me. I will begin by coding up your
> > suggestion to help me understand. Your explanation was very helpful
> > for me in understanding how the multivariate kernel regression is
> > operating.
> >
> > Thanks again to you both! This was my first time posting a question to
> > the Stata listserve, and I found it incredibly helpful.
> > Thanks,
> >   Josh
> >
> > On Wed, Oct 17, 2012 at 2:25 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> >>
> >> Josh Hyman <hyman.josh@gmail.com>:
> >> Taking the mean of Y for values of X near X0 *is* a regression; you
> >> are calculating the conditional mean of Y. What you describe is a
> >> zero-degree local polynomial regression in -lpoly- (a regression on
> >> just a constant), which is inadvisable (though -lpoly- default
> >> behavior) for the reasons given in the -lpoly- manual entry. Better to
> >> regress on X and interactions (all in deviation form from point X0)
> >> and predict at X=X0.  I recommend you start with a simple example with
> >> say 100 values of a one-dimensional X and try calculating the means of
> >> Y at (say) 10 values using a couple different approaches, to get a
> >> sense of what you are doing.  Then generalize to 100*100 values of X1
> >> and X2 and calculate mean Y at (say) 100 points on that grid.
> >>
> >> Did you look at http://fmwww.bc.edu/repec/bocode/t/tddens
> >> (multivariate kernel density estimation)?
> >>
> >> Ask John DiNardo if you have conceptual questions--if he is currently
> >> accessible to you at the Ford school--the big ideas may be easier to
> >> explain in person.
> >>
> >> On Wed, Oct 17, 2012 at 1:04 PM, Josh Hyman <hyman.josh@gmail.com> wrote:
> >> > Hi Austin (and others),
> >> >
> >> > I wanted to investigate more to make sure I understood your suggestion.
> >> >
> >> > I'm not sure your suggestion gets me exactly what I was looking for,
> >> > and I want to clarify. My reference to -lpoly- in my initial post may
> >> > have been confusing. I don't actually want to do kernel-weighted local
> >> > regressions. I want to estimate "multivariate kernel regression",
> >> > which, to my understanding, doesn't actually involve any regressions at
> >> > all. It takes the weighted average of Y for all observations near to
> >> > the particular value of X, weighted using the kernel function. And
> >> > where X represents more than 2 variables. So, this actually seems the
> >> > same to me as multivariate kernel density estimation, which I also
> >> > don't see any user-written commands for in Stata. What I am looking
> >> > for, I guess is like a version of -kdens2- that allows for more than
> >> > one "xvar", and wouldn't output a graph (since it would be in greater
> >> > than 3 dimensions), but rather would output the fitted or predicted
> >> > values of the Y (like -predict, xb-) for each observation.
> >> >
> >> > Regardless, it sounds like given your suggestion, one way to do this
> >> > is to loop over all possible combinations of the values of the X
> >> > variables and calculate the weighted Y for each combination using the
> >> > kernel of my choice? Please let me know if this would be your
> >> > suggestion, or if given my further clarification, if you know of any
> >> > user-written commands in Stata to do this, or if you have any other
> >> > suggestions.
> >> >
> >> > Thanks a lot for your help, and sorry again for the delayed response.
> >> > Josh
> >> >
> >> >
> >> > On Fri, Oct 12, 2012 at 3:31 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> >> >> Josh Hyman <hyman.josh@gmail.com>:
> >> >> If you know the multivariate kernel you want to use, and the grid you
> >> >> want to smooth over, it is straightforward to loop over the grid and
> >> >> compute the regressions.  To program a general estimator for a wide
> >> >> class of kernels would be substantially more work.  See e.g. -kdens-
> >> >> on SSC and
> >> >> http://fmwww.bc.edu/repec/bocode/m/mf_mm_kern
> >> >> http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf
> >> >>
> >> >> A simple conic (triangle) kernel in 2 dimensions is easiest, see e.g.
> >> >> http://fmwww.bc.edu/repec/bocode/t/tddens
> >> >>
> >> >> On Fri, Oct 12, 2012 at 1:49 PM, Josh Hyman <hyman.josh@gmail.com> wrote:
> >> >>> Dear Statalist users,
> >> >>>
> >> >>> I am trying to figure out if there is a way in Stata to perform
> >> >>> multivariate kernel regression. I have investigated online and on the
> >> >>> Statalist, but with no success. What I am looking for would be similar
> >> >>> conceptually to the -lpoly- command, but with the ability to enter more
> >> >>> than one "xvar".
> >> >>>
> >> >>> If there are no Stata commands to do this (user-written or otherwise),
> >> >>> then do you recommend coding up a program to do this manually? I have
> >> >>> used Stata for many years, and written programs before, but have never
> >> >>> had to code up a regression manually. If you have suggestions on how to
> >> >>> do this, or resources to consult, that would be greatly appreciated.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
```
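
The thread suggests computing Mahalanobis distance, weighting with a conic (triangle) kernel, and fitting a linear surface in deviations from X0. A minimal sketch of that idea at a single evaluation point might look like the following. Everything here is an assumption made for illustration: the data are simulated, and the variable names (`x`, `female`, `y`), the evaluation point, and the bandwidth `h` are hypothetical and not taken from the original data. The bandwidth is picked arbitrarily; the thread recommends choosing it by cross-validation, which is not shown.

```stata
* Sketch only: simulated data; x, female, y, h, and the evaluation point are hypothetical
clear
set seed 12345
set obs 1000
generate x      = rnormal()
generate female = runiform() < 0.5
generate y      = 1 + x + 0.5*female + rnormal()

* Evaluation point X0 = (x0, f0) and bandwidth h in Mahalanobis units (arbitrary choices)
local x0 = 0.5
local f0 = 1
local h  = 3

* Deviations from X0
generate dx   = x - `x0'
generate dfem = female - `f0'

* Mahalanobis distance: sqrt( (X - X0)' S^-1 (X - X0) ), S = sample covariance matrix
quietly correlate x female, covariance
matrix Sinv = invsym(r(C))
generate dist = sqrt(Sinv[1,1]*dx^2 + 2*Sinv[1,2]*dx*dfem + Sinv[2,2]*dfem^2)

* Conic (triangle) kernel: weight declines linearly to zero at distance h
generate w = max(0, 1 - dist/`h')

* A balanced 0/1 variable has sd near 0.5, so observations with female != f0
* sit roughly 2 or more units away and receive weight only when h exceeds that
tabulate female if w > 0

* Local linear fit in deviation form; the constant estimates E[y | X = X0]
regress y dx dfem [pweight = w] if w > 0
display "Fitted mean of y at X0: " _b[_cons]
```

Because distance is scaled by the covariance matrix, whether observations with the other value of a dichotomous covariate contribute depends on the bandwidth rather than on a fixed cutoff of abs(u) < 1. Whether borrowing across categories in this way is appropriate for a given application remains a judgment call. To fill in fitted values for every covariate combination, the block from the deviations onward could be wrapped in a loop over evaluation points, as in the loop posted above.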