Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From:    Josh Hyman <hyman.josh@gmail.com>
To:      statalist@hsphsun2.harvard.edu
Subject: Re: st: Multivariate kernel regression
Date:    Mon, 29 Oct 2012 17:29:54 -0400
Hi Austin (and everyone),

Thanks a lot for looking into that .ado file and for your advice. Based on it, I decided to write my own code rather than use that .ado file. One point I'm still a little confused about is what to do with dichotomous covariates. Most kernels K(u), where u = (x_i - x_0), restrict attention to observations with abs(u) < 1, which means that for a dichotomous covariate the local regressions are estimated only off observations with x_i = x_0. Doesn't that seem problematic?

I wrote some very simple code (below) that creates a grid of the covariate combinations seen in my data, runs an order-zero local polynomial regression at each grid point (as a start; I can add the x's in deviation form, their interactions, etc. later), and fills in the y-hats. It is really basic, and I am not even including a bandwidth parameter yet. But I am running into a problem: 3 of my 4 covariates are dichotomous, so I am basically just getting a weighted average of Y among the observations with x_i = x_0, rather than using observations near x_0. Do you have any thoughts on this, or do you know whether it is conceptually OK to have dichotomous covariates in a multivariate kernel regression?

Thanks a lot as usual, and sorry for the long lag between my responses as I work through this.

* CREATE GROUP VARIABLE
egen group = group(lunchi white female newsch)

* CREATE YHAT VARIABLE, BLANK FOR NOW, TO BE FILLED IN BY THE LOOP BELOW
gen yhat = .

* LOOP THROUGH EACH COMBINATION OF X's SEEN IN THE DATA - 487 is the max of the group variable
forvalues g = 1/487 {

    * FOR THIS GROUP, STORE THE VALUE OF EACH COVARIATE IN A MACRO
    foreach x in lunchi white female newsch {
        sum `x' if group == `g', meanonly
        global `x'_`g' = r(mean)
    }

    * CREATE WEIGHT USING THE EPANECHNIKOV KERNEL - still have to add the bandwidths
    gen weight = ( (3/4)*(1 - (lunchi - ${lunchi_`g'})^2) ) * ///
                 ( (3/4)*(1 - (white  - ${white_`g'})^2) ) * ///
                 ( (3/4)*(1 - (female - ${female_`g'})^2) ) * ///
                 ( (3/4)*(1 - (newsch - ${newsch_`g'})^2) )

    * RUN WEIGHTED (ORDER-ZERO) REGRESSION ON OBSERVATIONS INSIDE THE KERNEL SUPPORT
    reg college_yn_1 [pw=weight] if abs(lunchi - ${lunchi_`g'}) < 1 & ///
        abs(white - ${white_`g'}) < 1 & abs(female - ${female_`g'}) < 1 & ///
        abs(newsch - ${newsch_`g'}) < 1

    * CREATE THE PREDICTED VALUE AT THIS COMBINATION OF X's
    predict temp if e(sample), xb
    replace yhat = temp if group == `g'

    * DROP THE WEIGHT AND TEMP VARIABLES CREATED DURING THE LOOP
    drop weight temp
}
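On the bandwidth and dichotomous-covariate issues above, here is a minimal sketch (not from the thread) of (i) how a bandwidth h could enter the continuous part of the product weight, and (ii) a discrete (Aitchison-Aitken-type) kernel that down-weights, rather than drops, observations whose 0/1 covariate differs from the evaluation point. The variable "score", the evaluation point, h, and lambda are all placeholders:

* assumed example: "score" is a continuous covariate and "white" is dichotomous;
* the evaluation point (x0_score, x0_white), h, and lambda are made up
local x0_score = 50
local x0_white = 1
local h        = 5
local lambda   = 0.25

* continuous covariate: Epanechnikov weight with bandwidth h
gen double u      = (score - `x0_score') / `h'
gen double w_cont = (3/4) * (1 - u^2) * (abs(u) < 1)

* dichotomous covariate: with any h <= 1 only exact matches get positive weight,
* so a discrete kernel down-weights, rather than drops, observations with the
* other value; lambda = 0 reproduces exact matching
gen double w_bin  = cond(white == `x0_white', 1 - `lambda', `lambda')

* product kernel weight for the local regression at this evaluation point
gen double weight = w_cont * w_bin

With lambda = 0 this reduces to the exact matching that the Epanechnikov weights impose on a 0/1 covariate; with lambda > 0 the fit at each point borrows information from observations in the other cell, which is one common way to handle dichotomous covariates in a multivariate kernel regression.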
On Fri, Oct 19, 2012 at 6:03 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>
> Josh Hyman <hyman.josh@gmail.com>:
> Just looked at http://faculty.wcas.northwestern.edu/~cfm754/bounds_stata.pdf briefly, but it does not seem to have been written by someone well versed in Stata programming.
>
> More substantively: it seems to compute the univariate ROT bandwidth for kernel density estimates (see p. 892 of the Stata manual entry on -kdensity- for the same formula), not conditional mean (polynomial order zero) estimates (see p. 1009 of the Stata manual entry on -lpoly- for the very different formula), in each dimension completely separately, which seems like a terrible idea. You would be better off computing the Mahalanobis distance and using a conic kernel, then doing some kind of cross-validation to get a good bandwidth. Plus, that kernreg.ado just computes zero-order polynomial regressions, so you are much better off writing your own program that estimates a linear surface (hyperplane) at each point.
>
> On Fri, Oct 19, 2012 at 12:20 PM, Josh Hyman <hyman.josh@gmail.com> wrote:
> > Thank you so much Austin and Shan.
> >
> > Shan - I very much appreciate your pointing out the .ado files on Manski's webpage, in particular kernreg.ado and gridgen.ado. These will be a great place for me to start, and seem to be very similar to what Austin recommended I try starting with. Ideally I would like to use slightly more than 4 covariates, but this is terrific for now, and I will see if I can augment the code to accept a few more.
> >
> > Austin - Thanks a lot for your suggestions. I met with John DiNardo recently about this project, but haven't asked him about the multivariate kernel regression. I sent him an email yesterday to see if he will discuss this with me. I will begin by coding up your suggestion to help me understand. Your explanation was very helpful for understanding how the multivariate kernel regression operates.
> >
> > Thanks again to you both! This was my first time posting a question to the Stata listserv, and I found it incredibly helpful.
> > Thanks,
> > Josh
> >
> > On Wed, Oct 17, 2012 at 2:25 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> >>
> >> Josh Hyman <hyman.josh@gmail.com>:
> >> Taking the mean of Y for values of X near X0 *is* a regression; you are calculating the conditional mean of Y. What you describe is a zero-degree local polynomial regression in -lpoly- (a regression on just a constant), which is inadvisable (though it is the -lpoly- default behavior) for the reasons given in the -lpoly- manual entry. Better to regress on X and interactions (all in deviation form from the point X0) and predict at X=X0. I recommend you start with a simple example with, say, 100 values of a one-dimensional X and try calculating the means of Y at (say) 10 values using a couple of different approaches, to get a sense of what you are doing. Then generalize to 100*100 values of X1 and X2 and calculate mean Y at (say) 100 points on that grid.
> >>
> >> Did you look at http://fmwww.bc.edu/repec/bocode/t/tddens (multivariate kernel density estimation)?
> >>
> >> Ask John DiNardo if you have conceptual questions--if he is currently accessible to you at the Ford school--the big ideas may be easier to explain in person.
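To make the one-dimensional exercise described just above concrete, here is a minimal sketch (not from the thread): 100 simulated values of a one-dimensional X, and a local linear fit, with X in deviation form from each of 10 grid points, using the epan2 (Epanechnikov) kernel. The simulated data, bandwidth, and grid are made up for illustration:

* simulate 100 observations of a one-dimensional x and an outcome y
clear
set obs 100
set seed 12345
gen double x = 10 * runiform()
gen double y = sin(x) + rnormal(0, 0.3)

* local linear fit at 10 grid points, Epanechnikov kernel, bandwidth h (made up)
local h = 1.5
gen double x0   = .
gen double yhat = .
forvalues i = 1/10 {
    local g = `i' - 0.5                     // grid points 0.5, 1.5, ..., 9.5
    quietly {
        gen double u  = (x - `g') / `h'
        gen double w  = (3/4) * (1 - u^2) * (abs(u) < 1)
        gen double xd = x - `g'             // X in deviation form from X0
        reg y xd [aw=w] if w > 0            // intercept = fitted E[y | x = x0]
        replace x0   = `g'       in `i'
        replace yhat = _b[_cons] in `i'
        drop u w xd
    }
}
list x0 yhat in 1/10
* -lpoly y x, degree(1) kernel(epan2) bwidth(`h')- estimates the same object
* over its own grid and can be used as a cross-check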
> >> On Wed, Oct 17, 2012 at 1:04 PM, Josh Hyman <hyman.josh@gmail.com> wrote:
> >> > Hi Austin (and others),
> >> >
> >> > Thank you very much for your reply. Sorry about my delayed response - I wanted to investigate more to make sure I understood your suggestion.
> >> >
> >> > I'm not sure your suggestion gets me exactly what I was looking for, and I want to clarify. My reference to -lpoly- in my initial post may have been confusing. I don't actually want to do kernel-weighted local regressions. I want to estimate "multivariate kernel regression", which, to my understanding, doesn't actually involve any regressions at all. It takes the weighted average of Y over all observations near the particular value of X, weighted using the kernel function, where X represents more than 2 variables. So this actually seems the same to me as multivariate kernel density estimation, for which I also don't see any user-written commands in Stata. What I am looking for, I guess, is something like a version of -kdens2- that allows for more than one "xvar" and would not output a graph (since it would be in more than 3 dimensions), but rather would output the fitted or predicted values of Y (like -predict, xb-) for each observation.
> >> >
> >> > Regardless, it sounds like, given your suggestion, one way to do this is to loop over all possible combinations of the values of the X variables and calculate the weighted Y for each combination using the kernel of my choice? Please let me know whether that would be your suggestion, whether, given my further clarification, you know of any user-written commands in Stata to do this, or whether you have any other suggestions.
> >> >
> >> > Thanks a lot for your help, and sorry again for the delayed response.
> >> > Josh
> >> >
> >> > On Fri, Oct 12, 2012 at 3:31 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> >> >> Josh Hyman <hyman.josh@gmail.com>:
> >> >> If you know the multivariate kernel you want to use, and the grid you want to smooth over, it is straightforward to loop over the grid and compute the regressions. To program a general estimator for a wide class of kernels would be substantially more work. See e.g. -kdens- on SSC and
> >> >> http://fmwww.bc.edu/repec/bocode/m/mf_mm_kern
> >> >> http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf
> >> >>
> >> >> A simple conic (triangle) kernel in 2 dimensions is easiest, see e.g.
> >> >> http://fmwww.bc.edu/repec/bocode/t/tddens
> >> >>
> >> >> On Fri, Oct 12, 2012 at 1:49 PM, Josh Hyman <hyman.josh@gmail.com> wrote:
> >> >>> Dear Statalist users,
> >> >>>
> >> >>> I am trying to figure out if there is a way in Stata to perform multivariate kernel regression. I have investigated online and on the Statalist, but with no success. What I am looking for would be similar conceptually to the -lpoly- command, but with the ability to enter more than one "xvar".
> >> >>>
> >> >>> If there are no Stata commands to do this (user-written or otherwise), then do you recommend coding up a program to do this manually? I have used Stata for many years and written programs before, but have never had to code up a regression manually. If you have suggestions on how to do this, or resources to consult, that would be greatly appreciated.
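For completeness, a minimal sketch of the Mahalanobis-distance / conic-kernel weighting Austin suggests in his October 19 reply above, written out for two covariates. The variable names x1, x2, and y, the evaluation point (a, b), and the bandwidth h are placeholders:

* assumed example: two covariates x1 and x2, outcome y, evaluation point (a, b);
* the names, the point, and the bandwidth h are made up
local a = 0.5
local b = 1.2
local h = 2

* inverse covariance matrix of the covariates, for the Mahalanobis distance
quietly correlate x1 x2, covariance
matrix S    = r(C)
matrix Sinv = invsym(S)

* Mahalanobis distance of each observation from (a, b), written out for the
* two-covariate case (Mata would be more convenient with many covariates)
gen double d = sqrt( Sinv[1,1]*(x1 - `a')^2                  ///
                   + 2*Sinv[1,2]*(x1 - `a')*(x2 - `b')       ///
                   + Sinv[2,2]*(x2 - `b')^2 )

* conic (triangle) kernel weight with bandwidth h; zero outside the cone
gen double w = max(0, 1 - d/`h')

* local linear fit at (a, b) with covariates in deviation form; the constant
* is the estimate of E[y | x1 = a, x2 = b]
gen double x1d = x1 - `a'
gen double x2d = x2 - `b'
reg y x1d x2d [aw=w] if w > 0
display _b[_cons]

Choosing h by cross-validation, as suggested, would mean repeating this fit over a grid of h values and keeping the one that minimizes an out-of-sample (for example, leave-one-out) squared prediction error.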