Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Multivariate kernel regression

From	"Millimet, Daniel" <[email protected]>
To	"<[email protected]>" <[email protected]>
Subject	Re: st: Multivariate kernel regression
Date	Mon, 29 Oct 2012 21:35:12 +0000

See papers by Jeff Racine and Qi Li on nonparametric regression with "mixed data types."  They show how to do kernel estimation with discrete and continuous covariates.

Sent from my iPad
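
The mixed-data idea referred to above can be written in one formula. This is a sketch in notation chosen here for illustration (q continuous covariates x^c with bandwidths h_s, r unordered discrete covariates x^d with smoothing parameters lambda_s, and a univariate kernel k); none of these symbols appear in the thread itself.

```latex
\hat g(x) = \frac{\sum_{i=1}^{n} y_i \, K_\gamma(x_i, x)}{\sum_{i=1}^{n} K_\gamma(x_i, x)},
\qquad
K_\gamma(x_i, x) =
  \prod_{s=1}^{q} \frac{1}{h_s}\, k\!\left(\frac{x^{c}_{is} - x^{c}_{s}}{h_s}\right)
  \prod_{s=1}^{r} \lambda_s^{\,\mathbf{1}\left(x^{d}_{is} \neq x^{d}_{s}\right)},
\qquad 0 \le \lambda_s \le 1 .
```

With lambda_s = 0 the discrete factor is an indicator, so only observations in the same cell as x contribute, which is the behavior the question below describes. With lambda_s = 1 the discrete covariate is smoothed out entirely. Values in between borrow information from neighboring cells.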

On Oct 29, 2012, at 4:31 PM, "Josh Hyman" <[email protected]> wrote:

> Hi Austin (and everyone),
> 
>   Thanks a lot for looking into that .ado file and for your advice.
> Based on it, I decided to write my own code rather than use that
> .ado file. One point I'm left a little confused about is what to do
> if I have dichotomous covariates. Most of the kernels, K(u), where
> u=(x_i - x_0), seem to restrict to observations where abs(u)<1,
> meaning that for dichotomous covariates the local regressions will
> only be estimated off of observations with x_i=x_0.
> Doesn't that seem problematic?
> 
> I wrote up some very simple code (below) to try creating a grid of
> values in my data and running order zero local polynomial regressions
> (as a start - can add x's in deviation terms, their interactions, etc
> later) and filling in the y-hats. This is really basic and I'm not
> even including a bandwidth parameter yet. But I am running into a
> problem because 3 of my 4 covariates are dichotomous, so I'm basically
> just getting a weighted average of Y's among the observations where
> x_i=x_0, as opposed to using observations near x_0.
> 
> Do you have any thoughts on this, or do you know if conceptually it is
> OK to have dichotomous covariates in a multivariate kernel regression?
> Thanks a lot as usual, and sorry for the long lag between my responses
> as I try to work through this.
> 
> #delimit ;
> 
> *CREATE GROUP VARIABLE;
> egen group = group(lunchi white female newsch);
> 
> *CREATE YHAT VARIABLE THAT IS BLANK FOR NOW BUT WILL BE FILLED IN THROUGHOUT THE BELOW LOOP;
> gen yhat = .;
> 
> *LOOP THROUGH EACH COMBINATION OF X's SEEN IN DATA - 487 is max of group var;
> forvalues g = 1/487 {;
> 
>    *FOR GROUP VALUE CREATE MACRO WITH VALUE OF EACH COVARIATE;
>    foreach x in lunchi white female newsch {;
>        sum `x' if group==`g';
>        global `x'_`g' = r(mean);
>    };
> 
>    *CREATE WEIGHT USING EPANECHNIKOV KERNEL- have to add the bandwidths;
>    gen weight = ( (3/4)*(1-(lunchi - $lunchi_`g')^2) )
>        * ( (3/4)*(1-(white - $white_`g')^2) )
>        * ( (3/4)*(1-(female - $female_`g')^2) )
>        * ( (3/4)*(1-(newsch - $newsch_`g')^2) );
> 
>    *RUN WEIGHTED REGRESSION;
>    reg college_yn_1 [pw=weight]
>        if abs(lunchi - $lunchi_`g')<1 & abs(white - $white_`g')<1
>         & abs(female - $female_`g')<1 & abs(newsch - $newsch_`g')<1;
> 
>    *CREATE PREDICTED VALUE AT THIS COMBINATION OF X's;
>    predict temp if e(sample), xb;
>    replace yhat = temp if group==`g';
> 
>    *DROP WEIGHT VARIABLE AND OTHER TEMP VARIABLE CREATED DURING LOOP;
>    drop weight temp;
> };
> 
> 
> On Fri, Oct 19, 2012 at 6:03 PM, Austin Nichols <[email protected]> wrote:
>> 
>> Josh Hyman <[email protected]>:
>> Just looked at
>> http://faculty.wcas.northwestern.edu/~cfm754/bounds_stata.pdf
>> briefly, but it does not seem to have been written by someone well
>> versed in Stata programming.
>> 
>> More substantively:
>> It seems to compute the univariate ROT bandwidth for kernel density
>> estimates (see p.892 of the Stata manual entry on -kdensity- for the
>> same formula), not conditional mean (polynomial order zero) estimates
>> (see p.1009 of the Stata manual entry on -lpoly- for the very
>> different formula), in each dimension completely separately, which
>> seems like a terrible idea.  You would be better off computing
>> Mahalanobis distance and using a conic kernel, then doing some kind of
>> cross validation to get a good bandwidth.  Plus, that kernreg.ado just
>> computes zero-order polynomial regressions, so you are much better off
>> writing your own program that estimates a linear surface (hyperplane)
>> at each point.
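
The recipe in the message above (Mahalanobis distance, a conic kernel, a local linear fit, and cross-validation for the bandwidth) can be summarized as follows. This is a sketch under assumed notation, not something stated in the thread: S is the sample covariance matrix of the covariates, h is a single scalar bandwidth, and \hat g_{-i} is the fit computed with observation i left out.

```latex
d_i(x_0) = \sqrt{(x_i - x_0)^{\top} S^{-1} (x_i - x_0)},
\qquad
w_i(x_0) = \max\!\left(0,\; 1 - \frac{d_i(x_0)}{h}\right),
\\[6pt]
(\hat a, \hat b) = \arg\min_{a,\,b} \sum_{i=1}^{n} w_i(x_0)\,\bigl(y_i - a - (x_i - x_0)^{\top} b\bigr)^2,
\qquad
\hat g(x_0) = \hat a,
\\[6pt]
\mathrm{CV}(h) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat g_{-i}(x_i; h)\bigr)^2 .
```

Because the regressors enter in deviation form, the intercept of the weighted regression is the fitted value at x_0. A bandwidth would then be chosen by minimizing CV(h) over a grid of candidate values.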
>> 
>> 
>> On Fri, Oct 19, 2012 at 12:20 PM, Josh Hyman <[email protected]> wrote:
>>> Thank you so much Austin and Shan.
>>> 
>>> Shan - I very much appreciate your pointing out the .ado files on
>>> Manski's webpage, in particular kernreg.ado and gridgen.ado . These
>>> will be a great place for me to start, and seem to be very
>>> similar to what Austin recommended I try starting with. Ideally I
>>> would like to use slightly more than 4 covariates, but this is
>>> terrific for now, and I will see if I can augment the code to accept a
>>> few more.
>>> 
>>> Austin - Thanks a lot for your suggestions. I met with John DiNardo
>>> recently about this project, but haven't asked him about the
>>> multivariate kernel regression. I sent him an email yesterday to see
>>> if he will discuss this with me. I will begin by coding up your
>>> suggestion to help me understand. Your explanation was very helpful
>>> for me in understanding how the multivariate kernel regression is
>>> operating.
>>> 
>>> Thanks again to you both! This was my first time posting a question to
>>> the Stata listserve, and I found it incredibly helpful.
>>> Thanks,
>>>  Josh
>>> 
>>> On Wed, Oct 17, 2012 at 2:25 PM, Austin Nichols
>>> <[email protected]> wrote:
>>>> 
>>>> Josh Hyman <[email protected]>:
>>>> Taking the mean of Y for values of X near X0 *is* a regression; you
>>>> are calculating the conditional mean of Y. What you describe is a
>>>> zero-degree local polynomial regression in -lpoly- (a regression on
>>>> just a constant), which is inadvisable (though -lpoly- default
>>>> behavior) for the reasons given in the -lpoly- manual entry. Better to
>>>> regress on X and interactions (all in deviation form from point X0)
>>>> and predict at X=X0.  I recommend you start with a simple example with
>>>> say 100 values of a one-dimensional X and try calculating the means of
>>>> Y at (say) 10 values using a couple different approaches, to get a
>>>> sense of what you are doing.  Then generalize to 100*100 values of X1
>>>> and X2 and calculate mean Y at (say) 100 points on that grid.
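
The one-dimensional exercise suggested above might look like the following do-file. It is a minimal sketch on simulated data: the variable names (x, y, x0, yhat0, yhat1, dx, w), the data-generating process, the Epanechnikov kernel, and the hand-picked bandwidth h = 0.15 are all assumptions made for illustration, not choices taken from the thread.

```stata
* Sketch only: simulated data and hypothetical names throughout.
clear
set seed 12345
set obs 100
generate x = runiform()
generate y = sin(2*_pi*x) + rnormal(0, 0.2)

local h = 0.15              // bandwidth picked by hand, not optimized
generate x0    = .          // evaluation points (stored in rows 1-10)
generate yhat0 = .          // degree-0 fit: kernel-weighted mean of y
generate yhat1 = .          // degree-1 fit: local linear regression

forvalues j = 1/10 {
    local x0 = (`j' - 0.5)/10          // grid: 0.05, 0.15, ..., 0.95
    quietly {
        replace x0 = `x0' in `j'
        generate dx = x - `x0'         // covariate in deviation form
        * Epanechnikov weight, zero outside the bandwidth
        generate w = max(0, 0.75*(1 - (dx/`h')^2))
        * regression on a constant only = weighted mean near x0
        regress y [aweight = w] if w > 0
        replace yhat0 = _b[_cons] in `j'
        * regression on dx; the intercept is the prediction at x = x0
        regress y dx [aweight = w] if w > 0
        replace yhat1 = _b[_cons] in `j'
        drop dx w
    }
}

list x0 yhat0 yhat1 in 1/10
```

Comparing yhat0 and yhat1 near the ends of the grid shows where the two approaches tend to differ most, since the degree-0 fit has only one-sided data there. Extending to two covariates means adding a second deviation term (and, if wanted, the interaction) and multiplying the two kernel weights.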
>>>> 
>>>> Did you look at http://fmwww.bc.edu/repec/bocode/t/tddens
>>>> (multivariate kernel density estimation)?
>>>> 
>>>> Ask John DiNardo if you have conceptual questions--if he is currently
>>>> accessible to you at the Ford school--the big ideas may be easier to
>>>> explain in person.
>>>> 
>>>> On Wed, Oct 17, 2012 at 1:04 PM, Josh Hyman <[email protected]>
>>>> wrote:
>>>>> Hi Austin (and others),
>>>>> 
>>>>> Thank you very much for your reply. Sorry about my delayed response -
>>>>> I wanted to investigate more to make sure I understood your
>>>>> suggestion.
>>>>> 
>>>>> I'm not sure your suggestion gets me exactly what I was looking for,
>>>>> and I want to clarify. My reference to -lpoly- in my initial post may
>>>>> have been confusing. I don't actually want to do kernel-weighted
>>>>> local regressions. I want to estimate "multivariate kernel
>>>>> regression", which to my understanding, doesn't actually involve
>>>>> any regressions at all. It takes the weighted average of Y for
>>>>> all observations near to
>>>>> the particular value of X, weighted using the kernel function. And
>>>>> where X represents more than 2 variables. So, this actually seems the
>>>>> same to me as multivariate kernel density estimation, which I also
>>>>> don't see any user-written commands for in Stata. What I am looking
>>>>> for, I guess is like a version of -kdens2- that allows for more than
>>>>> one "xvar", and wouldn't output a graph (since it would be in greater
>>>>> than 3 dimensions), but rather would output the fitted or predicted
>>>>> values of the Y (like -predict, xb-) for each observation.
>>>>> 
>>>>> Regardless, it sounds like given your suggestion, one way to do this
>>>>> is to loop over all possible combinations of the values of the X
>>>>> variables and calculate the weighted Y for each combination using the
>>>>> kernel of my choice? Please let me know if this would be your
>>>>> suggestion, or if given my further clarification, if you know of any
>>>>> user-written commands in Stata to do this, or if you have any other
>>>>> suggestions.
>>>>> 
>>>>> Thanks a lot for your help, and sorry again for the delayed response.
>>>>> Josh
>>>>> 
>>>>> 
>>>>> On Fri, Oct 12, 2012 at 3:31 PM, Austin Nichols
>>>>> <[email protected]> wrote:
>>>>>> Josh Hyman <[email protected]>:
>>>>>> If you know the multivariate kernel you want to use, and the grid you
>>>>>> want to smooth over, it is straightforward to loop over the grid and
>>>>>> compute the regressions.  To program a general estimator for a wide
>>>>>> class of kernels would be substantially more work.  See e.g. -kdens-
>>>>>> on SSC and
>>>>>> http://fmwww.bc.edu/repec/bocode/m/mf_mm_kern
>>>>>> http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf
>>>>>> 
>>>>>> A simple conic (triangle) kernel in 2 dimensions is easiest, see e.g.
>>>>>> http://fmwww.bc.edu/repec/bocode/t/tddens
>>>>>> 
>>>>>> On Fri, Oct 12, 2012 at 1:49 PM, Josh Hyman <[email protected]>
>>>>>> wrote:
>>>>>>> Dear Statalist users,
>>>>>>> 
>>>>>>> I am trying to figure out if there is a way in Stata to perform
>>>>>>> multivariate kernel regression. I have investigated online and on the
>>>>>>> Statalist, but with no success. What I am looking for would be similar
>>>>>>> conceptually to the -lpoly- command, but with the ability to enter more
>>>>>>> than one "xvar".
>>>>>>> 
>>>>>>> If there are no Stata commands to do this (user-written or otherwise), then
>>>>>>> do you recommend coding up a program to do this manually? I have used Stata
>>>>>>> for many years, and written programs before, but have never had to code up
>>>>>>> a regression manually. If you have suggestions on how to do this, or
>>>>>>> resources to consult, that would be greatly appreciated.
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
