


Re: st: Merge Panel Datasets

From   Phil Schumm
Subject   Re: st: Merge Panel Datasets
Date   Mon, 20 Jun 2011 10:39:11 -0500

On Jun 19, 2011, at 8:41 PM, Diana Beketova wrote:
It is totally true that I first had to create 'total foreign ownership' and 'total domestic ownership' in order to make one observation line out of many. But I first wanted just to try to merge both data files, so I can see whether this merge can succeed at all and where my weak points are.

Seems reasonable, though note that you could also do this with

    merge 1:m ID_NUMBER YEAR using file2, keepusing(ID_NUMBER YEAR)

(i.e., ignore for now the rest of the variables in the second file) which would cut down on your memory usage.
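A minimal sketch of how such a trial merge might be inspected. The master file name `file1` is hypothetical (only `file2` appears above), and the sketch assumes the master has one observation per ID_NUMBER and YEAR:

    * file1 is a hypothetical name for the master dataset
    use file1, clear
    * stop with an error if ID_NUMBER YEAR do not uniquely identify master observations
    isid ID_NUMBER YEAR
    merge 1:m ID_NUMBER YEAR using file2, keepusing(ID_NUMBER YEAR)
    * _merge: 1 = master only, 2 = using only, 3 = matched
    tabulate _merge
    * check whether the unmatched observations concentrate in particular years
    tabulate YEAR _merge

The cross-tabulation by YEAR is only one way to look for weak points; tabulating against other firm characteristics may be just as informative.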

I had an idea about building year clusters because I have a range of years 2002-2010. So I can build three 3-year clusters: 2002-2004, 2005-2007, 2008-2010. Within each of these clusters I can generate new variables for Total Assets and Oper. Revenue that will be averages of Total Assets and Oper. Revenue within this cluster. Because ownership is so oddly distributed, there is a high probability that there will be only one observation per year cluster. At the end I would use the Heckman correction method in order to correct for selection bias, or alternatively a Tobit model for censored variables. Do you think this methodology could be reasonable to use? Otherwise, I don’t know how to match these two files. I have to say that the data come from an emerging market and are very biased and incomplete. Maybe you know further ways to deal with the bias problem?

I don't see how your "cluster" strategy is related to the use of a selection model (e.g., Heckman) or censored regression model (e.g., Tobit). Moreover, I know absolutely nothing about this substantive area, so I cannot comment intelligently on your strategy. Grouping three years together may affect your results (e.g., it will smooth out year-to-year changes), so at a minimum, you would need to do a sensitivity analysis to see how your choice of endpoints (including size of "cluster") affects things. Of perhaps less importance, you might also want to take account of the fact that a mean of three years has different properties than a mean of only one year (if the data for the other two years are missing).
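To make the last two points concrete, here is a rough sketch of how the grouped means might be built so that the endpoints and width are easy to vary, and so that the number of years behind each mean is recorded. The variable names `total_assets` and `oper_revenue` are hypothetical stand-ins for whatever the variables are actually called, and the code assumes it is run from a do-file:

    * total_assets and oper_revenue are hypothetical variable names
    local start 2002    // first year of the first period; vary for sensitivity
    local width 3       // years per period; vary for sensitivity

    * with start 2002 and width 3: 2002-2004 -> 0, 2005-2007 -> 1, 2008-2010 -> 2
    generate period = floor((YEAR - `start') / `width')

    * period means, plus the number of non-missing years behind each mean
    bysort ID_NUMBER period: egen mean_assets  = mean(total_assets)
    bysort ID_NUMBER period: egen n_assets     = count(total_assets)
    bysort ID_NUMBER period: egen mean_revenue = mean(oper_revenue)
    bysort ID_NUMBER period: egen n_revenue    = count(oper_revenue)

    * observation-level view of how many means rest on 1, 2, or 3 years
    tabulate n_assets

Rerunning the analysis with different values of `start` and `width` is one simple form of the sensitivity analysis described above. Note that shifting `start` leaves the first and last periods covering fewer years than the others.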

Of critical importance before proceeding with any strategy is to have a good understanding of why the missing data are missing, and to think about what effects this might have on your results (even if you don't explicitly take account of this in your analysis).
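As a starting point only, one might describe where the missing values fall before thinking about why they are missing. The sketch below again uses the hypothetical variable names `total_assets` and `oper_revenue`, and assumes a Stata version that includes -misstable- (Stata 11 or later):

    * total_assets and oper_revenue are hypothetical variable names
    misstable summarize total_assets oper_revenue
    misstable patterns total_assets oper_revenue

    * share of observations with missing assets, by year
    generate byte miss_assets = missing(total_assets)
    tabulate YEAR miss_assets, row

This only describes the pattern of missingness; it cannot by itself say why the data are missing.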

-- Phil
