Re: st: Merge Panel Datasets
From: Phil Schumm <[email protected]>
To: [email protected]
Subject: Re: st: Merge Panel Datasets
Date: Mon, 20 Jun 2011 10:39:11 -0500
On Jun 19, 2011, at 8:41 PM, Diana Beketova wrote:
> It is totally true that I first had to create 'total foreign
> ownership' and 'total domestic ownership' in order to make one
> observation line out of many. But I first wanted just to try to
> merge both data files, so I can see if this merge can be successful
> at all and where my weak points to work on are.
Seems reasonable, though note that you could also do this with
    merge 1:m ID_NUMBER YEAR using file2, keepusing(ID_NUMBER YEAR)
(i.e., ignore for now the rest of the variables in the second file)  
which would cut down on your memory usage.
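As a rough sketch of how such a trial merge could be checked, the following builds two toy files and tabulates _merge afterwards. Only ID_NUMBER, YEAR, file2, and the merge line come from this thread; the data values, the tempfile name master, and the variables total_assets and owner_share are invented for illustration.

```stata
* Toy master file: one row per firm-year (values are invented)
clear
input ID_NUMBER YEAR total_assets
1 2002 100
1 2003 110
2 2002 50
end
tempfile master
save `master'

* Toy using file: possibly several owner rows per firm-year
clear
input ID_NUMBER YEAR owner_share
1 2002 60
1 2002 40
2 2002 100
3 2002 100
end
tempfile file2
save `file2'

* Trial merge that brings in only the key variables
use `master', clear
merge 1:m ID_NUMBER YEAR using `file2', keepusing(ID_NUMBER YEAR)

* _merge: 1 = master only, 2 = using only, 3 = matched
tabulate _merge
```

Rows with _merge equal to 1 or 2 are the firm-years found in only one of the files, which is where any matching problems would show up.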
> I had an idea about building year clusters because I have a range of
> years 2002-2010. So I can build 3x3 year clusters: 2002-2004,
> 2005-2007, 2008-2010. Within each of these clusters I can generate new
> variables for Total Assets and Oper. Revenue that will be averages
> of Total Assets and Oper. Revenue within this cluster. Because
> ownership is so oddly distributed, there is a high probability that
> there will be only one observation per year cluster. At the end I
> would use the Heckman correction method in order to correct for
> selection bias, or else a Tobit model for censored variables. Do you
> think this methodology could be reasonable to use? Otherwise, I
> don’t know how to match these two files. I have to say that the data
> come from an emerging market and are very biased and incomplete.
> Maybe you know further ways to deal with the bias problem?
I don't see how your "cluster" strategy is related to the use of a  
selection model (e.g., Heckman) or censored regression model (e.g.,  
Tobit).  Moreover, I know absolutely nothing about this substantive  
area, so I cannot comment intelligently on your strategy.  Grouping  
three years together may affect your results (e.g., it will smooth out  
year-to-year changes), so at a minimum, you would need to do a  
sensitivity analysis to see how your choice of endpoints (including  
size of "cluster") affects things.  Of perhaps less importance, you  
might also want to take account of the fact that a mean of three years  
has different properties than a mean of only one year (if the data for  
the other two years are missing).
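To make the last two points concrete, here is a minimal sketch that averages within three-year windows and records how many years actually contribute to each mean. The data values, the variable name total_assets, and the locals start and width are assumptions made up for illustration, not anything taken from the actual files.

```stata
* Toy firm-year panel with gaps (values are invented)
clear
input ID_NUMBER YEAR total_assets
1 2002 100
1 2003 110
1 2004 120
1 2007 90
2 2008 50
2 2009 70
end

* Window definition; change these to check sensitivity
local start 2002
local width 3

* With the values above: 2002-2004 -> 1, 2005-2007 -> 2, 2008-2010 -> 3
generate period = floor((YEAR - `start') / `width') + 1

* Window mean plus the number of non-missing years behind each mean
collapse (mean) mean_assets = total_assets (count) n_years = total_assets, by(ID_NUMBER period)

list, sepby(ID_NUMBER)
```

Rerunning this with a different start or width gives a simple sensitivity check, and n_years shows which means rest on a single year rather than three.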
Of critical importance before proceeding with any strategy is to have  
a good understanding of why the missing data are missing, and to think  
about what effects this might have on your results (even if you don't  
explicitly take account of this in your analysis).
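One possible first step in that direction, sketched here on invented data, is simply to describe where the gaps are. The variable names total_assets and oper_revenue and the helper miss_assets are assumptions for illustration; this only describes the pattern of missingness and says nothing about why the values are missing.

```stata
* Toy firm-year panel with missing values (values are invented)
clear
input ID_NUMBER YEAR total_assets oper_revenue
1 2002 100 20
1 2003 .   25
1 2004 120 .
2 2002 .   .
2 2003 50  10
end

* How many values are missing in each variable
misstable summarize total_assets oper_revenue

* Which variables tend to be missing together
misstable patterns total_assets oper_revenue

* Does missingness in total assets vary by year?
generate byte miss_assets = missing(total_assets)
tabulate YEAR miss_assets
```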
-- Phil