Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: "Stephen R. Clark" <stephenrclark@mchsi.com>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: How to Correctly Structure a CSV before Loading it into STATA
Date: Wed, 26 May 2010 23:42:45 -0500
Dear Statalisters:

Hello. I am a long-time member, but a first-time writer. I am using STATA/IC 10.1. I have primarily used STATA for cross-sectional analysis, but I now need to use it to engage in panel data analysis. Thankfully, from my reading of posts to this forum, I have learned that STATA has very powerful panel data analysis features.

Now, let me get to my question. I have an unbalanced panel of data that consists of 20 cross-sectional units (markets). Each of these markets contains a different number of time-series (daily) observations. These range from 31 days for the shortest market to 48 days for the longest market. I currently have the data in stacked (long) form in a CSV file. I am dealing with "relative dates," so I am just using integer values (not actual dates) for the date variable. The data are, somewhat arbitrarily, organized in this stacked format according to alphabetical order of the cross-section name.

To be as clear as possible, please let me specify in more detail how the data is arranged in the CSV file:

Relative-Day    Market (# of observations)    Dependent Variable    Independent Variables

Under the relevant headings, I have 43 observations for "Market A." I then have 41 observations for "Market B," and so on until "Market T" (the 20th and final market), which has 40 observations. The missing data values can arguably be considered as randomly missing, so I am not concerned about any potential inferential problems associated with having an unbalanced panel.

What I am concerned with is how to structure the data in the CSV file before importing it into STATA. Since the longest market has 48 observations, do I need to have 48 rows for each cross-section with blank cells where the data is missing? In other words, do I need to "artificially balance" the data before importing it into STATA? If not, then will I be fine leaving the data in stacked (long) format, given an unequal number of days for each of the cross-sections?
In considering my question, please be advised that my analysis will involve the use of lagged values of the dependent variable. In other words, I will be conducting dynamic panel data analysis. As such, I need STATA to recognize the panel structure of the data and not "lag into" the values for the preceding cross-section.

Finally, if I need to "artificially balance" the data prior to importing it into STATA, then should I enter the NA values at the beginning or at the end of the respective markets? For instance, say that I am dealing with Market A, which has 43 observations. With the maximum number of observations at 48, I would need to enter 5 NA values. Should I do this as:

NA NA NA NA NA 43 values

or as

43 values NA NA NA NA NA

Thanks in advance for your help.

Stephen Clark
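
Editorial illustration (not part of the original message): the scenario described above, a stacked (long) CSV with an unequal number of days per market and a need for lags that stay within each market, can be sketched in a short do-file. This is a minimal sketch under stated assumptions, not a confirmed answer from the list. The file name markets.csv and the variable names market, relday, and y are hypothetical; the post does not give the actual names. The commands themselves (insheet, encode, xtset, xtdescribe, and the L. time-series operator) are standard in Stata 10.

```stata
* Hypothetical file and variable names; adjust to the actual CSV headings.
* Read the stacked (long) CSV as it is, with no padding rows added.
insheet using "markets.csv", comma clear

* xtset needs a numeric panel identifier, so convert the market name
* (a string such as "Market A") into a labeled numeric variable.
encode market, generate(market_id)

* Declare the panel structure: market_id is the panel, relday is time.
* Panels may have different numbers of observations.
xtset market_id relday

* Summarize the participation pattern of the (unbalanced) panel.
xtdescribe

* Lag the dependent variable. Once the data are xtset, L. works within
* each panel, so the first day of a market gets a missing lag rather
* than the last value of the preceding market.
generate y_lag1 = L.y
```

Under these assumptions the stacked layout can be loaded without artificial balancing, because the lag is defined relative to each market's own time variable rather than to row order in the file.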