


st: How to Correctly Structure a CSV before Loading it into STATA

From   "Stephen R. Clark" <>
To   <>
Subject   st: How to Correctly Structure a CSV before Loading it into STATA
Date   Wed, 26 May 2010 23:42:45 -0500

Dear Statalisters:

Hello.  I am a long-time member, but a first-time writer.

I am using STATA/IC 10.1.  

I have primarily used STATA for cross-sectional analysis, but I now need to
use it to engage in panel data analysis.  Thankfully, from my reading of
posts to this forum, I have learned that STATA has very powerful panel data
analysis features. 

Now, let me get to my question.  I have an unbalanced panel of data that
consists of 20 cross-sectional units (markets). Each of these markets
contains a different number of time-series (daily) observations. These range
from 31 days for the shortest market to 48 days for the longest market.

I currently have the data in stacked (long) form in a CSV file.  I am
dealing with "relative dates," so I am just using integer values (not actual
dates) for the date variable.  The data are, somewhat arbitrarily, organized
in this stacked format according to alphabetical order of the cross-section
name. To be as clear as possible, please let me specify in more detail how
the data are arranged in the CSV file:

Relative-Day   Market (# of observations)   Dependent Variable   Independent

Under the relevant headings, I have 43 observations for "Market A." I then
have 41 observations for "Market B," and so on until "Market T" (the 20th
and final market), which has 40 observations. 
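To make the layout concrete, here is a minimal sketch in Python (not Stata syntax) of a stacked file with unequal numbers of rows per market. The column names (`rel_day`, `market`, `y`, `x`) and all values are invented for illustration and are not the actual data.

```python
import csv
import io
from collections import Counter

# Hypothetical stacked (long) CSV: one row per market-day.
# Column names and values are made up for illustration only.
STACKED_CSV = """rel_day,market,y,x
1,A,0.5,1.2
2,A,0.7,1.1
3,A,0.6,1.3
1,B,1.4,0.9
2,B,1.5,1.0
"""


def rows_per_market(csv_text):
    """Count how many rows (days) each market contributes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["market"] for row in reader)


counts = rows_per_market(STACKED_CSV)
print(dict(counts))
```

In this toy file, Market A contributes three rows and Market B two, with no padding rows between them.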

The missing data values can arguably be considered as randomly missing, so I
am not concerned about any potential inferential problems associated with
having an unbalanced panel. What I am concerned with is how to structure the
data in the CSV file before importing it into STATA. 

Since the longest market has 48 observations, do I need to have 48 rows for
each cross-section with blank cells where the data are missing? In other
words, do I need to "artificially balance" the data before importing it into
STATA?  If not, then will I be fine leaving the data in stacked (long)
format, given an unequal number of days for each of the cross-sections?

In considering my question, please be advised that my analysis will involve
the use of lagged values of the dependent variable. In other words, I will
be conducting dynamic panel data analysis. As such, I need STATA to
recognize the panel structure of the data and not "lag into" the values for
the preceding cross-section.
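The behavior I am hoping for can be sketched as follows. This is a rough Python illustration only, under the assumption that a lag should be looked up by market and relative day rather than by row position. The function name `lag_within_market` and the sample values are hypothetical.

```python
def lag_within_market(rows):
    """Attach a one-day lag of y to each (market, rel_day, y) row.

    The lag is looked up by (market, rel_day - 1), so it is None when
    that market has no observation for the previous day, including the
    first day of every market.
    """
    by_key = {(market, day): y for market, day, y in rows}
    return [
        (market, day, y, by_key.get((market, day - 1)))
        for market, day, y in rows
    ]


# Made-up stacked data: Market A's rows are followed directly by Market B's.
rows = [
    ("A", 1, 0.5),
    ("A", 2, 0.7),
    ("A", 3, 0.6),
    ("B", 1, 1.4),
    ("B", 2, 1.5),
]

lagged = lag_within_market(rows)
for record in lagged:
    print(record)
```

The first row of Market B gets a missing lag here, rather than picking up the last value of Market A, which is the "lagging into" problem I want to avoid.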

Finally, if I need to "artificially balance" the data prior to importing it
into STATA, then should I enter the NA values at the beginning or at the end
of the respective markets? For instance, say that I am dealing with Market
A, which has 43 observations. With the maximum number of observations at 48,
I would need to enter 5 NA values. Should I do this as:

5 NA values
43 values

or as

43 values
5 NA values
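The two placements can be sketched in Python as follows. This is only an illustration of the padding itself, with placeholder integers standing in for Market A's 43 observations. The helper `pad_series` is hypothetical.

```python
def pad_series(values, target_len, at_start):
    """Pad a list with None up to target_len, at the start or at the end."""
    missing = [None] * (target_len - len(values))
    return missing + values if at_start else values + missing


# Placeholder values 1..43 stand in for Market A's 43 observations.
market_a = list(range(1, 44))

front = pad_series(market_a, 48, at_start=True)   # 5 NA values, then 43 values
back = pad_series(market_a, 48, at_start=False)   # 43 values, then 5 NA values

print(front[:6])
print(back[-6:])
```

If the relative day were assigned by row position, the choice would matter: padding at the start puts the first real observation in row 6, while padding at the end leaves it in row 1.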

Thanks in advance for your help.

Stephen Clark
