Statalist



Re: st: Stata, data processing, databases, and consultants


From   Phil Schumm <pschumm@uchicago.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Stata, data processing, databases, and consultants
Date   Fri, 7 Sep 2007 05:04:03 -0500

On Sep 6, 2007, at 9:15 PM, Buzz Burhans wrote:
> Is anyone using Stata for data processing and report generating like this, where the data repositories are small local databases? Is it foolish not to convert such a system entirely to a database program (and move it all away from Stata?)?
> <snip>
>
> Has anyone used *.dta files as a database, or is this foolish and would it be much better to use real databases for the data repository?

We do this sort of thing all the time (i.e., use Stata to manage large amounts of data from different sources and to generate "reports"). Sometimes we do this in conjunction with an actual database, sometimes not. It depends entirely on the application.

With a good database and adequate programming skills, you can do anything as far as data management goes, so those who are most comfortable working this way will always advocate a database-centered solution. However, a database introduces a certain amount of overhead that may not always be necessary. Moreover, in my experience, relatively few people are proficient enough with SQL, or with a suitable programming language that can interface with their database, to use this strategy effectively. I've certainly seen plenty of examples where someone thought they needed a database and, once they had one, couldn't manage to do what they needed to do with the data. A good object-relational mapper (e.g., SQLAlchemy, a toolkit for Python) can help with this, but only if you are already comfortable working in another programming language.
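To give a rough sense of the overhead involved, here is a minimal sketch of a small "report" built the database-centered way. It uses Python's built-in sqlite3 module rather than SQLAlchemy so that it runs without extra packages. The tables (subject, measurement), the columns, and the data are all hypothetical and are not taken from the original question.

```python
import sqlite3

# Hypothetical schema: subjects at sites, each with repeated measurements.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (
        id   INTEGER PRIMARY KEY,
        site TEXT NOT NULL
    );
    CREATE TABLE measurement (
        subject_id INTEGER NOT NULL REFERENCES subject(id),
        value      REAL NOT NULL
    );
""")

# Made-up data, only so the query below has something to summarize.
conn.executemany("INSERT INTO subject (id, site) VALUES (?, ?)",
                 [(1, "A"), (2, "A"), (3, "B")])
conn.executemany("INSERT INTO measurement (subject_id, value) VALUES (?, ?)",
                 [(1, 10.0), (1, 14.0), (2, 12.0), (3, 20.0)])
conn.commit()


def site_report(connection):
    """Return (site, n_measurements, mean_value) rows, one per site."""
    query = """
        SELECT s.site, COUNT(m.value), AVG(m.value)
        FROM subject AS s
        JOIN measurement AS m ON m.subject_id = s.id
        GROUP BY s.site
        ORDER BY s.site
    """
    return connection.execute(query).fetchall()


report = site_report(conn)
for site, n, mean in report:
    print(f"{site}: n={n}, mean={mean:.1f}")
```

Even this small case needs a schema, load statements, and a join with aggregation in SQL. In Stata, the comparable steps would be roughly a merge followed by a collapse on two .dta files, which is part of why the database route is not automatically the simpler one.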

As you know, a .dta file is not a database, nor, for that matter, is an Excel file (actually, I cringe whenever I hear of someone using Excel for data because of its penchant for auto-formatting and the inability to version or diff the files). Whether or not you need a database depends on things like the following:

1) do you need to provide distributed, real-time access (e.g., over the web) to the data?
2) do you need to integrate the data into a larger application or workflow?
3) do you need to provide concurrent access (especially write access)?
4) do you have a complicated data model which would benefit from a relational or object-oriented design?
5) do you need to store things that would be difficult (or impossible) to store in Stata (e.g., Unicode strings, graphics files, etc.)?
6) are you working with *very* large amounts of data?

If your answer to any of these questions is yes, then it's likely that you should be using a database. However, there are lots of data management applications (especially in the scientific community where I work) that don't meet these criteria, and for these a strictly Stata-based system is often very effective.
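The rule of thumb above (a yes to any of the six questions points toward a database) can be written down as a short checklist. This is only an illustrative sketch: the key names and the function are hypothetical labels for the six questions, not part of any tool.

```python
# Hypothetical short names for the six questions in the list above.
CRITERIA = (
    "distributed_realtime_access",
    "integration_with_larger_application",
    "concurrent_access",
    "complicated_data_model",
    "content_hard_to_store_in_stata",
    "very_large_data",
)


def database_reasons(answers):
    """Return the criteria answered yes, in the order of the list above.

    `answers` maps criterion names to booleans; missing names count as no.
    An empty result suggests a strictly Stata-based system may be enough.
    """
    unknown = set(answers) - set(CRITERIA)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return [name for name in CRITERIA if answers.get(name, False)]


example = {"concurrent_access": True, "very_large_data": False}
reasons = database_reasons(example)
if reasons:
    print("likely needs a database:", ", ".join(reasons))
else:
    print("a Stata-based system may suffice")
```

Returning the list of reasons, rather than a bare yes or no, keeps the deciding criteria visible, since the answer depends entirely on the application.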

Unfortunately, I am swamped right now with other projects, but if you want, contact me off-list and I might be able to provide a bit more help.


-- Phil
