
Re: st: Developing a Research Dataset

From   David Souther <[email protected]>
To   [email protected]
Subject   Re: st: Developing a Research Dataset
Date   Thu, 21 Jan 2010 11:28:25 -0600


There seem to be two issues here: first, dealing with multiple versions
of do-files within a research group, and second, keeping track of a
variable after it has been recoded any number of ways.

For the first issue, the trick is to implement some kind of
version-control system in which research team members can "check out"
code and you can review the revision history.  There are many free
options of this kind; you may want to take a look at Subversion,
Atlassian's hosted tools, or Google's free hosted implementation
(if you're okay with the code being read-only open source).
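As a minimal sketch of the idea (using git rather than Subversion, and
assuming git is installed; the file names are illustrative), a research
group could put its do-files under version control like this:

```shell
# Work in a scratch directory so the example is self-contained.
cd "$(mktemp -d)"
git init -q stata-project && cd stata-project

# Commit identity for this repository only (required before committing).
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

# Add a do-file and record the first revision.
echo "sysuse auto, clear" > clean_data.do
git add clean_data.do
git commit -q -m "Initial version of data-cleaning do-file"

# Later, after edits, commit again and review the history.
echo "egen price_categ = cut(price), group(5) label" >> clean_data.do
git commit -q -am "Add price categorization"
git log --oneline clean_data.do
```

Each team member clones the repository, and `git log` / `git diff`
recover exactly which version of a do-file produced a given derivative
dataset.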

For the second issue, I'm not sure exactly what "...and then do my
trims and recoding through a variety of format statements which I
could electively attach to the core variables" means in SAS-speak, but
it doesn't sound much different from keeping a raw dataset, running a
do-file or series of do-files that copy and manipulate the variable(s)
of interest, and then saving the dataset under a new name.  So for
your continuous variable that needs to be trimmed or recategorized in
many ways, you aren't deleting the original variable; the do-file
creates the extra, recoded versions of it, and you can drop them at
the end of the session if you'd like.  Since it sounds like you want
to avoid confusion among the research team about which derivative
variable to use, this is where good use of variable names, labels, and
notes comes into play, along with modifying your do-file to get rid of
unnecessary derivatives.
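For example (a hypothetical sketch; the variable names match the code
below, but the do-file name in the note is illustrative), labels and
notes can document each derivative directly in the dataset:

```stata
* Document each derivative so the team knows which one to use
label variable price_categ  "Price, 5 quantile groups (from price)"
label variable price_categ2 "Price, 5 quantile groups, domestic cars only"
notes price_categ2: derived from price in make_derivatives.do; foreign cars set to missing
* -describe- and -notes list- then show the provenance of every variable
```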

Finally, if the size of the dataset is the issue when creating many
derivatives of a variable, consider creating a sub-dataset and
sub-do-file of just those derivatives; later you can re-merge whatever
version of the variable you need for analysis (sort of like keeping
these variables in their own "table" and querying them out, in
relational-database-speak).

sysuse auto, clear
egen price_categ = cut(price), group(5) label
tab price_categ, gen(price_group)
clonevar price_categ2 = price_categ
replace price_categ2 = . if foreign==1
list price_cat* foreign
* create a record-linking ID in case you
* cannot link by some other variables later
generate id = _n

* save the price derivatives (plus id) in their own dataset,
* preserving the full data in memory
preserve
keep id price*            // assuming no other "price" vars in the data
save "auto_priceonly.dta", replace
restore

* save the main dataset without the price variables
drop price*
save "auto2.dta", replace

* later, to query out price_group2 only
* and merge it back into auto2.dta:

use auto2.dta, clear
merge 1:1 id using "auto_priceonly.dta", keepusing(price_group2)

On Wed, Jan 20, 2010 at 4:09 PM, Rob James <[email protected]> wrote:
> I work in a research group that is about to build a multiyear data
> structure to support research.  Historically, data management strategies
> have resulted in multiple derivative versions of key variables and
> concepts  - for example, imagine a continuous variable that is then
> variously categorized, variously trimmed, etc..  A whole cluster of
> derived variables results, but the underlying do files are not uniformly
> preserved.  This undermines the integrity of the resulting derivative
> datasets.  I think this is a pretty typical story.
> Clearly we are not alone in this challenge. In SAS I might generate a
> root set of variables, and then do my trims and recoding through a
> variety of format statements which I could electively  attach to the
> core variables. However, that concept doesn't quite fit STATA.
> Therefore, I'd invite suggestions on how you are managing this sort of
> data integrity/documentation problem within STATA environments.
> Thanks,
> Rob
> *
> *   For searches and help try:
> *
> *
> *
