
st: RE: replying to lists / keeping records: New - proposal for data documentation

From   "Allan Reese (Cefas)" <>
To   <>
Subject   st: RE: replying to lists / keeping records: New - proposal for data documentation
Date   Fri, 28 Nov 2008 11:15:14 -0000

If the consensus is that people are happy to pass replies generally via
the list, that's fine with me.  

Re Mandy's problem:
"... In step 3. above I replace the incorrect commands with the correct
ones, and then re-run the commands to update the data sets created ..."

"... it's helpful for me to get good habit of managing data sets and do ..."

And Kit Baum's comment:
"...  copy the wrong commands, comment  
them out, and correct their copies. That is, don't make all the  
corrections at the bottom, because that will only work if the  
incorrect variables are not used in any intermediate calculations."

A principle I've tried to propagate is to create an audit trail from
original data to final results.  This will never happen if you rely on
"the user" to keep notes of everything they do.  Fortunately, Stata has
some tools that can assist.  Every use# I make of Stata is logged -
without my having to activate it each time.  You can set "command
logging" on from a startup do-file, and command log files are not too big to
keep indefinitely.  Secondly, Stata will query overwriting a data file.
Mandy's approach seems to be that the original file is kept fixed, and
all subsequent changes are re-run every time.  If so, I'd use Kit's
advice and keep old code in the DO file but commented out.  However,
this method does make it harder to trace when a data value was changed.
When a change is meant to be permanent (i.e. not just for the single
analysis), I would save a new copy of the datafile with the date as part
of the name.  Obviously that wouldn't work if Mandy is accessing data
from an external database she doesn't control.
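
A minimal sketch of the dated-copy idea, assuming a reasonably recent
Stata (the date() call and %td format below need version 10 or later).
The file name "mydata" is hypothetical and only stands in for whatever
dataset is being corrected:

```stata
* Sketch only: "mydata" is a hypothetical file name.
* Build today's date as text, e.g. 2008-11-28.
local today = string(date(c(current_date), "DMY"), "%tdCCYY-NN-DD")

use "mydata.dta", clear

* ... make the permanent corrections here ...

* No -replace- option: Stata refuses to overwrite an existing file,
* so an earlier dated copy cannot be lost by accident.
save "mydata_`today'.dta"
```

Leaving out "replace" is deliberate: a second save on the same day stops
with an error rather than silently overwriting the earlier copy.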

# "Every use" except that copy&paste from Excel leaves no record except
that edit was invoked.  At that point I should probably put a comment in
the log to explain where the data came from.

Here's a startup do-file to create a new command log daily, and keep
full logs of the last two sessions:

* (only allows one Stata invocation) ...
local logdate = string( d(`c(current_date)'), "%dCY-N-D" )
cmdlog using "c:\program files\stata9\logfiles\log`logdate'.txt", append
* ... and working log file of output (to rename for keeping) 
capture erase c:\statalog.bak
capture copy c:\statalog.smcl c:\statalog.bak
log using c:\statalog, replace
set memory 10m
set matsize 800
noisily di "Logs running, memory 10Mb and matrix size maximum"

Apropos improving standards for documenting data processing.  Current IT
standards for metadata (from ISO etc) are almost impenetrable but focus
entirely on the syntax of files for exchange.  Data semantics are
utterly ignored.  For example, ISO requires that every dataset has a title,
but would be completely satisfied with a title "Why the **** should I
tell you what this is about?"  While fields may be closely defined in
specific domains, there is no requirement - or facility - to document
field usage within standards.  So your idea of sex may not be mine.  As
a result, files will be merged or mashed and the results will be
garbage.  You may have noticed that many high-profile, very expensive,
IT projects collapse when implemented.

A suggestion from myself and colleagues is therefore a much simpler
"standard" for documenting data so that it is human-usable.  It may also
be immediately compatible between computers, but that depends on the
file format and character coding conventions, etc.  Most data can be
reduced to a 2-way table (or sequence of linked tables).  We therefore
propose that a data table (think Stata dataset) should be documented
with a second table that describes the fields (=columns=variables) as a
Codebook, and a third table containing the discovery metadata as defined
in ISO standards.  It's very simple, not technical, but would encourage
more computer users to notice concepts like missing values and
procedures used when coding.
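
As an illustration only, one way to get a first draft of the second
table (one row per variable) out of Stata itself.  This is not the
proposal as written up in our notes, just a sketch assuming a Stata
version in which -describe- accepts the "replace" option.  It uses the
auto dataset shipped with Stata, and "auto_codebook" is a hypothetical
output name:

```stata
* Sketch only: turn the variable descriptions of a dataset into
* a dataset of their own, one observation per variable.
sysuse auto, clear

preserve
* Replaces the data in memory with name, type, format, value label
* and variable label for each variable.
describe, replace clear
list name type varlab in 1/5
save "auto_codebook.dta", replace    // hypothetical file name
restore
```

The draft still needs a human to add what Stata cannot know: units,
how missing values were coded, and the procedures used when coding.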

Happy to send the notes expanding these ideas to individuals.  Will
leave it to the list owners to say if it would be appropriate to post
them on the Stata list ;-)


