RE: st: Stata 11 data format

From   "Joseph Coveney" <>
To   <>
Subject   RE: st: Stata 11 data format
Date   Wed, 1 Jul 2009 11:42:22 +0900

Nick Cox wrote:

There are issues here at a variety of levels. One of the simplest is
that by default variable labels are used for display in several commands
for graphs and tables. Thus allowing what Markus wants could be only
accompanied by truncated variable label display in many contexts. 

It seems to me that the main effort might well be focused on writing
preprocessors to turn long variable labels from other packages' files to
Stata notes, but I'm not volunteering. 


Markus Hahn wrote:

Alan wrote:
> Variable and dataset labels still have a maximum length of 80 characters.
> I am not sure what Markus wants to put in the labels that is longer
> than 80 characters, but Stata's ability to put -notes- on variables
> and the dataset as a whole are what I would recommend.  Individual
> notes may be up to 67,784 characters long, and each variable and
> the dataset as a whole may have up to 9,999 notes.

As far as I know, other packages such as SPSS (or whatever it is called
now) do support longer labels for variables. The problem I see with the
limitation of 80 characters is that some data providers do not provide
native Stata data files. Converting data files, let's say from SPSS
format to Stata format, could lead to truncated variable labels if the
SPSS labels are longer than 80 characters. What's so annoying about this
is that sometimes the most interesting part of the label is at the end
and it is the end at which variable labels get truncated. I understand
that this is not Stata's problem per se. It may be the fault of the data
providers that create variable labels that are too long but still these
longer labels could contain valuable information. I don't see a reason
for Stata not having longer variable labels while value labels support
strings as long as 32,000 characters. If the problem is that Stata's
data format would not easily support longer variable labels (due to
performance or memory issues?), why not just save variable labels like

I've had experience with variable label truncation when converting SAS datasets
(256-character limit, I believe) to Stata datasets.  In all but one case that I
can recall, the sender was putting value label information into the variable
label, for example (fictional, for illustration), for a variable named CERVESS
the label would be 'Cerebral Artery -- 1 = L ICA 2 = R ICA 3 = L MCA Seg I . .
.'.  Often, the length results from including the kind of metadata that doesn't
really belong in a variable label, and wouldn't normally be put there except out
of habit or concern that the value labels would be somehow separated from the
dataset in transit.

In the one exception, the variable labels contained the "question" text from the
data-collection forms (the text of the items as shown on the survey instrument
or questionnaire).  I had Stat/Transfer convert the several dozen SAS
datasets to SAS programs + ASCII data files, and used the preprocessor approach
that Nick mentions.  It's not a major chore to prepare a do-file that -infile-s
the resulting SAS programs into Stata as string datasets and parses the LABEL
sections into do-files, directing the variable labels into -notes- associated
with the corresponding Stata variables.  This was in the days before Mata, and I
cut the text streams into 144-character chunks when bringing them in, and
re-assembled them into the -note-s via local macro variables.
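
For illustration only, here is a rough sketch of the same preprocessor idea
written in Python rather than as a do-file. It is not the do-file described
above. It assumes the SAS program holds its labels in ordinary LABEL statements
of the form name = 'text' or name = "text", that the variable names carry over
to Stata unchanged, and that the label text contains no ` or $ characters
(Stata would read those as macro references). The function names
parse_sas_labels and stata_commands are made up for this example.

```python
import re

STATA_LABEL_MAX = 80  # variable-label limit discussed in this thread

# A quoted SAS string; a doubled quote inside it stands for a literal quote.
_QUOTED = r"(?:'(?:[^']|'')*'|\"(?:[^\"]|\"\")*\")"
# LABEL keyword, one or more name = "text" pairs, then the closing semicolon.
_STATEMENT = re.compile(
    r"\blabel\b((?:\s*\w+\s*=\s*" + _QUOTED + r")+)\s*;", re.IGNORECASE
)
_PAIR = re.compile(r"(\w+)\s*=\s*(" + _QUOTED + r")")


def parse_sas_labels(sas_program):
    """Return {variable name: label text} from the LABEL statements."""
    labels = {}
    for statement in _STATEMENT.finditer(sas_program):
        for name, quoted in _PAIR.findall(statement.group(1)):
            quote = quoted[0]
            text = quoted[1:-1].replace(quote * 2, quote)
            labels[name] = " ".join(text.split())  # collapse line breaks
    return labels


def stata_commands(labels):
    """Return do-file lines: a truncated label, plus a note if it was cut."""
    lines = []
    for name, text in labels.items():
        short = text[:STATA_LABEL_MAX]
        # Compound double quotes let the label contain ordinary quotes.
        lines.append(f"label variable {name} `\"{short}\"'")
        if len(text) > STATA_LABEL_MAX:
            # Keep the full text as a note so nothing is lost.
            lines.append(f"notes {name}: {text}")
    return lines


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python sketch.py labels.sas > labels.do
    with open(sys.argv[1]) as handle:
        print("\n".join(stata_commands(parse_sas_labels(handle.read()))))
```

The resulting lines would be run as a do-file after the data are loaded. Only
labels longer than 80 characters get a note; the rest are applied as ordinary
variable labels.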

Joseph Coveney
