The Stata listserver

st: RE: RE: RE: RE: RE: RE: Re: RE: may not use time-series operators on string variables

From   Lee Sieswerda <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   st: RE: RE: RE: RE: RE: RE: Re: RE: may not use time-series operators on string variables
Date   Thu, 4 Jul 2002 18:11:10 -0400

I agree generally with Nick C. that the NOIR scheme is largely a
classificatory scheme with limited statistical applicability. In particular,
it seems primarily a useful way to convey the idea of precision of
measurement. Ironically, it is a classification scheme that suffers from the
same problem that it intends to illuminate. After all, many (or even most)
variables seem to approach the boundaries of the NOIR categories and blur
the differences between them.

Nevertheless, I think the NOIR classification, or something like it, could
provide some element of convenience in data analysis. Knowing at least
whether a variable is discrete or continuous allows programmers to write
Stata commands that are able to choose more intelligent defaults. 

For example, suppose you have two binary variables and you want to graph
their relationship. You would probably never want a graph with just four
points in the corners, which is what you'd get in Stata as the default
(graph var1 var2). Rather, you would almost always want a bar graph with the
mean of the one variable (i.e., the proportion or percentage of ones)
stratified by the two categories of the other. Not that Stata is incapable
of creating such a graph, or that it's a lot more typing in this simple case,
but one could imagine more complicated cases where the savings would be
greater. If you could specify the variables as nominal/discrete/factor, then
that information could be used by Stata programmers to select intelligent
defaults. (Not that one would want to be restricted to the new default -
see NJC on the dangers of dogmatism - but it would perhaps be more
convenient for the majority of cases.)  
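
As a rough sketch of the simple case: the data and variable names below are
made up for illustration, and the syntax is that of later Stata releases
(version 8 onward) rather than the -graph var1 var2- form mentioned above.

```stata
* Made-up data: two binary (0/1) variables; the names are illustrative only
clear
input smoker exposed
0 0
1 0
0 1
1 1
1 1
0 0
end

* Scatter plot: every observation lands on one of the four corner points
scatter smoker exposed

* Bar graph: mean of smoker (the proportion of ones) within each
* category of exposed
graph bar (mean) smoker, over(exposed) ytitle("Proportion with smoker == 1")
```

If both variables were declared as nominal, a graph command could in
principle pick the second form on its own.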

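The R language seems to have a good implementation of this idea: variables
can be declared as factors or ordered factors, and generic functions such as
plot() choose their behaviour accordingly.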

So while not a critical issue in any way, I see three advantages to
incorporating a statistically intelligent variable type characteristic:
1. it would be a convenient tool for programmers,
2. it almost certainly would reduce the overall amount of typing in the
Stata community, especially the use of the PageUp key, and
3. it might be a little less frustrating for beginners.

And while Nick C. may disparage 3D pie charts, I have found a good use for
them. Have you ever tried to explain a Nerf frisbee to someone overseas by
email? Well, 3D pie charts from silly corporate earnings reports provide the
perfect model. And because they sum to 100%, the perfection of the model is
completely unaffected by any overinflation (by, say, 4 billion US dollars)
of earnings.


Lee Sieswerda, Epidemiologist
Thunder Bay District Health Unit
999 Balmoral Street
Thunder Bay, Ontario
Canada  P7B 6E7
Tel: +1 (807) 625-5957
Fax: +1 (807) 623-2369
[email protected]

> -----Original Message-----
> From:	Nick Cox [SMTP:[email protected]]
> Sent:	Thursday, July 04, 2002 12:12 PM
> To:	[email protected]
> Subject:	st: RE: RE: RE: RE: RE: Re: RE: may not use time-series
> operators on string variables
> Hakon Finne
> > Stata has a number of data types (byte, int, long, float, double, str)
> > but no explicit syntactical elements for variable types (nominal,
> > ordinal, interval, ratio). For computational convenience, most of
> > Stata's statistical and graphing procedures only work if the variable
> > is stored as a non-string data type, even if some statistical concepts
> > in themselves do not require numerical values. But some procedures
> > (e.g. -tabulate-) work on strings as well and you might never suspect
> > there could be a problem.
> >
> > The example from the current thread: Time series data on the state of
> > a unit could be any of the four variable types. Events could then be
> > calculated as a change in state from one time to another, but if you
> > want to do this with time-series operators in Stata, the variable has
> > to be stored as a numerical data type.
> >
> > As long as there is no syntactical way to distinguish the four
> > variable types, there are other means. Value labels help translate
> > numbers to text for the reader as a compensation for having to convert
> > textual information to numerical form. The data management tools for
> > performing these conversions in Stata abound but perhaps someone could
> > think about drafting an FAQ or a tutorial on how to use them in the
> > context of variable types (e.g., "What do I do with my
> > categorical/nominal variables to make them work and display properly
> > in Stata?"). (There are some already, e.g. on pie charts.)
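
To illustrate the time-series example in the quoted text: a minimal sketch,
with made-up data and variable names, of converting a string state variable
to a labelled numeric one so that time-series operators can be applied.
-encode- assigns integer codes in alphabetical order of the strings and
attaches the strings as value labels.

```stata
* Made-up yearly data on the state of a single unit, stored as a string
clear
input year str8 state
1998 "open"
1999 "open"
2000 "closed"
2001 "closed"
2002 "open"
end

tsset year

* L.state would be refused here because state is a string variable.
* Convert to a labelled numeric variable first:
encode state, generate(statenum)

* An event is a change of state from the previous year; the first year has
* no predecessor, so event is left missing there
generate byte event = (statenum != L.statenum) if !missing(L.statenum)

list year state statenum event
```

Only equality of the codes is compared, so the sketch makes no assumption
that the states are ordered.
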
> Hakon's broadening of the question is very interesting. However,
> if he wants Stata to encapsulate, or even to show a little respect for,
> the nominal ... ratio scheme, then I have reservations.
>
> To focus on what I believe to be the main point, the fourfold distinction
> nominal/ordinal/interval/ratio (NOIR is a useful mnemonic for those who
> see a black side to all this) was first proposed by the psychologist
> S.S. Stevens in 1946, and revised intermittently in various small details
> before and, in terms of publications, after his death in 1973.
> But despite being based, supposedly, on mathematical criteria,
> it serves badly as a basis for modern data analysis.
> This has often been pointed out, for example, in discussions started
> by Velleman and Wilkinson (American Statistician, 1993-1995).
>
> What is frequently problematic, in my view, is that this scheme, which
> on one level is just classificatory terminology,
> is often associated with a set of dogmas (dogmata?) on what are supposedly
> valid methods to use with each data type (strictly, measurement scale).
>
> This matter seems highly tribal: many texts and courses in (e.g.)
> psychology or sociology make a great deal of it, but there are also
> equally numerate disciplines in which it appears to be little known and
> little used. In fact, it seems to feature less in the statistical
> literature (strict sense) than in literature in several disciplines
> applying statistics.
>
> Most mathematical statistics books make immensely more of the distinction
> between discrete and continuous variables, which cuts across this scheme.
> Even that distinction need not dictate all analyses. Population sizes may
> jump up or down in steps of 1, but in itself this is no inhibition to
> fitting curves based on differential calculus. Conversely, atmospheric
> temperature sounds like an obviously continuous variable, until you
> notice that in practice there is a resolution level with human-recorded
> values of 0.1 deg and that many observers prefer to write down even last
> digits (0(2)8) rather than odd (1(2)9), which seems somewhat at odds with
> the physics of heat.
>
> There are in my view several things wrong or incomplete with the NOIR
> scheme. This is not a full list.
>
> 1. The distinction interval/ratio is only rarely of importance. (However,
> I recently saw writers agonise in print over negative coefficients of
> variation for temperatures when the mean was below zero Celsius. This
> served as a reminder that "rarely" does not mean "never".)
>
> 2. The category ordinal covers a wide range of possibilities. Pure ranks
> (with no ties) have a very rich mathematical structure, while what might
> be called grades (e.g. "excellent" "good" ... "execrable") are very
> different in terms of what is usually appropriate either descriptively or
> in terms of modelling. Lumping those together as ordinal is not very
> helpful. Also, the principle that, when you and I grade (say) our
> favourite Stata commands, my "excellent" is distinct from my "good" is
> separate from the principle that my "good" is equal to your "good",
> which appears fundamental to proper ordinal analysis, yet is (a) probably
> dubious and (b) perhaps untestable.
>
> 3. The scheme predates most modern categorical data analysis. Many
> explanations miss the elementary but also fundamental flexibility which
> we have in _representing_ categorical data in different but equivalent
> ways. Thus while {"male", "female"} looks nominal, (sex == "female")
> yielding 1 or 0 is something we can quite happily take averages of or
> include in regression or other models, as is frequency of females. That
> example should be widely familiar, but the principle is more general. To
> put it another way, naive accounts suppose that variables are necessarily
> or inherently of a variable type, but this conflicts with much of what we
> know about scientific and statistical practice.
>
> 4. Many kinds of variables do not fit into the scheme easily, if at
> all. Variables measured on the circle or sphere as outcome space
> are one example. Perhaps even more widely used are scores based on
> the sum of many separate test items, as seen in education, medicine,
> psychology, etc., etc. Purists often doubt whether such scales are even
> ordinal, while universities and medics often act as if they were interval
> or even ratio scales.
>
> 5. Percents and proportions have special properties which lie
> outside the scheme.
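
Point 3 in the list above can be sketched in a few lines. The data are made
up, and the variable name female is only an illustrative choice; the
expression (sex == "female") is the one given in the quoted text.

```stata
* Made-up data: sex held as a string variable
clear
input str6 sex
"male"
"female"
"female"
"male"
"female"
end

* The logical expression evaluates to 1 when true and 0 when false
generate byte female = (sex == "female")

* The mean of a 0/1 indicator is the proportion of ones
summarize female

* The same indicator could enter a model as a predictor, e.g.
* regress y female   (y is hypothetical and not defined here)
```

The same information is now held in two equivalent representations, one
that looks nominal and one that can be averaged.
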
>
> As Hakon points out, Stata's distinctions between different types are
> based on how values are stored, a computing issue which may be of
> little or no direct concern to most people using statistical methods.
> Some statistical languages go much further than Stata in having
> variable types (or the equivalent in their terminology) such as
> factors and ordered factors, distinctions clearly based on statistical
> meaning.
>
> I don't know why Stata does not do this, and what exists now clearly
> does not rule out future features _permitting_ users to declare that their
> variables are of particular statistical type, but I can speculate:
>
> 1. It didn't seem a very interesting or important project, or there
> wasn't a consensus on the matter.
>
> 2. It was too difficult to implement without raising or causing more
> problems than it solved.
>
> 3. There is a Stata tradition of assuming that users know what
> they are doing, which might include breaking or stretching
> somebody else's rules. There is always of course a downside, or
> an arguable side. As a teacher, I have often wished that a variable
> produced by -encode- could never be used _as is_ as a variable in
> regression or correlation, and that Stata would send a "howler", Harry
> Potter style, to anyone who tried to do this. But it is also easy to
> imagine situations in which this could be very reasonable, as when
> grades "A" to "E" are mapped to 1/5. (Flipping to 5/1 is a matter of
> 6 - grade.)
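
The grades example at the end of the list above might look like this in
practice. This is only a sketch with made-up data; the names gradenum and
flipped are illustrative.

```stata
* Made-up data: letter grades held as a string variable
clear
input str1 grade
"A"
"C"
"E"
"B"
"D"
end

* -encode- numbers the distinct strings in alphabetical order, so with all
* five grades present A..E become 1..5, with the letters as value labels
encode grade, generate(gradenum)

* Reverse the direction so that A is highest (5) and E lowest (1)
generate byte flipped = 6 - gradenum

list grade gradenum flipped, nolabel
```

Whether the resulting codes are sensibly treated as numbers in a regression
or correlation remains a judgment for the analyst, as the quoted text warns.
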
>
> As for tutorials, I am currently writing something for the Stata
> Journal on numeric and string variables.
>
> On the specific matter of graphics, categorical data held as string
> variables might feature in graphs in two fundamentally different ways:
> a. They define classes, and the main concern is to show the
> associated frequencies.
> b. They define identifiers, and the main concern is to show
> these in a legend.
>
> Arguably, official Stata has neglected both kinds of graph.
> For example, neither -graph, bar- nor -graph, pie- is about showing
> frequencies directly, although each can be persuaded into
> doing that. The FAQ which Hakon alluded to at
> refers to various approaches, and more could be said. But what are
> programmers not providing? (No requests for three-dimensional pie charts
> will be entertained by the undersigned.)
>
> Nick
> [email protected]