
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: RE: RE: RE: RE: Re: RE: may not use time-series operators on string variables
Date: Thu, 4 Jul 2002 17:12:09 +0100

Hakon Finne

> Stata has a number of data types (byte, int, long, float, double, str)
> but no explicit syntactical elements for variable types (nominal,
> ordinal, interval, ratio). For computational convenience, most of
> Stata's statistical and graphing procedures only work if the variable
> is stored as a non-string data type, even if some statistical concepts
> in themselves do not require numerical values. But some procedures
> (e.g. -tabulate-) work on strings as well and you might never suspect
> there could be a problem.
>
> The example from the current thread: Time series data on the state of
> a unit could be any of the four variable types. Events could then be
> calculated as a change in state from one time to another, but if you
> want to do this with time-series operators in Stata, the variable has
> to be stored as a numerical data type.
>
> As long as there is no syntactical way to distinguish the four
> variable types, there are other means. Value labels help translate
> numbers to text for the reader as a compensation for having to convert
> textual information to numerical form. The data management tools for
> performing these conversions in Stata abound but perhaps someone could
> think about drafting an FAQ or a tutorial on how to use them in the
> context of variable types (e.g., "What do I do with my
> categorical/nominal variables to make them work and display properly
> in Stata?"). (There are some already, e.g. on pie charts.)

Hakon's broadening of the question is very interesting. However, if he
wants Stata to encapsulate, or even to show a little respect for, the
nominal ... ratio scheme, then I have reservations.

To focus on what I believe to be the main point, the fourfold
distinction nominal/ordinal/interval/ratio (NOIR is a useful mnemonic
for those who see a black side to all this) was first proposed by the
psychologist S.S. Stevens in 1946, and revised intermittently in
various small details before and, in terms of publications, after his
death in 1973.

http://www.nap.edu/books/0309022452/html/425.html

But despite being based, supposedly, on mathematical criteria, it
serves badly as a basis for modern data analysis. This has often been
pointed out, for example, in discussions started by Velleman and
Wilkinson (American Statistician, 1993-1995).

What is frequently problematic, in my view, is that this scheme, which
on one level is just classificatory terminology, is often associated
with a set of dogmas (dogmata?) on what are supposedly valid methods to
use with each data type (strictly, measurement scale).

This matter seems highly tribal: many texts and courses in (e.g.)
psychology or sociology make a great deal of it, but there are also
equally numerate disciplines in which it appears to be little known and
little used. In fact, it seems to feature less in the statistical
literature (strict sense) than in literature in several disciplines
applying statistics.

Most mathematical statistics books make immensely more of the
distinction between discrete and continuous variables, which cuts
across this scheme. Even that distinction need not dictate all
analyses. Population sizes may jump up or down in steps of 1, but in
itself this is no inhibition to fitting curves based on differential
calculus. Conversely, atmospheric temperature sounds like an obviously
continuous variable, until you notice that in practice there is a
resolution level with human-recorded values of 0.1 deg and that many
observers prefer to write down even last digits (0(2)8) rather than odd
(1(2)9), which seems somewhat at odds with the physics of heat.

There are in my view several things wrong or incomplete with the NOIR
scheme. This is not a full list.

1. The distinction interval/ratio is only rarely of importance.
(However, I recently saw writers agonise in print over negative
coefficients of variation for temperatures when the mean was below zero
Celsius. This served as a reminder that "rarely" does not mean
"never".)

2. The category ordinal covers a wide range of possibilities. Pure
ranks (with no ties) have a very rich mathematical structure, while
what might be called grades (e.g. "excellent" "good" ... "execrable")
are very different in terms of what is usually appropriate either
descriptively or in terms of modelling. Lumping those together as
ordinal is not very helpful. Also, the principle that, when you and I
grade (say) our favourite Stata commands, my "excellent" is distinct
from my "good" is separate from the principle that my "good" is equal
to your "good", which appears fundamental to proper ordinal analysis,
yet is (a) probably dubious and (b) perhaps untestable.

3. The scheme predates most modern categorical data analysis. Many
explanations miss the elementary but also fundamental flexibility which
we have in _representing_ categorical data in different but equivalent
ways. Thus while {"male", "female"} looks nominal, (sex == "female")
yielding 1 or 0 is something we can quite happily take averages of or
include in regression or other models, as is frequency of females. That
example should be widely familiar, but the principle is more general.
To put it another way, naive accounts suppose that variables are
necessarily or inherently of a variable type, but this conflicts with
much of what we know about scientific and statistical practice.

4. Many kinds of variables do not fit into the scheme easily, if at
all. Variables measured on the circle or sphere as outcome space are
one example. Perhaps even more widely used are scores based on the sum
of many separate test items, as seen in education, medicine,
psychology, etc., etc. Purists often doubt whether such scales are even
ordinal, while universities and medics often act as if they were
interval or even ratio scales.

5. Percents and proportions have special properties which lie outside
the scheme.

As Hakon points out, Stata's distinctions between different types are
based on how values are stored, a computing issue which may be of
little or no direct concern to most people using statistical methods.

Some statistical languages go much further than Stata in having
variable types (or the equivalent in their terminology) such as factors
and ordered factors, distinctions clearly based on statistical meaning.

I don't know why Stata does not do this, and what exists now clearly
does not rule out future features _permitting_ users to declare that
their variables are of particular statistical type, but I can
speculate:

1. It didn't seem a very interesting or important project, or there
wasn't a consensus on the matter.

2. It was too difficult to implement without raising or causing more
problems than it solved.

3. There is a Stata tradition of assuming that users know what they are
doing, which might include breaking or stretching somebody else's
rules.

There is always of course a downside, or an arguable side. As a
teacher, I have often wished that a variable produced by -encode- could
never be used _as is_ as a variable in regression or correlation, and
that Stata would send a "howler", Harry Potter style, to anyone who
tried to do this. But it is also easy to imagine situations in which
this could be very reasonable, as when grades "A" to "E" are mapped to
1/5. (Flipping to 5/1 is a matter of 6 - grade.)

As for tutorials, I am currently writing something for the Stata
Journal on numeric and string variables.

On the specific matter of graphics, categorical data held as string
variables might feature in graphs in two fundamentally different ways:

a. They define classes, and the main concern is to show the associated
frequencies.

b. They define identifiers, and the main concern is to show these in a
legend.

Arguably, official Stata has neglected both kinds of graph. For
example, neither -graph, bar- nor -graph, pie- is about showing
frequencies directly, although each can be persuaded into doing that.

The FAQ which Hakon alluded to at

http://www.stata.com/support/faqs/graphics/piechart.html

refers to various approaches, and more could be said.

But what are programmers not providing? (No requests for
three-dimensional pie charts will be entertained by the undersigned.)

Nick
n.j.cox@durham.ac.uk

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

