Statalist


Re: st: Help with destring


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Help with destring
Date   Mon, 16 Jul 2007 19:19:56 +0100

The postings in this thread have been edited slightly, principally to
follow -cmdname- and -functionname()- conventions.

Donald Spady asked

------------------------------------------------------------------------
I have data that are string data and want to convert them to numeric. I
would expect -destring- to do this but it doesn't. For example, here is
a table

. tab Worker_Type

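worker            |      Freq.     Percent        Cum.
------------------+-----------------------------------
Administration    |         12        9.16        9.16
Allied            |         44       33.59       42.75
Attending         |         26       19.85       62.60
Clerk             |          9        6.87       69.47
Med Student       |         17       12.98       82.44
Resident/Fellow   |         23       17.56      100.00
------------------+-----------------------------------
Total             |        131      100.00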

The values are all strings.

When I do

. destring worker, generate(nworker)

I get the message

"worker contains nonnumeric characters; no generate"

It seems to me that if I want to convert a string, that implicitly means
there are nonnumeric characters. What am I doing wrong? I am using
Stata 10.
------------------------------------------------------------------------

Joseph Coveney suggested

------------------------------------------------------------------------
Roughly stated, -destring- converts string numerals into numerical data.
It's not clear what you're expecting the numerical data version of worker to be, but it seems as if you want something like -encode-.
------------------------------------------------------------------------

Ted Anagnoson suggested

------------------------------------------------------------------------
When I have numbers that are strings, I almost always have good luck
with the -real()- function, rather than -encode-/-decode- or -destring-,
etc.
------------------------------------------------------------------------

It's not difficult to get confused on Stata facilities for string to
numeric conversion, or its inverse, numeric to string conversion. Here's
an attempt at a tutorial (which necessarily and deliberately leaves out
some details). In particular, Mata is not covered here.

There is a longer tutorial at

Cox, N.J. 2002. Speaking Stata: On numbers and strings. Stata Journal
2(3): 314--329.

It's a little out of date, as publication preceded the adoption of
-tostring- as an official command, but the main ideas are covered.

1. Stata functionality for conversion back and forth between numbers and
strings is provided as

* functions -real()- and -string()-

* commands -encode- and -decode-

* commands -destring- and -tostring-.

Trying to remember that these features are paired up will help you in
learning and using them.

2. The two functions can be used for isolated conversions on particular
numbers or strings. -real("42")- takes the string "42" and yields the
number 42 and -string(42)- does the opposite. Those examples show the
main principle. Applied to variables, we can feed either function a
variable name, and it will work on the values of that variable:

. sysuse auto
. gen smpg = string(mpg)
. gen nmpg = real(smpg)
. assert nmpg == mpg

With -assert-, no news is good news, and silence indicates consent. The
conversion back and forth between numeric and string has left the
contents of the variable exactly as they were. In fact, that experiment
could have been condensed:

. assert real(string(mpg)) == mpg

So far, so good. However, -mpg- is a very well behaved variable: its
values are all small integers, and it is never missing. Things are not
always so simple.

3. -real()- and -string()- are simple and direct, but use brute force
and have little built-in intelligence, like husbands, sons-in-law and so forth. Here are three examples to underline their limitations:

. di real("50%")
.

. di real("50,1")
.

. di string(123456789)
1.23e+08

Thus, a warning: both -real()- and -string()- can lose information
present in your data!

The examples with -real()- show that it will shrug its shoulders given
the presence of any character it regards as non-numeric, and so return
numeric missing as a result. Your idea of what is a non-numeric
character may well not match Stata's, but that is your problem! If
missing is not what you want, you need another solution, probably
involving -destring-, on which more in a moment.

The example with -string()- shows that -string()- has a default format.
A careful look at the help for -string()- reveals a two-argument
version, so that, for example,

. di string(123456789, "%12.0g")
123456789

is more likely to be what you want with a nine-digit integer. (Details:
Note that you need to use a numeric format, even though the result is a
string. Also, err on the generous side with formats. Neither "%9.0g" nor
"%10.0g" will replicate the last example, although "%11.0g" will.)
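
One way to see what -real()- would throw away is to count the
problem values before committing to a conversion. What follows is only a
sketch: the toy variables -price- and -big- are invented here for
illustration and are not part of the original thread.

```stata
* Hypothetical toy data, invented for illustration
clear
input str8 price long big
"50%"  123456789
"50,1" 123456789
"42"   123456789
end

* How many non-empty strings would real() turn into missing?
count if missing(real(price)) & !missing(price)

* Look at the offenders before deciding what to do about them
list price if missing(real(price)) & !missing(price)

* The two-argument string() keeps all nine digits, so the
* round trip back through real() is exact
gen sbig = string(big, "%12.0g")
assert real(sbig) == big
```

If the count is not zero, -real()- alone is not the right tool for that
variable.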

4. Early versions of Stata introduced two ways of holding text or string
information in variable form, indirectly via value labels and directly
as string variables. Historically, the indirect way came first, and
Stata continues to prefer it for various reasons. That is, the
presumption is that you should define some value labels, attach them to
one or more numeric variables, and thereafter get the best of both
worlds while using them for statistics, data management and graphics.

For example, consider the difference between

- inputting a numeric variable, say -report-, with values 1/5

- defining a set of value labels, say

. label define quality 1 "dire" 2 "poor" 3 "moderate" ///
4 "good" 5 "excellent"

- and linking the two

. label val report quality

and, on the other hand,

- inputting a string variable with values "dire" ... "excellent".

Let's remind ourselves of some advantages of the value label approach,
which at first sight may seem needlessly round-about.

* Once assigned, Stata tends to use value labels where natural, as in
tables, in the data editor, and on graphs. (This is not so much an
advantage, as a reassurance that you won't lose out on the textual
information in value labels wherever it is likely to be informative.)

* You need less typing in total if you enter the data yourself. (This
may seem laughable now for those readers who never type in data, but was
not so laughable in the early years of Stata.)

* It's more efficient to store variables as integers, especially small
integers, and an associated set of value labels, than as strings.
(Broadly speaking, this probably bites less than it did, but Stata
datasets may often be too large for your comfort, or your memory.)

* You can use the same value labels wherever they apply. So, a series of
variables recording yes/no answers, or a series of five-point attitude
answers, can be assigned just one set of value labels.

* Stata will pay attention to your ordering. The string values "dire" ... "excellent" sort to "dire" "excellent" "good" "moderate" "poor", not
an order likely to be useful, except to lexicographers. The numeric
values 1 ... 5 will, naturally, sort as expected.

* Stata can work with your numbering scheme when you want to, as when
you decide to ignore measurement pessimists and take means of ordered
scores. (However, there isn't protection from more foolish analyses,
especially with arbitrarily coded nominal variables.)

* Some Stata commands require numeric versions of categorical variables.
(This too bites less than it did.)

In contrast, most of the advantages of string variables are more
obvious.

* Inputting string variables directly is likely to be convenient, and
often easier than setting up value labels.

* String variables are easy to explain. Some users find value labels
more difficult to understand.

* String variables in general have fewer limits than value labels.

* String variables allow further string processing on substrings.

5. Be all that as it may, -encode- and -decode- offer commands for mapping

numeric variables with value labels attached

<->

string variables.

-decode- maps from numeric to string, and -encode- does the opposite.

This pair of commands offers by far the most straightforward ways of
working with non-numeric text for which there is, or there should be, a
numeric coding. Left to its own devices, -encode- will just define value
labels according to the alphabetic order of distinct values of your
string variable, but that behaviour can be overruled.
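
For a variable like the one in the original question, a sketch might
run as follows. The data are a made-up subset of the categories in the
table above, and the label name -wtype- and the ordering given to it
are assumptions chosen purely for illustration. Overruling the
alphabetic order relies on the -label()- option of -encode-, which uses
an existing set of value labels if one of that name is already defined.

```stata
* Hypothetical toy data based on the categories in the question
clear
input str16 worker
"Clerk"
"Allied"
"Med Student"
"Attending"
end

* Default: codes follow alphabetic order of the distinct strings
encode worker, generate(nworker)

* Alternative: define the coding first, then tell -encode- to use it
* (the ordering here is arbitrary, chosen only for illustration)
label define wtype 1 "Med Student" 2 "Clerk" 3 "Allied" 4 "Attending"
encode worker, generate(nworker2) label(wtype)

* -decode- inverts the mapping
decode nworker2, generate(worker2)
assert worker2 == worker
```

With -assert-, once again, no news is good news.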

6. -decode- and -encode- have long been present in Stata. Slowly but
surely, user experiences, particularly in interfacing with other
programs, led to the identification of other needs. Sometimes,
variables that should have been numeric were input by Stata, or by
users, as string variables. This created a need for what is now
-destring-. Less often, it was realised that variables that were numeric
would in fact be better off as string, so that they could be immunised
from the effects of numeric display formats. All-numeric identifiers and
other numbers that should always be shown without decimal points are examples. That created a need for what is now -tostring-.

The principle that most nearly ties -destring- and -tostring- together
is that they are used in attempts to reverse accidents or mistakes.

The real history was more complicated, as is usually the case. In
particular, the scope of -destring-, before adoption by Stata as an official command, waxed and waned with each fresh analysis of what it should be doing. (What the President of StataCorp put in was later taken out by two Vice-Presidents. I think there is an equation in that.) For those curious, some of the by-ways are hinted at in the FAQ

FAQ . . . . . . . . . . . . . . . . . . . . . . . . . The destring command
3/03 Why doesn't the destring command in Stata include
an encode option?
http://www.stata.com/support/faqs/data/destring.html

In fact, -destring- was originally written because when Stata's data
editor was introduced, some users started typing informative text in the
first few observations of the dataset, as is often done within spreadsheets. Using its "first impressions count" rule, the data editor created each such new column as a string variable. Deleting the first few rows and
putting the same information elsewhere (as variable labels, say) did not
change those variables' status as string. Thus a command was needed
that changed string variables with numeric content to numeric types.

My guess is that a much more frequent reason for -destring- being used
now is that variables are imported whole from files created by
spreadsheets or other programs. Stata's importing is conservative. Just
about any non-numeric character being present is enough for a variable
to be born as string. Whenever that is the wrong way around, some fix is
needed.
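
A sketch of such a fix, assuming made-up variables -share- and -amount-
that arrived as strings because of percent signs and thousands
separators. Note that stripping commas with -ignore()- is only
appropriate when the comma is a thousands separator, not a decimal
point.

```stata
* Hypothetical toy data, invented for illustration
clear
input str6 share str6 amount
"50%"   "1,234"
"7%"    "56"
"12.5%" "7,890"
end

* Strip the named characters, then convert what remains
destring share amount, generate(nshare namount) ignore("%,")

list share nshare amount namount
```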

-tostring- was first written because the underlying problem, inverting
-destring-, made sense. If an operation and its inverse both make sense,
programmers should provide functions or commands for both, even if they
cannot imagine uses for both. Users who really wanted -tostring-
materialised later, justifying that small piece of programming.

7. The details of -destring- and -tostring- can be looked up in their
help, but some broad features deserve comment.

* Both are designed to be as safe as possible, so that you should not
lose information accidentally.

* Both have bells and whistles for pre-processing, such as ignoring
particular characters with -destring- and using display formats with
-tostring-.

* If -destring- will not do what you want because your string variable
is really several numeric values, as in "1 2 3 4 5", consider using
-split- first. -split- has a -destring- option, by the way.

* Both -destring- and -tostring- can handle several variables at once.
If you wanted to -encode- or -decode- several variables at once, or
apply -real()- or -string()-, the most natural approach would be to do
that within a -foreach- loop.
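
The last two points can be sketched together. All variable names and
the label name -yesno- below are invented for illustration. Defining
the value labels first and passing them through -label()- is one way to
make sure that every variable in the loop gets the same coding.

```stata
* Hypothetical toy data, invented for illustration
clear
input str9 scores str3 answer1 str3 answer2 long id
"1 2 3" "yes" "no"  1001
"4 5 6" "no"  "no"  1002
"7 8 9" "yes" "yes" 1003
end

* One string holding several numbers: split on spaces and
* convert the pieces to numeric in one step
split scores, destring

* -encode- takes one variable at a time, so loop
label define yesno 0 "no" 1 "yes"
foreach v of varlist answer1 answer2 {
    encode `v', generate(n`v') label(yesno)
}

* An all-numeric identifier held as a string
tostring id, generate(sid)
```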

Nick
n.j.cox@durham.ac.uk