How do I remove leading or trailing zeros from string variables?
|
Title
|
|
Removing leading or trailing zeros from string variables
|
|
Authors
|
Nicholas J. Cox, Durham University, UK
Jamie Griffin, London School of Hygiene and Tropical Medicine, UK
|
|
Date
|
September 2005
|
Suppose you wish to remove leading or trailing zeros from a string
variable (or from a global or local macro). To be clear on terminology here,
a string may contain zeros in leading positions, such as
"0string"; in trailing positions, such as
"string00"; in both; or in some intermediate position,
such as "string000string". Here
string stands for any string containing characters other than zero
"0".
This problem is interesting because it provides an opportunity to consider the operation
of various string functions. When we say zeros, in principle we
could say any other character, except leading and trailing blanks
should be removed using
trim(),
ltrim(), or
rtrim().
To make the problem as general and as challenging as possible, assume that
we do not know in advance how many leading or trailing zeros there are and
zeros may also occur in intermediate positions.
However, let us recall what to do if we do know how many leading or trailing
zeros occur and zeros do not occur in intermediate positions. As we
said, you can use these solutions with some modification to remove other
characters. If we knew there was just one leading zero, and no other
zero, we could remove it by changing it to an empty string using the
subinstr()
function by typing either
. replace myvar = subinstr(myvar, "0", "", 1)
or
. replace myvar = subinstr(myvar, "0", "", .)
The last argument of subinstr() is just the number of instances to be
replaced, either the first or all of them, and has the same effect if at
most one zero is present. This solution also applies to at most one trailing
zero and to no others. However, subinstr() can remove characters
we want to keep.
Another line of attack is to use the
substr()
function. This works best if we know precisely how many characters to
remove. Thus typing
. replace myvar = substr(myvar, 2, .)
and
. replace myvar = substr(myvar, 1, length(myvar) − 1)
respectively would remove the first and last character only. We could make
that conditional on the character in question being "0", as in
. replace myvar = substr(myvar, 2, .) if substr(myvar,1,1) == "0"
and
. replace myvar = substr(myvar, 1, length(myvar) − 1) if
> substr(myvar,−1,1) == "0"
which for our problem would do no harm and possibly some good.
However, we have yet to solve the general problem posed above. One idea is
to keep removing zeros, so long as we find them, in a loop, as in
. quietly count if substr(myvar, 1, 1) == "0"
. while r(N) {
. replace myvar = substr(myvar, 2, .) if substr(myvar, 1, 1) == "0"
. count if substr(myvar, 1, 1) == "0"
. }
Here the result of
count is left
behind in r(N). If that is not zero, i.e., if r(N) is one or
more, Stata enters the while loop. We remove the
first zero whenever there is one and count again. Stata will exit
the loop as soon as r(N) becomes zero. A similar technique could be
used to remove an arbitrary number of trailing zeros, testing with
substr(myvar, −1, 1) == "0" and removing with substr(myvar,
1, length(myvar) − 1).
This flexes a few small programming muscles but otherwise is less attractive
than a better way that makes use of another function.
indexnot() finds
the position of the first character other than the one specified.
. replace myvar = substr(myvar, indexnot(myvar, "0"), .)
Note that if myvar did not contain leading zeros, indexnot()
would return 1 and myvar would just be replaced by itself. One way
to find trailing zeros with this approach is to
reverse() the string
before and after applying the same logic:
. replace myvar = reverse(substr(reverse(myvar), indexnot(reverse(myvar), "0"), .))
Cautious souls will want to see this done step by step:
. generate rev = reverse(myvar)
. replace rev = substr(rev, indexnot(rev, "0"), .)
. replace myvar = reverse(rev)
. drop rev
|