Statalist The Stata Listserver



st: RE: string variable problem


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: string variable problem
Date   Wed, 20 Jun 2007 20:53:50 +0100

Eric (Ric) Uslaner asked (edited slightly) 
 
> I have a string variable problem. One variable in a dataset is 
> composed of both the name of a country and the year of a survey, 
> such as:
> 
> Albania2002
> Albania2005
> Serbia&Montenegro2002
> Serbia&Montenegro2005
> 
> I want to drop the last four digits. (There is already a variable 
> called -year-.)  I figured out how to put the last four digits into 
> another variable with -substr()- but cannot figure out how to keep 
> country names (which vary in length) and drop the year. 

Solutions were suggested by Kit Baum, Pete Huckelba, Svend Juul, and 
Carole J. Wilson. 

Let's assume Ric's variable is called -id-. 

How do you try to solve a problem like this? 

Step 0: Diagnosis
=================

What kind of a problem is this? Will there be a straightforward 
solution using existing Stata functionality or will someone (perhaps 
you) need to write a program? 

This is a problem in data management with string variables. 
Ric's guess (presumably) was that it is not esoteric: no 
program should be needed and existing functionality should suffice. 
That was correct. Such a guess narrows it down to something documented 
in [U] (the User's Guide) or [D] (the Data-Management Reference Manual). 

Ric wants to omit the last 4 characters specifying year. 
The twist that gives the problem a little extra spin is the 
irregular length of the string. If the string were always (say) 
16 characters long then the solution would be immediate as 
-substr(id, 1, 12)-. 
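
As a small illustration of why a fixed cut-off is not enough here, the 
sketch below applies -substr()- with a hard-coded length to two of Ric's 
example values. The variable name -fixed- and the storage type -str30- 
are assumptions made for the example only. 

```stata
clear
input str30 id
"Albania2002"
"Serbia&Montenegro2002"
end

* Keep the first 7 characters: right for "Albania", wrong for longer names
gen fixed = substr(id, 1, 7)
list id fixed
```

The first observation comes out as "Albania", but the second is cut to 
"Serbia&", so the rule has to depend on the length of each value. 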

Step 1: -generate-?
===================

We just want to extract some of the characters from a string variable
and make a new variable, or equivalently to omit the other characters. 

That should suggest a -generate- command. Even though Ric 
wants to throw away some of the characters, using -replace- would be worse 
style. You might mess up your data if you get a -replace- wrong, or
(at least in other loosely related problems) you might change
your mind later and want to use what you just threw away. 

The immediate question is thus how to specify the rule for the right-hand side
in a -generate- solution. 

A further question is whether there are other ways to do it. 
We'll get to that in a moment. 

Step 2: functions? 
==================

I typically consider next whether any existing functions
will do the job. Functions fall into two classes, those you 
know you want to use and those you don't know you should use. 

What is crucial is that many jobs require two or more 
functions working together. 

I suspect that many people skim through the list of 
functions and are often then disappointed that nothing 
matches their problem exactly. There is no function 
that omits the last # characters. It would be easy 
enough for StataCorp to double the number of functions
by adding many more, including that one, but that would 
not really double versatility, just complexity. The 
toolkit philosophy is to provide tools that individually 
do one thing, but in conjunction can solve a larger
variety of problems. 

-substr()- and -length()-
-------------------------

Kit Baum and Svend Juul both suggested using -substr()- 
together with -length()-. Svend's solution is on
these lines: 
 
gen length = length(id)
gen newid = substr(id, 1, length - 4)

Kit's solution is on these lines: 

gen newid = substr(id, 1, length(id) - 4) 

These are really the same solution. Svend does
it in two steps, Kit in one. If you find Svend's 
solution clearer, go with it. The main cost is just another
variable that you probably don't care about 
otherwise. You could -drop- it once it has 
outlived its usefulness. 
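
Both versions can be tried on Ric's four example values. This is only 
a sketch: the names -len- and -newid2- are illustrative, and -str30- is 
an assumed storage type that is wide enough for these values. 

```stata
clear
input str30 id
"Albania2002"
"Albania2005"
"Serbia&Montenegro2002"
"Serbia&Montenegro2005"
end

* One step: keep everything except the last 4 characters
gen newid = substr(id, 1, length(id) - 4)

* Two steps: store the length first, then drop the helper variable
gen len = length(id)
gen newid2 = substr(id, 1, len - 4)
drop len

* The two routes should agree on every observation
assert newid == newid2
list id newid
```

Because -generate- is used rather than -replace-, the original -id- is 
still there to check the result against. 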

In more detail: 

* -length(strvar)- returns the length of the 
contents of a string variable -strvar-. 
(-length("strvar")- would give you the length
of the _name_ of the variable, in this case 6.) 

* -gen length = length(id)- gives a new 
variable. -id- has different lengths, 
as Ric pointed out, but this is not a problem. 
The value of -length- will (literally) vary 
accordingly. 

* -length(id)- is going to include all string
characters, including any leading and trailing 
blanks. If necessary, find out about -trim()-. 
Here's a give-away: 

gen newid = substr(trim(id), 1, length(trim(id)) - 4) 

takes care of such blanks also. 
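
The effect of blanks can be seen with a made-up value that has two 
leading and two trailing spaces. The value and the variable names 
-naive- and -trimmed- are hypothetical and not from Ric's data. 

```stata
clear
set obs 1
gen id = "  Albania2002  "

* Lengths before and after trimming leading and trailing blanks
gen len_raw = length(id)
gen len_trim = length(trim(id))

* Without trim(), the last 4 characters removed need not be the year
gen naive = substr(id, 1, length(id) - 4)

* With trim(), the year is what gets removed
gen trimmed = substr(trim(id), 1, length(trim(id)) - 4)
assert trimmed == "Albania"

list
```

Comparing -naive- and -trimmed- in the listing shows why it is worth 
checking for stray blanks before counting characters from the end. 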

Note that if Ric had wanted -year- as well, then 
that would be 

gen year = substr(id, length(id) - 3, 4)

with the same caveat about blanks. 
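
A sketch of the year extraction follows. Note that -substr()- returns 
a string, so a numeric year needs a conversion, here with -real()-. 
The names -yearstr-, -yearstr2- and -yearnum- are illustrative, chosen 
partly to avoid a clash with Ric's existing -year-. 

```stata
clear
input str30 id
"Albania2002"
"Serbia&Montenegro2005"
end

* Last four characters, as in the text (the result is a string)
gen yearstr = substr(id, length(id) - 3, 4)

* Same thing with a negative start, which counts from the end of the string
gen yearstr2 = substr(id, -4, 4)
assert yearstr == yearstr2

* Convert to numeric if the year is needed for arithmetic
gen yearnum = real(yearstr)
assert yearnum == 2002 in 1
assert yearnum == 2005 in 2

list
```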

-reverse()- and -substr()-
--------------------------

Pete Huckelba's solution was along these lines: 

gen newid = reverse(substr(reverse(id),5,.))

This is another "it takes two to tango" 
solution. If we -reverse()- a string, then 
the last character becomes the first. Chop
off the first four characters, previously
the last four, and then -reverse()- the
reversed string to get back to where you want. 

Possible problems with blanks would be dealt 
with in the same way: 

gen newid = reverse(substr(reverse(trim(id)),5,.))
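
The three stages can be watched one at a time with -display- on a 
single literal value, taken from Ric's examples: 

```stata
* Stage 1: reverse the string, so the year comes first
display reverse("Albania2002")

* Stage 2: keep everything from the 5th character on
display substr(reverse("Albania2002"), 5, .)

* Stage 3: reverse again to restore the original order
display reverse(substr(reverse("Albania2002"), 5, .))

* Check the final result
assert reverse(substr(reverse("Albania2002"), 5, .)) == "Albania"
```

This displays "2002ainablA", then "ainablA", then "Albania". 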

Step 3: Consider other commands
===============================

The problem is solved, but knowing about other
possible solutions is also worthwhile. 

Carole Wilson suggested the use of -egen, ends()-. 

Her solution is along these lines: 

If all years begin with "2":

egen newid = ends(id), punct(2) head

If you have dates from last century: 

egen newid = ends(id), punct(1) head

However, this solution is problematic. You 
are making an assumption that characters 
like "1" and "2" do not occur as part of 
country names. Now I am a geographer and 
some people expect me to know about such 
things, but I wouldn't want to rule out
such a possibility. More importantly, it 
seems very likely that some years in Ric's 
data begin with "1" and some with "2", 
and -egen, ends()- does not work especially
well with that situation. 
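
The difficulty can be sketched with hypothetical values (the 1990s 
years below are invented for illustration and are not from Ric's data). 
As I read the documentation, with -head- the result is whatever precedes 
the first occurrence of the -punct()- characters, or the whole string if 
they do not occur. 

```stata
clear
input str30 id
"Albania2002"
"Albania1998"
"Albania1992"
end

* Split at the first "2", keeping whatever precedes it
egen newid = ends(id), punct(2) head

* Compare each result with the intended "Albania"
list id newid
```

On that reading only the first value would be cleaned as intended: the 
second contains no "2" at all, and the third has its first "2" inside 
the year rather than at its start. 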

Much the same applies to -split-, which no 
one mentioned. It was not really designed
for Ric's kind of problem, and although 
it would be more help than nothing, 
a solution with functions seems to me much 
better.

Nick 
[email protected] 



