How do I split a string variable into parts?
|
Title
|
|
Splitting a string variable into parts
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
October 2001; minor revision February 2002; updated April 2003; updated August 2005
|
Question
I want to split a string variable. The variable case names court
cases, and I would like to have separate variables for plaintiff and
defendant. They are divided by "V" or "VS" or
"V." or "VS.". The number of words on either side of the
divide is not constant.
For example:
STAPF V US
BUNTEN V CUMBERLAND T ATTY
PIPER VS USA
MCCAMMON JR V US BOARD OF PAROLES
DOE VS. SUE GRABBIT RUNNE
Answer
If you have Stata 8 or later, the answer is to type
. split case, p(" V " " VS " " V. " " VS. ")
and you can bail out now, unless you are interested in how to solve
the problem from first
principles. If you have Stata 7, a previous version of
split is available from SSC, and you can bail out now, unless
you too want to keep reading.
This is a nice example of how a problem can be easy for people to specify.
The challenge is to translate it into Stata. For example, although
egen provides
several functions for subdividing string variables, this problem, like many
others, is best tackled by using the basic
string
functions.
We need first to find the position of " V " or " VS " or "
V. " or " VS." within case. Use the function
strpos(). (In Stata 8, this function was called index().)
strpos("string","chars") returns the
starting point of the first occurrence of "chars"
within "string", and strpos(strvar,
"chars") does the same for the values of a string variable
strvar. Thus strpos("frog toad", "o") is 3 because the first
occurrence of "o" starts at the 3rd character of "frog toad".
If there is no such occurrence, the result is 0. strpos("frog","a")
is 0, as "a" never occurs within "frog".
So, we look for the position of " V ",
. generate splitat = strpos(case," V ")
and, if that doesn't exist, for the position of " VS ",
. replace splitat = strpos(case," VS ") if splitat == 0
and, if that doesn't exist, for the position of " V. ",
. replace splitat = strpos(case," V. ") if splitat == 0
and, if that doesn't exist, for the position of " VS. ",
. replace splitat = strpos(case," VS. ") if splitat == 0
The string we search for includes surrounding spaces. Just searching for
"V" would catch any occurrence of "V" within the plaintiff’s
name, which we do not want. Also, there is an assumption here that we never
get, for example, " V " as part of a plaintiff’s or defendant’s name.
Now it would be worth checking that splitat is never 0:
. list case if splitat == 0
If this is true, we have a problem somewhere, perhaps a typo has occurred,
the dividing element was left out, or someone used lowercase. With a few
problems, it might be easiest to do small-scale surgery within the editor.
In problems with mixtures of case, the functions upper() and
lower() come in handy.
That aside, the first variable needed, plaintiff, is now in reach:
. generate str1 plaintiff = ""
. replace plaintiff = substr(case,1,splitat - 1)
plaintiff is the part of case up to but not including the
position at which we split. plaintiff is generated first as an empty
str1 variable. We can rely on Stata to work out the appropriate
string type when it replaces plaintiff by the desired string. For
example, an alternative would be
. generate str80 plaintiff = substr(case,1,splitat - 1)
to be followed later by
compress. (Your
string variables can be as large as str244.) The substr()
function has three arguments: the string, or string variable, from which we
copy a substring; the position of the start of the substring; and the length
of the substring to be copied. A period (.) as length means "keep
right on to the end of the string".
The second variable, defendant, is (nearly) the rest:
. generate str1 defendant = ""
. replace defendant = substr(case,splitat + 1,.)
Here the + 1 reflects that we don't want the leading space
in " V " or " VS " or " V. " or " VS. " with
which all values start. But we still want to strip the "V ", "VS
", "V. ", or "VS. ", and the way to do it is to look for
the first space that all these have.
. replace defendant = substr(defendant,strpos(defendant," ") + 1,.)
Let us look at the arguments to substr() more closely. Find the
position of the first space within defendant (by construction,
it’s the space following "V" or "VS" or "V. " or
"VS."). That is,
strpos(defendant, " ")
but we want to start after that, so we want as our starting position
strpos(defendant, " ") + 1
and we keep right on the end of the string. Below we now have
. l case plaintiff defendant
+-----------------------------------------------------------------------+
| case plaintiff defendant |
|-----------------------------------------------------------------------|
1. | STAPF V US STAPF US |
2. | BUNTEN V CUMBERLAND T ATTY BUNTEN CUMBERLAND T ATTY |
3. | PIPER VS USA PIPER USA |
4. | MCCAMMON JR V US BOARD OF PAROLES MCCAMMON JR US BOARD OF PAROLES |
5. | DOE VS. SUE GRABBIT RUNNE DOE SUE GRABBIT RUNNE |
+-----------------------------------------------------------------------+
In a more extensive dataset, we would check for names that were empty or
just contained spaces:
. l plaintiff defendant if trim(plaintiff) = "" | trim(defendant) == ""
Strings that are empty or contain one or more spaces all trim to empty.
Here is the code gathered in one place:
. generate splitat = strpos(case," V ")
. replace splitat = strpos(case," VS ") if splitat == 0
. replace splitat = strpos(case," V. ") if splitat == 0
. replace splitat = strpos(case," VS. ") if splitat == 0
. list case if splitat == 0
. generate str1 plaintiff = ""
. replace plaintiff = substr(case,1,splitat - 1)
. generate str1 defendant = ""
. replace defendant = substr(case,splitat + 1,.)
. replace defendant = substr(defendant,strpos(defendant," ") + 1,.)
. l case plaintiff defendant
. l plaintiff defendant if trim(plaintiff) == "" | trim(defendant) == ""
|