Highlights

  • Each string can hold up to two billion characters

  • All string functions work with them

  • All of Stata works with them

  • Read files into them, write them to files—Word documents, JPEG images, etc.

  • Plain text (ASCII) and binary, including BLOBs—binary large objects

You can now use Stata’s string variables to hold exceedingly long strings, including the entire contents of files, even binary files.

Say we have data on 500 patients stored in our Stata dataset patients.dta.

We have doctor notes stored in 500 other files with names like notes_17213.xyz, notes_18417.xyz, and so on. The number in the filename is the patient’s ID.

We have the variable patid containing the patient IDs.

We can read all 500 files into our dataset:

. generate strL notes = fileread("notes_" + string(patid) + ".xyz")
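If a notes file is missing, fileread() does not stop with an error; it returns a string that begins with "fileread() error". The following is a minimal sketch of a more defensive read, assuming the files sit in the current working directory. The helper variable fname and the %12.0f format (chosen so that long IDs are never rendered in scientific notation) are our own additions.

```stata
use patients, clear

* Build each filename once (fname is a hypothetical helper variable)
generate fname = "notes_" + string(patid, "%12.0f") + ".xyz"

* Read only the files that exist; other observations get an empty string
generate strL notes = fileread(fname) if fileexists(fname)

* Count reads that failed for some other reason
count if strpos(notes, "fileread() error") == 1
```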

Just as easily, we could re-create all 500 files.
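One way to do that is with filewrite(), which returns the number of bytes in the resulting file. This is a sketch only: the variable name nbytes is ours, and the third argument 1 asks filewrite() to overwrite existing files, so work in a scratch directory if the originals matter.

```stata
* Write each patient's notes to a file named after the patient ID.
* Third argument 1 = replace the file if it already exists.
generate long nbytes = filewrite("notes_" + string(patid) + ".xyz", notes, 1)

* filewrite() returns a negative number when a write fails
count if nbytes < 0
```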

Suppose we want to know which patients’ notes mention Diabetes Mellitus Type 1, which the doctor would have abbreviated as T1DM. We could type

. generate t1dm = ( strpos(notes, "T1DM") != 0 )

to create variable t1dm, which equals 1 if the notation T1DM appears in a patient’s notes and 0 otherwise.
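The doctor may not have been consistent about capitalization or abbreviation. Here is a hedged variant that matches either the abbreviation or the full phrase regardless of case. It assumes Stata 14 or later for strlower() (lower() plays the same role in Stata 13), and the names lnotes and t1dm_any are ours.

```stata
* Lowercase copy of the notes for case-insensitive matching
generate strL lnotes = strlower(notes)

* 1 if either spelling appears anywhere in the notes, 0 otherwise
generate byte t1dm_any = strpos(lnotes, "t1dm") > 0
replace t1dm_any = 1 if strpos(lnotes, "diabetes mellitus type 1") > 0

* The lowercase copy is no longer needed
drop lnotes
tabulate t1dm_any
```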

We could also type

. list glucose age weight if strpos(notes, "T1DM")

to list glucose level, age, and weight for the patients whose notes contain T1DM.

We could even type

. regress glucose age weight if strpos(notes, "T1DM")

to run a regression of glucose level on age and weight, restricted to those same patients.

Now for some details ...

What’s a string?

A string is a sequence of characters:

        Samuel Smith
        California
        U.K.

Strings can be stored in Stata datasets as string variables.

. webuse auto
(1978 Automobile Data)

. describe make

Variable      Storage  Display     Value
    name         type   format     label      Variable label
make            str18   %-18s                 Make and model

What does storage-type str18 mean?

The variable make is a str18 variable. It can contain strings up to 18 characters long. The strings are not all 18 characters long:

. list make in 1/2

     make
  1. AMC Concord
  2. AMC Pacer

All str18 means is that the variable cannot hold a string longer than 18 characters. Even that is unimportant because Stata automatically promotes str# variables to be longer when required:

. replace make = "Mercedes Benz Gullwing" in 1 
make was str18 now str22
(1 real change made)
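Promotion only ever widens a variable. To inspect the current storage type, or to shrink a variable back to the smallest type that fits, something like the following should work. It is a sketch that uses only built-in commands (compress, describe) and the `: type` extended macro function; the local macro name t is ours.

```stata
webuse auto, clear
replace make = "Mercedes Benz Gullwing" in 1

* Store the storage type in a local macro and show it
local t : type make
display "`t'"

* Restore the original value, then let compress pick the shortest str#
replace make = "AMC Concord" in 1
compress make
describe make
```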

The string-variable storage types are str1, str2, ..., str2045, and strL.

strL?

Think of it like this: after 2,045 comes L. The L stands for long. strL is pronounced sturl.

strL variables work just like str# variables:

. webuse auto, clear
(1978 Automobile Data)

. generate strL mymake = make 

. describe mymake

Variable      Storage   Display    Value
    name         type    format    label      Variable label
mymake           strL    %9s

. list mymake in 1/2

     mymake
  1. AMC Concord
  2. AMC Pacer

strL variables can be exceedingly long, but that is not required.

We can replace strL values just as we can replace str# values:

. replace mymake = "Mercedes Benz Gullwing" in 1
(1 real change made)

We can use string functions with strL variables just as we can with str# variables:

. generate len = strlen(mymake)

. generate strL first5 = substr(mymake, 1, 5)

. list mymake len first5 in 1/2

     mymake                   len   first5
  1. Mercedes Benz Gullwing    22   Merce
  2. AMC Pacer                  9   AMC P

We can even make tabulations:

. generate strL brand = word(mymake, 1)

. tabulate brand

      brand      Freq.     Percent        Cum.
        AMC          2        2.70        2.70
       Audi          2        2.70        5.41
        BMW          1        1.35        6.76
  (output omitted)
         VW          4        5.41       98.65
      Volvo          1        1.35      100.00
      Total         74      100.00

strLs can be binary!

strLs can hold binary strings. A binary string is, technically speaking, any string that contains binary 0. Here is a silly example:

. webuse auto, clear
(1978 Automobile Data)

. replace make = "a" + char(0) + "b" in 1
(make was str18 now strL)
(1 real change made)

. list make in 1
     make
  1. a\0b

list displays binary zeros as \0.

str# variables cannot contain binary 0. strL variables can.
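To find which observations hold binary strings in this sense, one option is to search for char(0) directly. This is a sketch, assuming that strpos() and strlen() treat the zero byte like any other character in a strL; the variable names hasnull and nbytes are ours.

```stata
webuse auto, clear
replace make = "a" + char(0) + "b" in 1

* 1 if the value contains a binary zero, 0 otherwise
generate byte hasnull = strpos(make, char(0)) > 0

* Length of each value, counting the zero byte
generate long nbytes = strlen(make)

list make hasnull nbytes in 1/2
```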

Tell me more

Read all about long strings and BLOBs in the manual entry.