[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
st: RE: Problem with embedded quotes in strings when using insheet - SOLVED using Mata
The following probably will be of interest only to Stata 9 users, since it
relies on Mata. I know of no other solution save writing a (non-portable)
(For the curious, this is not simply an arcane programming challenge. My
ultimate goal was to create a program for generating data tables from
multiple-response data that could be easily imported by Excel for charting and
reporting purposes. But I'll leave that discussion for another day.)
Rather than quote my original email, I'll lay out the relevant issues to help
explain why I took the Mata route. First and foremost, Mata does not suffer
from the 80 character limit on strings and string functions imposed by Stata
(though Stata/SE raises the limit to 244, that's still too short for my
purposes). The reason this is a serious limit goes back to my original issue.
To wit, Stata cannot properly input string data that contains quotes (not to be
confused with string data that are quoted, which Stata handles correctly).
Here are some particulars:
The file I'm trying to read is a tab-delimited file created by -outsheet-. It
contains a column of group identifiers, a column of variable names, a column of
variable and/or value labels, and one or more columns of numeric data. Because
the file is created by outsheet, the first row lists variable names. (I -xpose-
the data before -outsheet- to get the first few columns of names and labels).
You can imagine the file as a kind of crosstabulation.
All of this is done via an ado program that generates the data and writes the
file. A very useful feature of the program is that it permits the user to
append results from subsequent runs to the same file, rather than have to create
numerous separate files. This file is named by the user. It is not an
arbitrarily named scratch file.
Because the data are grouped, the append operation works by group. That is, the
new data for each group are appended to their own group so the file appears as
if it had been created in a single run. This is accomplished by simply sorting
the file by group while maintaining the order of rows within each group.
Here is the problem:
In order to sort the file, it is necessary to read the group identifier, which
means each row has to be parsed. As soon as Stata encounters a quote character,
however, it uses internal logic to parse the data into a string variable. In
general, variable or value labels can contain quotes, so I believe it is
impossible to read these data with Stata's -insheet- or -infile- commands.
Here is the Mata solution:
First, -outsheet- to a temporary file the data to be appended. Second, load the
previously created file into a matrix of strings (A), and the newly created file
into a second matrix of strings (B), using Mata's cat() function. Third, stack
these into a single column of a new matrix, C=A\B. Fourth, extract the group
identifier from that column using Mata's substr() function and copy it to
another column in the same matrix C. Fifth, create a third column of row
numbers in matrix C. Sixth, sort matrix C by group and, within group, by row.
Seventh, write out the stacked and sorted column to the previously created file,
overwriting the old contents.
I've omitted a number of programming details related to error checking and
efficiency, but the above are the significant steps. Those who have written ado
programs will note that nearly everything I accomplished could have been done in
Stata if not for the built-in limit on string length and/or (mis)handling of
quotes. It is unfortunate that it took Mata to do this, because it is not
available to anyone using an older release of Stata. On the other hand, as a C
programmer, I can't recommend Mata enough. It's actually better than sliced
* For searches and help try: