Re: st: Re: String variables over 244 in a dataset with two delimiters

From	Nick Cox
To	Statalist
Subject	Re: st: Re: String variables over 244 in a dataset with two delimiters
Date	Thu, 22 Sep 2011 14:35:31 +0100
What's implicit, I hope, is my guess that the best
strategy for Adam's specific problem is to separate out the long
variable, which can then be parsed on semi-colons and merged
back in somehow.

I am not keen on trying to write a program for Adam's mix of tabs
delimiting variables and semi-colons also being used within the
longest string.

Here is an nth-field program. It selects the nth field from each line
(record) of a text file and writes it to a new file. Asking for an nth
field that does not exist, or one that is empty, is not a problem; an
empty string is returned in each case. I can't guarantee that this copes
with all problems and would be pleased to hear of cleaner approaches.

*! NJC 1.0.0 22 Sept 2011
program nthfield
	version 9
	syntax anything(name=files) [, N(int 1) DELIMiter(str) ]

	gettoken data files : files
	gettoken field files : files
	if "`data'" == "" | "`field'" == "" | "`files'" != "" {
		di as err "syntax is: " ///
		as txt "nthfield {it:datafile fieldfile}"
		exit 198
	}

	confirm file "`data'"
	confirm new file "`field'"

	if "`delimiter'" == "" local sep = char(9)
	else local sep "`delimiter'"
	
	tempname in out
	file open `in' using "`data'", r
	file open `out' using "`field'", w
	file read `in' line
	
	* process one line at a time; -file read- sets r(eof) at end of file
	while r(eof) == 0 {
		mata : _nth("line", `n', "`sep'")
		file write `out' `"`line'"' _n
		file read `in' line
	}
	file close `in'
	file close `out'
end

version 9
mata :

void _nth(string scalar macname, scalar n, string scalar sep) {
	string rowvector fields
	string scalar nth
	scalar nf, nsep, j

	fields = tokens(st_local(macname), sep)
	nf = cols(fields)
	nth = ""

	if (sep == "") {
		if (n <= nf) nth = fields[n]
	}
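	else {
		// walk the tokens until the (n - 1)th delimiter has been passed
		j = nsep = 0
		while (nsep < (n - 1) & j < nf) {
			if (fields[++j] == sep) nsep++
		}
		// the next token is the nth field unless it is itself a delimiter
		if (j < nf) {
			if (fields[j + 1] != sep) nth = fields[j + 1]
		}
	}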

	st_local(macname, nth)
}

end
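
A usage sketch, not part of the program above. It assumes that -nthfield- and _nth() have already been defined in the current session. It rebuilds the small tab-delimited test.txt from my earlier post (quoted below) so that it can be run on its own. The output name third.txt is only an illustration and must not already exist, given -confirm new file-.

```stata
* rebuild the three-line test file (overwrites any existing test.txt);
* _tab writes a tab character
tempname fh
file open `fh' using "test.txt", write text replace
file write `fh' "1" _tab "2" _tab "frog toad" _n
file write `fh' "3" _tab "4" _tab "very long string indeed" _n
file write `fh' "5" _tab "6" _tab "Mata makes matters much more manageable" _n
file close `fh'

* the third tab-delimited field of each line goes to third.txt
nthfield test.txt third.txt, n(3)
type third.txt
```

With this file the third field is also the longest on every line, so third.txt should match longest.txt in the earlier example.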

/*
A field is part or all of a record. In general fields may be delimited
by spaces, in which case to Stata they are also words, or by some other
delimiter. -nthfield- assumes tab unless delimiter() is specified.

There are at least two possible problems here. One is that the n th
field may not exist in the sense that there are fewer fields in the
record. This is the sense in which "frog toad newt" has three fields
and its fourth and higher fields can only be returned as empty strings.

The other is that the field is defined but is implicitly empty. In that
case the field is not recorded as a token, as in

. mata : tokens("1;2;;4", ";")
       1   2   3   4   5   6
    +-------------------------+
  1 |  1   ;   2   ;   ;   4  |
    +-------------------------+

where the third field is empty. The main rule followed here is that the
n th field is found by finding the (n - 1)th delimiter and looking at
the next token, but we also need to be able to find the first field,
which is easy enough, and to cope if we do not find that delimiter or it
is itself the last token.
*/
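
The rule in the comment above can be checked directly on the "1;2;;4" example, without any files. This is only a sketch and assumes that _nth() has already been defined in the current session. The local macro name line is arbitrary.

```stata
* third field: implicitly empty (two delimiters in a row)
local line "1;2;;4"
mata : _nth("line", 3, ";")
display `"third field: [`line']"'

* fourth field: the token after the third delimiter
local line "1;2;;4"
mata : _nth("line", 4, ";")
display `"fourth field: [`line']"'

* seventh field: does not exist, as there are too few delimiters
local line "1;2;;4"
mata : _nth("line", 7, ";")
display `"seventh field: [`line']"'
```

If the logic is right, the third and seventh fields should come back as empty strings and the fourth as 4.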




On Thu, Sep 22, 2011 at 1:33 PM, Nick Cox wrote:
> I wouldn't expect a Google on "state file command" to be very helpful
> here, or even one with "Stata" substituted! In general, I start
> searches within Stata first, and within the Statalist archives second.
> Also, although there is much good introductory teaching material
> scattered over the internet I think it is rare to find good
> explanations of higher level Stata material outside the usual sources,
> Stata documentation, Stata Press publications, Stata Journal or the
> Statalist, although clearly there is no rule that stops people posting
> where they wish.
>
> I just looked at the help for -file- and can see a full explanation of
> the syntax and a worked example, so it is not clear what else you
> seek. I'd look at complete concrete programs using -file-. One such is
> -log2html- from SSC. A simpler one is -labvalclone- from -labutil- on
> SSC.
>
> A possibly helpful hint is to underline that -file- itself really does
> very little, although what it does is crucial. So there is nothing
> else to learn about -file- itself that is not documented (so far as I
> know). The nub of any problem is how to process each line of a file in
> the way that you want, which calls for quite different commands,
> sometimes but not always trivial.
>
> I wrote a couple of helper programs. One extracts the longest field
> from each line of a file and sends it somewhere else. Another takes a
> line of a file and chops it into fields of at most some maximum
> length. In each case a delimiter can be specified but defaults to tab.
> The longest field program is not smart about ties, and is not guaranteed
> to select the same field position on each line. I am working on a program to
> select the nth field, a trickier problem given that it may not exist
> or may be empty.
>
> Examples first. Know that tabs are delimiters in all these. A key
> message is embedded in the test file.
>
> . type test.txt
> 1       2       frog toad
> 3       4       very long string indeed
> 5       6       Mata makes matters much more manageable
>
> . longestfield test.txt longest.txt
>
> . type longest.txt
> frog toad
> very long string indeed
> Mata makes matters much more manageable
>
> . stringchop longest.txt chopped.txt, max(10)
>
> . type chopped.txt
> frog toad
> very long       string ind      eed
> Mata makes       matters m      uch more m      anageable
>
> *! NJC 1.0.0 22 Sept 2011
> program longestfield
>        version 9
>        syntax anything(name=files) [, DELIMiter(str) ]
>
>        gettoken data files : files
>        gettoken field files : files
>        if "`data'" == "" | "`field'" == "" | "`files'" != "" {
>                di as err "syntax is: " ///
>                as txt "longestfield {it:datafile fieldfile}"
>                exit 198
>        }
>
>        confirm file "`data'"
>        confirm new file "`field'"
>
>        if "`delimiter'" == "" local sep = char(9)
>        else local sep "`delimiter'"
>
>        tempname in out
>        file open `in' using "`data'", r
>        file open `out' using "`field'", w
>        file read `in' line
>
>        while r(eof) == 0 {
>                mata : _longest("line", "`sep'")
>                file write `out' `"`line'"' _n
>                file read `in' line
>        }
>        file close `in'
>        file close `out'
> end
>
> mata :
>
> void _longest(string scalar macname, string scalar sep) {
>        string rowvector fields
>        string scalar longest
>        scalar j, length
>        length = 0
>
>        fields = tokens(st_local(macname), sep)
>        for(j = 1; j <= cols(fields); j++) {
>                if (strlen(fields[j]) > length) {
>                        length = strlen(fields[j])
>                        longest = fields[j]
>                }
>        }
>
>        st_local(macname, longest)
> }
>
> end
>
> *! NJC 1.0.0 22 Sept 2011
> program stringchop
>        version 9
>        syntax anything(name=files) ///
>        [, MAXimum(numlist int >0 <245) DELIMiter(str) ]
>
>        gettoken data files : files
>        gettoken outfile files : files
>        if "`data'" == "" | "`outfile'" == "" | "`files'" != "" {
>                di as err "syntax is: " ///
>                as txt "stringchop {it:datafile outfile}"
>                exit 198
>        }
>
>        confirm file "`data'"
>        confirm new file "`outfile'"
>
>        if "`maximum'" == "" local max = 244
>        else local max "`maximum'"
>
>        if "`delimiter'" == "" local sep = char(9)
>        else local sep "`delimiter'"
>
>        tempname in out
>        file open `in' using "`data'", r
>        file open `out' using "`outfile'", w
>        file read `in' line
>
>        while r(eof) == 0 {
>                mata : _stringchop("line", "`sep'", `max')
>                file write `out' `"`line'"' _n
>                file read `in' line
>        }
>        file close `in'
>        file close `out'
> end
>
> mata :
>
> void _stringchop(string scalar macname, string scalar sep, scalar max) {
>        string scalar strin, strout
>        scalar j, strlength
>
>        strin = st_local(macname)
>        strlength = ceil(strlen(strin) / max)
>        strout = ""
>        for(j = 1; j < strlength; j++) {
>                strout = strout + substr(strin, 1 + (j - 1) * max, max) + sep
>        }
>        strout = strout + substr(strin, 1 + (strlength - 1) * max, .)
>
>        st_local(macname, strout)
> }
>
> end
>
>
> Nick
>
> On Thu, Sep 22, 2011 at 1:56 AM, Ozimek, Adam wrote:
>> How to correctly use the file command is not clear to me from the help file. Is there a longer online tutorial to be found? It is an unfortunately named command in that searching for "state file command" is not very helpful in google.
>>
>> There is more than one variable with a semi-colon in it, and so just replacing all semi-colons with tabs will cause a bit of confusion, so I'm guessing I need to step through the file one line at a time and use the file command rather than filefilter.
>>
>>
>> ________________________________________
>> From: Austin Nichols
>> Sent: Tuesday, September 20, 2011 10:21 AM
>> To: Statalist
>> Subject: Re: st: Re: String variables over 244 in a dataset with two delimiters
>>
>> Joseph Coveney:
>> Good answer.  If some substrings delimited by semicolons are greater
>> than 244 characters in length, and you want to keep all information,
>> you can also use -file- to step through the file one line at a time
>> and save bits of longer strings as separate variables, e.g. in 100
>> character chunks.
>>
>> On Mon, Sep 19, 2011 at 11:40 PM, Joseph Coveney wrote:
>>> Adam Ozimek wrote:
>>>
>>> I have a dataset that is tab delimited, and one of the variables is a string
>>> that can be over 244 characters. If I read this using insheet, or inputst, or I
>>> think anything else, it truncates this variable. However, there is an aspect of
>>> the string variable that I hope will let me get around this: it is delimited by
>>> semicolon. Is there a way to select one of the columns in a tab delimited
>>> dataset, and read in by parsing it as semi-colon delimited? Is there some
>>> other way to rescue the long variable without the truncation?
>>>
>>> --------------------------------------------------------------------------------
>>>
>>> There are a couple of ways to approach this problem, but I think that the most
>>> direct is to use Stata's -filefilter- command to convert semicolons to
>>> double-quote + tab + double-quotes, and then read the converted file in with
>>> -insheet-.  (To learn more about -filefilter-, see Stata's online help for the
>>> command or see its entry in the user manual.)
>>>
>>> Notes:
>>>
>>> 1. This assumes that your string column's contents are surrounded by
>>> double-quotation marks.  If not, then just convert the semicolons to tabs alone.
>>>
>>> 2. If your tab-delimited file has a header row (column names), then remember to
>>> insert a new name for your newly created column.  There are a couple of ways to
>>> do that, too, in Stata, but again -filefilter- might be the most direct.
>>>
>>> 3. Don't overwrite your original.  (I'm not sure that -filefilter- will even
>>> allow you to name <newfile> the same as <oldfile>, but if it does, don't do it.)
>>>
>>>
>>> 4. The converted file can be a temporary file by using -tempfile- in conjunction
>>> with -filefilter-.  This makes the project's intermediate-file-cleanup chores
>>> easier.
>>>
>>> Joseph Coveney
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/