Re: st: Re: String variables over 244 in a dataset with two delimiters


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Re: String variables over 244 in a dataset with two delimiters
Date   Thu, 22 Sep 2011 13:33:02 +0100

I wouldn't expect a Google on "state file command" to be very helpful
here, or even one with "Stata" substituted! In general, I start
searches within Stata first, and within the Statalist archives second.
Also, although there is much good introductory teaching material
scattered over the internet, I think it is rare to find good
explanations of higher-level Stata material outside the usual sources:
Stata documentation, Stata Press publications, the Stata Journal and
Statalist. Clearly, though, there is no rule that stops people posting
where they wish.

I just looked at the help for -file- and can see a full explanation of
the syntax and a worked example, so it is not clear what else you
seek. I'd look at complete concrete programs using -file-. One such is
-log2html- from SSC. A simpler one is -labvalclone- from -labutil- on
SSC.

A possibly helpful hint is to underline that -file- itself really does
very little, although what it does is crucial. So there is nothing
else to learn about -file- itself that is not documented (so far as I
know). The nub of any problem is how to process each line of a file in
the way that you want, which calls for quite different commands,
sometimes but not always trivial.
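
As a minimal sketch of that line-by-line pattern (the file names
input.txt and output.txt are hypothetical, and the processing step is
left as a placeholder), the skeleton around -file- is typically
something like this:

```stata
* Sketch only: copy a text file one line at a time.
* input.txt and output.txt are hypothetical names.
tempname in out
file open `in' using "input.txt", read text
file open `out' using "output.txt", write text replace
file read `in' line

while r(eof) == 0 {
	* each line arrives in the local macro named line;
	* whatever processing is needed goes here, before the write
	file write `out' `"`line'"' _n
	file read `in' line
}

file close `in'
file close `out'
```

Everything specific to a given problem goes inside the loop; the rest
is bookkeeping.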

I wrote a couple of helper programs. One extracts the longest field
from each line of a file and sends it somewhere else. Another takes a
line of a file and chops it into fields of at most some maximum
length. In each case a delimiter can be specified but defaults to tab.
The -longestfield- program is not smart about ties, and is not
guaranteed to select the same field on every line. I am working on a
program to select the nth field, a trickier problem given that it may
not exist or may be empty.

Examples first. Note that tabs are the delimiters in all of these. A
key message is embedded in the test file.

. type test.txt
1       2       frog toad
3       4       very long string indeed
5       6       Mata makes matters much more manageable

. longestfield test.txt longest.txt

. type longest.txt
frog toad
very long string indeed
Mata makes matters much more manageable

. stringchop longest.txt chopped.txt, max(10)

. type chopped.txt
frog toad
very long       string ind      eed
Mata makes       matters m      uch more m      anageable

*! NJC 1.0.0 22 Sept 2011
program longestfield
	version 9
	syntax anything(name=files) [, DELIMiter(str) ]

	gettoken data files : files
	gettoken field files : files
	if "`data'" == "" | "`field'" == "" | "`files'" != "" {
		di as err "syntax is: " ///
		as txt "longestfield {it:datafile fieldfile}"
		exit 198
	}

	confirm file "`data'"
	confirm new file "`field'"

	if "`delimiter'" == "" local sep = char(9)
	else local sep "`delimiter'"
	
	tempname in out
	file open `in' using "`data'", r
	file open `out' using "`field'", w
	file read `in' line
	
	while r(eof) == 0 {
		mata : _longest("line", "`sep'")
		file write `out' `"`line'"' _n
		file read `in' line
	}
	file close `in'
	file close `out'
end

mata :

void _longest(string scalar macname, string scalar sep) {
	string rowvector fields
	string scalar longest
	scalar j, length
	length = 0

	fields = tokens(st_local(macname), sep)
	for(j = 1; j <= cols(fields); j++) {
		if (strlen(fields[j]) > length) {
			length = strlen(fields[j])
			longest = fields[j]
		}
	}

	st_local(macname, longest)
}

end
	
*! NJC 1.0.0 22 Sept 2011
program stringchop
	version 9
	syntax anything(name=files) ///
	[, MAXimum(numlist int >0 <245) DELIMiter(str) ]

	gettoken data files : files
	gettoken outfile files : files
	if "`data'" == "" | "`outfile'" == "" | "`files'" != "" {
		di as err "syntax is: " ///
		as txt "stringchop {it:datafile outfile}"
		exit 198
	}

	confirm file "`data'"
	confirm new file "`outfile'"

	if "`maximum'" == "" local max = 244
	else local max "`maximum'"

	if "`delimiter'" == "" local sep = char(9)
	else local sep "`delimiter'"

	tempname in out
	file open `in' using "`data'", r
	file open `out' using "`outfile'", w
	file read `in' line
	
	while r(eof) == 0 {
		mata : _stringchop("line", "`sep'", `max')
		file write `out' `"`line'"' _n
		file read `in' line
	}
	file close `in'
	file close `out'
end

mata :

void _stringchop(string scalar macname, string scalar sep, scalar max) {
	string scalar strin, strout
	scalar j, strlength

	strin = st_local(macname)
	strlength = ceil(strlen(strin) / max)
	strout = ""
	for(j = 1; j < strlength; j++) {
		strout = strout + substr(strin, 1 + (j - 1) * max, max) + sep
	}
	strout = strout + substr(strin, 1 + (strlength - 1) * max, .)

	st_local(macname, strout)
}

end
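
One way the two programs might be chained for the original problem, as
a sketch only. It assumes that both programs are installed, that
mydata.txt is a hypothetical tab-delimited file without a header row,
that the long string is the longest field on every line, and that the
two output files do not already exist:

```stata
* Sketch only: mydata.txt, longest.txt and chopped.txt are hypothetical names.
* Pull out the longest field of each line.
longestfield mydata.txt longest.txt

* Split each extracted string into tab-separated chunks of at most 244 characters.
stringchop longest.txt chopped.txt, max(244)

* Read the chunks as separate string variables.
insheet using chopped.txt, tab nonames clear
```

With max(244) each chunk stays within the string length limit discussed
in this thread, so -insheet- should not need to truncate anything.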
	

Nick

On Thu, Sep 22, 2011 at 1:56 AM, Ozimek, Adam <Ozimek@econsult.com> wrote:
> How to correctly use the file command is not clear to me from the help file. Is there a longer online tutorial to be found? It is an unfortunately named command in that searching for "state file command" is not very helpful in Google.
>
> There is more than one variable with a semicolon in it, so just replacing all semicolons with tabs would cause a bit of confusion. I'm guessing I need to step through the file one line at a time and use the file command rather than filefilter.
>
>
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] On Behalf Of Austin Nichols [austinnichols@gmail.com]
> Sent: Tuesday, September 20, 2011 10:21 AM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Re: String variables over 244 in a dataset with two delimiters
>
> Joseph Coveney <jcoveney@bigplanet.com>:
> Good answer.  If some substrings delimited by semicolons are greater
> than 244 characters in length, and you want to keep all information,
> you can also use -file- to step through the file one line at a time
> and save bits of longer strings as separate variables, e.g. in 100
> character chunks.
>
> On Mon, Sep 19, 2011 at 11:40 PM, Joseph Coveney <jcoveney@bigplanet.com> wrote:
>> Adam Ozimek wrote:
>>
>> I have a dataset that is tab delimited, and one of the variables is a string
>> that can be over 244 characters. If I read this using insheet, or inputst, or I
>> think anything else, it truncates this variable. However, there is an aspect of
>> the string variable that I hope will let me get around this: it is delimited by
>> semicolon. Is there a way to select one of the columns in a tab delimited
>> dataset, and read in by parsing it as semi-colon delimited? Is there some
>> other way to rescue the long variable without the truncation?
>>
>> --------------------------------------------------------------------------------
>>
>> There are a couple of ways to approach this problem, but I think that the most
>> direct is to use Stata's -filefilter- command to convert semicolons to
>> double-quote + tab + double-quotes, and then read the converted file in with
>> -insheet-.  (To learn more about -filefilter-, see Stata's online help for the
>> command or see its entry in the user manual.)
>>
>> Notes:
>>
>> 1. This assumes that your string column's contents are surrounded by
>> double-quotation marks.  If not, then just convert the semicolons to tabs alone.
>>
>> 2. If your tab-delimited file has a header row (column names), then remember to
>> insert a new name for your newly created column.  There are a couple of ways to
>> do that, too, in Stata, but again -filefilter- might be the most direct.
>>
>> 3. Don't overwrite your original.  (I'm not sure that -filefilter- will even
>> allow you to name <newfile> the same as <oldfile>, but if it does, don't do it.)
>>
>>
>> 4. The converted file can be a temporary file by using -tempfile- in conjunction
>> with -filefilter-.  This makes the project's intermediate-file-cleanup chores
>> easier.
>>
>> Joseph Coveney
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>


