Re: st: insheet delimiter problem
"Ada Ma" <firstname.lastname@example.org>
Mon, 10 Nov 2008 13:11:27 +0000
Thanks a lot to both for the solutions you have suggested. I think
the -filefilter- command will be the easiest to implement given that
I'm on a Windoze system!
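
For anyone following the thread, here is a minimal sketch of the -filefilter- route as I understand it from Nick's suggestion. The file names rawdata.txt and cleandata.txt are hypothetical, and the final step assumes the double quote stood for an empty value, which only the data provider can confirm.

```stata
* Sketch only: rawdata.txt and cleandata.txt are hypothetical file names.

* Report which characters occur in the raw file
* (the double quote is ASCII 34, hex 22)
hexdump rawdata.txt, analyze

* Write a copy with every double quote (\Q in -filefilter- notation)
* replaced by @, which should not otherwise occur in the file
filefilter rawdata.txt cleandata.txt, from("\Q") to("@") replace

* Read the pipe-delimited copy
insheet using cleandata.txt, delimiter("|") clear

* Assumption: the double quote marked an empty value in code2
replace code2 = "" if code2 == "@"
```

The raw file is left untouched, and all the steps sit in the do-file, so nothing has to be edited by hand.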
On Mon, Nov 10, 2008 at 12:59 PM, Nick Cox <email@example.com> wrote:
> Utilities like sed are a good idea; as Neil says, they have been ported
> to Windows too (GNU project as well as the sources he cites).
> But check out -filefilter- in Stata.
> [D] filefilter . . . . . Convert ASCII text or binary patterns in a file
>     (help filefilter)
> FAQ . . . Malformed end-of-line
>     . . . J.
>     12/03 Why do I get rows of missing data when I use infile?
> SJ-8-2 pr0039 . Stata tip 60: Fast and easy changes to files with
>     . . . A. R.
>     Q2/08 SJ 8(2):290--292 (no
>     tip on how to make changes to a file using the filefilter command
> I would pre-process the file so that double quotes were edited to
> something else. The character @ is often a good candidate.
> You can check with -hexdump- which characters are used in the file.
> The FAQ and Stata Tip give detailed examples.
> Neil Shephard
> Ada Ma wrote:
>> Thanks for the reply. Here is an example I have created which is
>> close to what happened. The data should look like this:
>> epikey hrg code1 code2 code3
>> 1 A0123 D100 V123 K166
>> 2 A0125 D200 " G122
>> 3 B0101 D300 " C333
>> 4 B0122 D400 E002 V777
>> It is pipe delimited so in the text file it looks like this:
>> epikey|hrg|code1|code2|code3
>> 1|A0123|D100|V123|K166
>> 2|A0125|D200|"|G122
>> 3|B0101|D300|"|C333
>> 4|B0122|D400|E002|V777
>> When I specified the command as you stated above, i.e. specifying the
>> delim("|") option, Stata reads in this:
>> epikey hrg code1 code2
>> 1 A0123 D100 V123
>> 2 A0125 D200 |G1223|B0101|D300|
>> 4 B0122 D400 E002
>> So everything between the double quotes is treated as one string. Is
>> there any way to get around this without editing the txt file?
> Hmm, that is problematic, and not quite what I'd expect, but I can see
> clearly why it's happening. Stata sees the first double quote and
> assumes that it is encapsulating a string variable, and reads until it
> sees the next (closing) double quote, treating any pipes ("|") in
> between as part of the string.
> I'm not sure how to work around this in Stata, I'm afraid. You may gain
> some mileage writing a custom dictionary and using -infile-.
> Personally I would make a system call to the common *NIX-like command
> 'sed' to search and replace any instances of double-quotes. This has
> the advantage of being automated as the system call can be placed in
> your do-file (as opposed to manually opening the file in your text
> editor and doing the search and replace). At the same time it has the
> disadvantage of not being handled internally in Stata, making it
> somewhat less platform neutral (would probably work fine on Linux and
> Macs, but you'd need some trickery to call sed from a Cygwin
> installation under Windows; I've done it in the past, but can't quite
> remember the finer details). There may be a similar command (or indeed
> native version of sed) under M$-windows Command Prompt, but I'm not
> aware of it.
> Another option would be to ask the people who sent you the data to
> choose an alternative character/symbol/number for missing data (quite
> why they chose double-quotes in the first place is a mystery only they
> can answer, as it has the potential to mess things up, as you've found, by
> virtue of being the character used to encapsulate strings by many
> databases and software).
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
Health Economics Research Unit
University of Aberdeen, UK.
Tel: +44 (0) 1224 553863
Fax: +44 (0) 1224 550926