
Re: st: insheet delimiter problem

From   Neil Shephard <>
Subject   Re: st: insheet delimiter problem
Date   Mon, 10 Nov 2008 12:45:25 +0000

Ada Ma wrote:
> Thanks for the reply.  Here is an example I have created which is
> close to what happened.  The data should look like this:
> epikey  hrg    code1  code2  code3
> 1       A0123  D100   V123   K166
> 2       A0125  D200   "      G122
> 3       B0101  D300   "      C333
> 4       B0122  D400   E002   V777
> It is pipe delimited so in the text file it looks like this:
> epikey|hrg|code1|code2|code3
> 1|A0123|D100|V123|K166
> 2|A0125|D200|"|G122
> 3|B0101|D300|"|C333
> 4|B0122|D400|E002|V777
> When I specified the command as you stated above, i.e. specifying the
> delim("|") option, Stata reads in this:
> epikey  hrg    code1  code2               code3
> 1       A0123  D100   V123                K166
> 2       A0125  D200   |G1223|B0101|D300|  C333
> 4       B0122  D400   E002                V777
> So everything between the double quotes is treated as one string.  Is
> there any way to get around this without editing the txt file?
Hmm, that is problematic, and not quite what I'd expect, but I can see
clearly why it's happening.  Stata sees the first double quote and
assumes that it is encapsulating a string value, and reads until it
sees the next (closing) double quote, treating any pipes ("|"), and
the intervening line break, as part of the string.

I'm not sure how to work around this in Stata, I'm afraid.  You may gain
some mileage from writing a custom dictionary and using -infile-.

Personally I would make a system call to the common *NIX command
'sed' to search for and replace any instances of double quotes.  This has
the advantage of being automated, as the system call can be placed in
your do-file (as opposed to manually opening the file in your text
editor and doing the search and replace).  At the same time it has the
disadvantage of not being handled internally in Stata, making it
somewhat less platform neutral.  It would probably work fine on Linux and
Macs, but on Windows you'd need some trickery to call sed from a Cygwin
installation; I've done it in the past, but can't quite remember the
finer details.  There may be a similar command (or indeed a native
version of sed) for the Windows Command Prompt, but I'm not aware of it.
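
As a rough sketch of the sed idea (not tested against your real data), the
following recreates the example file from your message and strips every
double quote from it.  The file names episodes.txt and episodes_clean.txt
are made up for illustration, and I'm assuming that a field left empty is
an acceptable stand-in for missing data.

```shell
#!/bin/sh
# Recreate the pipe-delimited example from the thread.
# (episodes.txt is a hypothetical file name used only for this sketch.)
cat > episodes.txt <<'EOF'
epikey|hrg|code1|code2|code3
1|A0123|D100|V123|K166
2|A0125|D200|"|G122
3|B0101|D300|"|C333
4|B0122|D400|E002|V777
EOF

# Delete every double quote; a field that held only a quote becomes empty.
# The original file is left untouched and the result goes to a new file.
sed 's/"//g' episodes.txt > episodes_clean.txt

# Show the cleaned file.
cat episodes_clean.txt
```

From a do-file, the same sed line could presumably be issued through
-shell- (or -!-) before calling -insheet- with delim("|") on the cleaned
file, though I'd check the result on a small extract first.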

Another option would be to ask the people who sent you the data to
choose an alternative character/symbol/number for missing data.  Quite
why they chose double quotes in the first place is a mystery only they
can answer, as it has the potential to mess things up, as you've found,
by virtue of being the character used to encapsulate strings by many
databases and software packages.

Sorry I can't offer any more advice.


"We should make things as simple as possible, but not simpler" - Anon (not Albert Einstein)
