Thanks a lot to both for the solutions you have suggested.  I think
the -filefilter- command will be the easiest to implement given that
I'm on a Windoze system!
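
A minimal sketch of that approach, assuming the example from the quoted message below is saved as data.txt and the cleaned copy goes to data_clean.txt (both file names are hypothetical). It follows Nick's suggestion of changing the double quotes to @, which assumes @ does not already occur in the file; -hexdump- can be used to check that first. -insheet- is used because it is the command under discussion here; newer versions of Stata also offer -import delimited-.

```stata
* Sketch only: file names are hypothetical; the data are the example from this thread.
clear

* Recreate the pipe-delimited example so the sketch is self-contained.
* _char(34) writes a literal double quote.
tempname fh
file open `fh' using "data.txt", write text replace
file write `fh' "epikey|hrg|code1|code2|code3" _n
file write `fh' "1|A0123|D100|V123|K166" _n
file write `fh' "2|A0125|D200|" _char(34) "|G122" _n
file write `fh' "3|B0101|D300|" _char(34) "|C333" _n
file write `fh' "4|B0122|D400|E002|V777" _n
file close `fh'

* Report which characters occur in the raw file (look for " and @).
hexdump "data.txt", analyze tabulate

* Replace every double quote (\Q in filefilter notation) with @.
filefilter "data.txt" "data_clean.txt", from(\Q) to(@) replace

* Read the cleaned file, then turn the @ placeholder into missing.
insheet using "data_clean.txt", delimiter("|") clear
replace code2 = "" if code2 == "@"
list
```

Because -filefilter- writes a new file, the original text file is left untouched and the whole step can sit at the top of a do-file.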
Regards,
Ada
On Mon, Nov 10, 2008 at 12:59 PM, Nick Cox <[email protected]> wrote:
> Utilities like sed are a good idea; as Neil says, they have been ported
> to Windows too (GNU project as well as the sources he cites).
>
> But check out -filefilter- in Stata.
>
> [D]     filefilter  . . . . .  Convert ASCII text or binary patterns in a file
>        (help filefilter)
>
> FAQ     . . . . . . . . . . . . . . . . . . . . Malformed end-of-line sequence
>        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. Hassell
>        12/03   Why do I get rows of missing data when I use infile?
>                http://www.stata.com/support/faqs/data/miss_data.html
>
> SJ-8-2  pr0039  . Stata tip 60: Fast and easy changes to files with filefilter
>        . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  A. R. Riley
>        Q2/08   SJ 8(2):290--292                                 (no commands)
>        tip on how to make changes to a file using the
>        filefilter command
>
>
> I would pre-process the file so that double quotes were edited to
> something else. The character @ is often a good candidate.
>
> You can check with -hexdump- which characters are used in the file.
>
> The FAQ and Stata Tip give detailed examples.
>
> Nick
> [email protected]
>
> Neil Shephard
>
> Ada Ma wrote:
>> Thanks for the reply.  Here is an example I have created which is
>> close to what happened.  The data should look like this:
>>
>> epikey  hrg    code1  code2  code3
>> 1       A0123  D100   V123   K166
>> 2       A0125  D200   "      G122
>> 3       B0101  D300   "      C333
>> 4       B0122  D400   E002   V777
>>
>> It is pipe delimited so in the text file it looks like this:
>>
>> epikey|hrg|code1|code2|code3
>> 1|A0123|D100|V123|K166
>> 2|A0125|D200|"|G122
>> 3|B0101|D300|"|C333
>> 4|B0122|D400|E002|V777
>>
>> When I specified the command as you stated above, i.e. specifying the
>> delim("|") option, Stata reads in this:
>>
>> epikey  hrg    code1  code2               code3
>> 1       A0123  D100   V123                K166
>> 2       A0125  D200   |G1223|B0101|D300|  C333
>> 4       B0122  D400   E002                V777
>>
>> So everything between the double quotes is treated as one string.  Is
>> there any way to get around this without editing the txt file?
>>
>>
> Hmm, that is problematic, and not quite what I'd expect, but I can see
> clearly why it's happening.  Stata sees the first double quote and
> assumes that it is encapsulating a string variable, and reads until it
> sees the next (closing) double quote, treating any pipes ("|") as
> part of the string.
>
> I'm not sure how to work around this in Stata I'm afraid.  You may gain
> some mileage writing a custom dictionary and using -infile-.
>
> Personally I would make a system call to the common *NIX-like command
> 'sed' to search and replace any instances of double-quotes.  This has
> the advantage of being automated as the system call can be placed in
> your do-file (as opposed to manually opening the file in your text
> editor and doing the search and replace).  At the same time it has the
> disadvantage of not being handled internally in Stata, making it
> somewhat less platform neutral (would probably work fine on Linux and
> Macs, but you'd have to have some trickery to call sed under a Cygwin
> installation under Windows; I've done it in the past, but can't quite
> remember the finer details).  There may be a similar command  (or indeed
> native version of sed) under M$-windows Command Prompt, but I'm not
> aware of it.
>
> Another option would be to ask the people who sent you the data to
> choose an alternative character/symbol/number for missing data (quite
> why they chose double-quotes in the first place is a mystery only they
> can answer as it has the potential to mess things up, as you've found, by
> virtue of being the character used to encapsulate strings by many
> databases and software).
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
-- 
Ada Ma
Research Fellow
Health Economics Research Unit
University of Aberdeen, UK.
http://www.abdn.ac.uk/heru/
Tel: +44 (0) 1224 553863
Fax: +44 (0) 1224 550926