



Re: st: skipping rogue commas when importing csv file using -infile-


From   Rob Shaw <rob.shaw.uk@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: skipping rogue commas when importing csv file using -infile-
Date   Mon, 29 Oct 2012 15:31:02 +0000

Nick,

Thanks, this worked. The rows were >244 characters, so I imported them
as multiple string variables and then summed your gen nocommas
expression across all of them (for more than 4 variables I would have
used a foreach loop over the varlist).

For reference for others:

infix str var1 1-200 str var2 201-400 str var3 401-600 ///
    str var4 601-800 using testfile.csv
gen numcommas = length(var1) - length(subinstr(var1, ",", "", .)) ///
    + length(var2) - length(subinstr(var2, ",", "", .)) ///
    + length(var3) - length(subinstr(var3, ",", "", .)) ///
    + length(var4) - length(subinstr(var4, ",", "", .))
keep numcommas
list if numcommas != 59

This --list-- command then gives me the lines to skip when I use --infile--.
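The foreach version Rob mentions would accumulate the same count in a loop; a minimal sketch, assuming the chunks are named var1 through var4 as in his code:

```stata
* Sketch: build the comma count by looping over the string chunks
* rather than writing the summation out by hand.
gen numcommas = 0
foreach v of varlist var1-var4 {
    replace numcommas = numcommas + length(`v') - length(subinstr(`v', ",", "", .))
}
list if numcommas != 59
```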

Many thanks again
Rob

Nick Cox wrote:

I'd be tempted to read the whole thing in as one string variable and
process it within Stata.

I realise that there are limits on this, in terms of both storage
required and whether the beast will fit into str244. (But Mata may
help on the latter.)

If you can do that, the number of commas is just

gen nocommas = length(strvar) - length(subinstr(strvar, ",", "", .))
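Put together, the whole-line approach might look like the sketch below; the variable name strvar and the expected count of 3 commas per clean record (as in the example data quoted further down) are assumptions:

```stata
* Sketch: read each raw line as one string variable, count its
* commas, and list the observations whose count is off (3 per
* clean 4-field record in the example data).
infix str strvar 1-244 using testfile.csv, clear
gen nocommas = length(strvar) - length(subinstr(strvar, ",", "", .))
list if nocommas != 3
```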

Nick

On Fri, Oct 26, 2012 at 4:59 PM, Rob Shaw <rob.shaw.uk@gmail.com> wrote:
> Hi
>
> I'm importing (part of) a large text file into Stata using --infile--.
> The file is a csv.
>
> However, it seems that a small number of lines have a rogue extra
> comma in them, which then pushes all the data along by one
> variable. This happens not just for that line but for all subsequent
> lines as well!
>
> I'm not too bothered if I have to later drop or reprocess this
> individual line, but does anyone know if there is a way to stop
> it affecting all the lines afterwards as well?
>
> File example (with identical records in this example)
>
> ABC,DEF,GH,IJK
> ABC,DEF,GH,IJK
> ABC,DEF,G,H,IJK
> ABC,DEF,GH,IJK
> ABC,DEF,GH,IJK
>
> What I then get is for var1 is
>
> ABC
> ABC
> ABC
> IJK
> IJK
>
> and var2 is
>
> DEF
> DEF
> DEF
> ABC
> ABC
>
> etc
>
> Using hexdump, it seems that all the lines finish with \r\n, so if
> there is a way to use this to 'reset' at each line then that would work.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

