Re: st: importing quirky csv

From   James Sams
Subject   Re: st: importing quirky csv
Date   Fri, 25 Nov 2011 11:03:31 -0600

On Thursday, 24 November 2011 08:51:16 you wrote:
> I have a large number of large comma-separated text files that I am
> trying to import. "insheet" is not working; it imports the data, but
> many lines are missing. I think the reason is the file contains string
> fields that a) have embedded spaces, and b) are not enclosed in
> quotes.

I've run across this and found that, unless you want to write your own CSV
parser (which is trickier than you might think), you will have to work outside
of Stata. That said, it is easy to automate from a do-file. I've found Python's
csv module to be quite robust and able to write the files back out in a form
that Stata will happily read. The approach I took was to parse the entire
directory of CSV files first and then import the results into Stata. However,
if you want a script that you call for each file from within Stata, the Python
code should look something like this (assuming Python 2.7 and actual commas as
the separators; note that whitespace is significant in Python):

#!/usr/bin/env python
# make files readable for stata

import sys
import csv


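# Field separator; this assumes actual commas.
DELIMITER = ','


def reprocess(in_fn, out_fn):
    # Binary mode is what the Python 2.7 csv module expects.
    with open(in_fn, 'rb') as in_fd:
        with open(out_fn, 'wb') as out_fd:
            reader = csv.reader(in_fd, delimiter=DELIMITER)
            writer = csv.writer(out_fd, delimiter=DELIMITER)
            # Copy every parsed row to the cleaned output file.
            writer.writerows(reader)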

if __name__ == "__main__":
    reprocess(sys.argv[1], sys.argv[2])

and then in Stata:

local my_original_file "bad.csv"
tempfile good
* reprocess.py is a placeholder name for wherever you saved the script above
! python reprocess.py "`my_original_file'" "`good'"
insheet using "`good'", comma

I did write this on the fly, so there may be typos that I didn't catch, but it 
is based on code I've used previously that works reliably.
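
For the batch approach mentioned above (cleaning the whole directory first and
importing afterwards), a minimal sketch might look like the following. It is
not the code I actually ran: it assumes Python 3 rather than 2.7 (hence text
mode with newline='' instead of 'rb'/'wb'), and the function name
reprocess_dir and the two directory arguments are hypothetical. The output
directory should be different from the input directory, since each file is
rewritten while its original is being read.

```python
#!/usr/bin/env python3
# Batch variant: pass every .csv file in a directory through the csv module.
# Assumes Python 3; reprocess_dir and its arguments are illustrative names.

import csv
import os
import sys


def reprocess(in_fn, out_fn, delimiter=','):
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(in_fn, newline='') as in_fd, \
            open(out_fn, 'w', newline='') as out_fd:
        reader = csv.reader(in_fd, delimiter=delimiter)
        writer = csv.writer(out_fd, delimiter=delimiter)
        # Copy every parsed row to the cleaned output file.
        writer.writerows(reader)


def reprocess_dir(in_dir, out_dir):
    # Clean each .csv in in_dir into out_dir (which must differ from in_dir)
    # and return the list of cleaned file paths in sorted order.
    os.makedirs(out_dir, exist_ok=True)
    cleaned = []
    for name in sorted(os.listdir(in_dir)):
        if not name.lower().endswith('.csv'):
            continue  # skip anything that is not a csv file
        out_fn = os.path.join(out_dir, name)
        reprocess(os.path.join(in_dir, name), out_fn)
        cleaned.append(out_fn)
    return cleaned


if __name__ == "__main__":
    # Usage: script.py <input directory> <output directory>
    if len(sys.argv) == 3:
        for path in reprocess_dir(sys.argv[1], sys.argv[2]):
            print(path)
```

The cleaned files can then be read one at a time with insheet, without
shelling out to Python for each file.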

James Sams