



Re: st: Using the -copy- command to download google ngram data


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Using the -copy- command to download google ngram data
Date   Wed, 14 Dec 2011 13:36:05 -0500

Paul Madsen <paul.madsen@warrington.ufl.edu>:
You can write or obtain a web crawler that saves every zip file
reachable from a given URL, and shell out to it from Stata, but that
solution runs outside Stata and may be OS-dependent.  See, e.g.,
http://andreas-hess.info/programming/webcrawler/index.html
http://en.wikipedia.org/wiki/Wget

Note that "windows 64 bit" encompasses many disparate OS options.
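
As a minimal sketch of the "shell out to an external downloader" idea above, the Python script below fetches one URL to a local file and follows HTTP redirects along the way. It is only an illustration: the script name fetch.py, the function name download, and the 60-second timeout are assumptions made here, not anything from the thread, and whether it avoids the "unexpected end of file" error for the ngram files has not been checked.

```python
import shutil
import sys
import urllib.request


def download(url, dest, timeout=60):
    """Save the resource at url to dest and return the final URL.

    urllib.request.urlopen follows HTTP 301/302/303/307 redirects by
    default, so a "temporarily redirected" reply is handled for us.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest, "wb") as out:
        # Stream in chunks rather than reading the whole file into memory.
        shutil.copyfileobj(response, out)
        return response.geturl()


# Hypothetical command-line use: python fetch.py URL DEST
if __name__ == "__main__" and len(sys.argv) == 3:
    final_url = download(sys.argv[1], sys.argv[2])
    print("saved", sys.argv[2], "from", final_url)
```

Assuming Python is installed and on the system path, the script could be called from Stata with -shell-, passing the URL and the target file name (for example download.zip) as the two arguments. Like wget, this runs outside Stata, so the same OS-dependence caveat applies.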

On Wed, Dec 14, 2011 at 10:53 AM, Madsen,Paul
<paul.madsen@warrington.ufl.edu> wrote:
> Dear Statalist,
>
> I would like to download Google's ngram data using Stata's -copy- command. The data are located here: http://books.google.com/ngrams/datasets.
>
> I'm running Stata/SE 11.2 for Windows, 64-bit.
>
> Here's the relevant line of Stata code, which is intended to copy the zip file to a local directory and name it download.zip:
>
> copy http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-us-all-1gram-20090715-0.csv.zip download.zip
>
> The web address in the code was taken from the Google ngram website (by right-clicking the link to the file and pasting it into Stata).
>
> When I run this code, I get the error:
>
> file http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-us-all-1gram-20090715-0.csv.zip not found
> server says file temporarily redirected to http://v5.lscache6.c.bigcache.googleapis.com/books/ngrams/books/googlebooks-eng-us-all-1gram-20090715-0.csv.zip
>
> This looks like an issue on Google's end. If I copy the new file location from the error text and run the Stata code:
>
> copy http://v5.lscache6.c.bigcache.googleapis.com/books/ngrams/books/googlebooks-eng-us-all-1gram-20090715-0.csv.zip download.zip
>
> I get the error message "unexpected end of file." This problem is not isolated to the specific Google ngram file in the example code; I have tried several of them with the same result. I have also tested the code on a zip file from a different website, and it works there.
>
> It is hard for me to believe that Google's files would have some fundamental flaw that makes downloading them directly into Stata impossible. Can something be done in Stata to deal with such a problem (maybe using the -shell- command)?
>
> Thanks!
>
> Paul E. Madsen
> University of Florida
