Statalist The Stata Listserver



Re: st: collecting raw data from the web via browser automation


From   Phil Schumm <pschumm@uchicago.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: collecting raw data from the web via browser automation
Date   Mon, 22 May 2006 21:31:17 -0500

On May 22, 2006, at 5:52 PM, Kit Baum wrote:
On Mac OS X, either wget or curl will do what you want. I.e.

curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html

Perl is an excellent tool to grab web pages and turn them into text files (perhaps after stripping html tags). See a number of the scripts I have written in RePEc under software->RePEc team for examples (one, for instance, snarfs the AEA's XML data for the A.E.R. and turns it into RePEc templates).

To Kit's excellent answer, I would only add that Python is also a great tool for screen scraping. In fact, what you are proposing is a pretty common thing to do (I've done it occasionally myself, though not with search results from Google Scholar). Note also that Perl and Python (and probably also either wget or curl) come pre-installed under many OSes, and can also be easily installed under Windows.
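
For anyone who wants a starting point, here is a minimal sketch of the fetch-and-strip-tags approach in Python, using only the standard library. It is an illustration rather than a finished scraper: the names TextExtractor, strip_tags and fetch_text are made up for this example, the page is assumed to decode as UTF-8, and text fragments are joined with single spaces, which can split a word that has markup in the middle of it.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._chunks = []      # text fragments found so far
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url, encoding="utf-8"):
    """Download a page and return its text (encoding is an assumption)."""
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return strip_tags(html)
```

Calling fetch_text() on a URL plays the role of the curl command above followed by a tag-stripping step; the resulting string can be written to a text file and read into Stata from there.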


-- Phil




