Re: st: collecting raw data from the web via browser automation

From	Phil Schumm <[email protected]>
To	[email protected]
Subject	Re: st: collecting raw data from the web via browser automation
Date	Mon, 22 May 2006 21:31:17 -0500

On May 22, 2006, at 5:52 PM, Kit Baum wrote:

On Mac OS X, either wget or curl will do what you want. I.e.

curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html

Perl is an excellent tool to grab web pages and turn them into text files (perhaps after stripping html tags). See a number of the scripts I have written in RePEc under software->RePEc team for examples (one, for instance, snarfs the AEA's XML data for the A.E.R. and turns it into RePEc templates).

To Kit's excellent answer, I would only add that Python is also a great tool for screen scraping. In fact, what you are proposing is a pretty common thing to do (I've done it occasionally myself, though not with search results from Google Scholar). Note also that Perl and Python (and probably wget or curl as well) come pre-installed on many operating systems, and are easy to install on Windows.
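As a rough illustration of the fetch-and-strip-tags approach described above, here is a minimal Python sketch that uses only the standard library. The names TextExtractor, strip_tags, and fetch_text are made up for this example, and the UTF-8 default is an assumption; a real page may declare a different encoding, and messy HTML may need a more forgiving parser.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # non-empty text fragments, in document order
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_tags(html):
    """Return the text content of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_text(url, encoding="utf-8"):
    """Download a page and return its text (hypothetical helper).

    The encoding default is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return strip_tags(html)
```

Calling fetch_text() with the URL of the page you want and writing the result to a file gives roughly the same end product as the curl command plus a tag-stripping step; strip_tags() can also be applied to a page that curl or wget has already saved.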

-- Phil
