Statalist The Stata Listserver



Re: st: collecting raw data from the web via browser automation


From   "Michael Blasnik" <michael.blasnik@verizon.net>
To   <statalist@hsphsun2.harvard.edu>
Subject   Re: st: collecting raw data from the web via browser automation
Date   Mon, 22 May 2006 23:54:04 -0400

I'm not sure if any of these tools can actually solve the problem originally posted.

The example Kit gives accesses a static web page: one that already exists "as is" at a fixed address, and that you could just as easily copy to your local drive from within Stata itself (copy http:/.../...) and then parse as needed. Downloading that kind of data directly into Stata is easy, and I don't think that was the problem.
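
For comparison, the same "copy a static page to disk" step is about as short outside Stata. This is only a minimal Python sketch, assuming the page lives at a fixed address; the function name save_page and the URL in the usage comment are hypothetical, chosen for illustration.

```python
import shutil
import urllib.request


def save_page(url: str, dest: str) -> str:
    """Copy the document at `url` into the local file `dest`; return `dest`."""
    # urlopen returns a file-like response, so it can be streamed to disk
    # without holding the whole page in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Hypothetical usage (the URL is a placeholder, not a real data source):
#   save_page("http://example.com/some/static/page.html", "page.html")
```

The saved file can then be parsed by whatever tool is convenient, Stata included.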

I think what the original post asked for (and what I would be interested in as well) is a way to access web pages that are only created when an action is taken or a selection is made on a different web page, so there is no fixed web address that holds the data you want. I have thought about trying to use AutoIt or another scripting language to launch a browser, make selections on a web page, and then capture the resulting data, which typically appears in a new window.
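
One possibility that might avoid driving a browser at all, at least for some sites: if the selection page is an ordinary HTML form, the generated page is just the server's response to a form submission, and that submission can sometimes be built by hand. The Python sketch below only constructs such a request and does not send it. The address and the field names (state, year) are hypothetical; the real ones would have to be read from the form's HTML. Pages that depend on JavaScript, cookies, or a login session may well not work this way and would still need real browser automation.

```python
import urllib.parse
import urllib.request


def build_form_request(action_url, fields):
    """Build (but do not send) the POST request a browser would issue
    when a simple HTML form is submitted with the given field values."""
    # Encode the selections the same way a browser encodes form fields.
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        action_url,
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical form: the address and field names are assumptions.
request = build_form_request(
    "http://example.com/cgi-bin/query",
    {"state": "MA", "year": 2005},
)

# Sending it would look like this (left commented out on purpose):
#   with urllib.request.urlopen(request) as response:
#       html = response.read()
```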

Do any of the tools mentioned by Kit or Phil actually do this?

Michael Blasnik
michael.blasnik@verizon.net


----- Original Message ----- From: "Phil Schumm" <pschumm@uchicago.edu>
To: <statalist@hsphsun2.harvard.edu>
Sent: Monday, May 22, 2006 10:31 PM
Subject: Re: st: collecting raw data from the web via browser automation



On May 22, 2006, at 5:52 PM, Kit Baum wrote:
On Mac OS X, either wget or curl will do what you want. I.e.

curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html

Perl is an excellent tool to grab web pages and turn them into text files (perhaps after stripping html tags). See a number of the scripts I have written in RePEc under software->RePEc team for examples (one, for instance, snarfs the AEA's XML data for the A.E.R. and turns it into RePEc templates).

To Kit's excellent answer, I would only add that Python is also a great tool for screen scraping. In fact, what you are proposing is a pretty common thing to do (I've done it occasionally myself, though not with search results from Google Scholar). Note also that Perl and Python (and probably also either wget or curl) come pre-installed under many OSes, and can also be easily installed under Windows.
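
As a rough illustration of the "strip the html tags" step in Python, here is a minimal sketch using only the standard library. The names TextExtractor and strip_tags are made up for this example, and real pages usually need more care (tables, character encodings, malformed markup).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse every run of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
print(strip_tags(page))
```

The resulting plain text can be written to a file and read into Stata or any other package from there.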


-- Phil


