Statalist The Stata Listserver



st: collecting raw data from the web via browser automation


From: Kit Baum <baum@bc.edu>
To: statalist@hsphsun2.harvard.edu
Subject: st: collecting raw data from the web via browser automation
Date: Mon, 22 May 2006 18:52:00 -0400

Austin said

The trouble is this: the link to bibliographic data is not a static
page; it is generated on the fly, so Stata cannot -copy- to a local
file to -infile- the info. I will need a browser to browse to that
location, and then save the results. Does anyone have a freeware
solution to this problem? I have access to several varieties of
Windows and Unix/Linux, but no Mac OS options. What I am thinking is
that if there is a command line browser with the option to save the
page to disk, I can just invoke the page and save it with a single
line of code that begins with the -shell- command, and then infile it
with another that begins -infile-.

One thing to remember: if you can do it in Unix/Linux, you can always do it in Mac OS X, which is after all Unix with a Mac face.

On Mac OS X, either wget or curl will do what you want. For example:

curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html
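
The one-liner above can be wrapped in a small shell function that uses whichever of the two tools is installed. This is only a sketch: the name fetch_page is hypothetical, and it assumes a POSIX shell with curl or wget on the PATH.

```shell
# fetch_page URL OUTFILE: save the page at URL to OUTFILE.
# fetch_page is a hypothetical helper name, not part of curl or wget.
fetch_page() {
    url=$1
    out=$2
    if command -v curl >/dev/null 2>&1; then
        # -s: no progress meter; -S: still report errors;
        # -f: non-zero exit on HTTP errors; -L: follow redirects;
        # -o: write the page to the named file
        curl -sSfL -o "$out" "$url"
    elif command -v wget >/dev/null 2>&1; then
        # -q: quiet; -O: write the page to the named file
        wget -q -O "$out" "$url"
    else
        echo "fetch_page: neither curl nor wget found" >&2
        return 1
    fi
}
```

Called as fetch_page followed by the URL (in quotes) and austin.html, it leaves the page in austin.html. The same command line could presumably be issued from within Stata after -shell-, with the saved file then read by -infile-, which is the two-line approach Austin describes.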

Perl is an excellent tool for grabbing web pages and turning them into text files (perhaps after stripping HTML tags). See a number of the scripts I have written in RePEc under software->RePEc team for examples (one, for instance, snarfs the AEA's XML data for the A.E.R. and turns it into RePEc templates).
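
For a rough idea of what tag stripping involves, here is a minimal sketch using sed rather than Perl. It is not one of the RePEc scripts; the name strip_tags is hypothetical, and it makes a strong simplifying assumption: each tag opens and closes on the same line. Entities such as &amp; are left as they are.

```shell
# strip_tags [FILE...]: crude HTML-to-text filter (hypothetical helper).
# Reads the named files, or standard input if none are given.
strip_tags() {
    # 1st expression: delete every <...> span that opens and closes
    #                 on the same line.
    # 2nd expression: drop lines left empty or whitespace-only.
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$@"
}
```

For instance, strip_tags austin.html > austin.txt would write a plain-text version of the saved page. Pages with tags that span several lines, or with embedded scripts, would need a real parser (which is where Perl earns its keep).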

Kit Baum, Boston College Economics
http://ideas.repec.org/e/pba1.html





