Statalist The Stata Listserver


st: Re: collecting raw data from the web via browser automation

From   Kit Baum
Subject   st: Re: collecting raw data from the web via browser automation
Date   Tue, 23 May 2006 07:04:43 -0400

As a later post indicates, you can use Perl's LWP module for this or, as Phil suggests, Python. But when it comes down to it, Michael's suggestion below is far more useful:

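--cut here--
capt program drop _all
program goograb, rclass
    syntax , Name(string)
    * turn blanks into "+" so the name can go in a query string
    local name : subinstr local name " " "+", all
    local url " q=`name'&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search"
    * save the results page as a text file in the working directory
    copy "`url'" test.html, text replace
end
--cut here--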

goograb, name(blasnik michael)

writes the results page to test.html (the filename is hardcoded out of laziness; you could use a tempfile and then use -file- commands to snarf it and work with the contents).
Give -goograb- any other name and it will look for that person's stuff in Google Scholar.
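
A minimal sketch of the tempfile variant mentioned above, untested and with made-up names: the program name -goograb2- and its urlbase() option are assumptions for illustration only, with urlbase() standing in for whatever address should precede the q= query string. It copies the page to a temporary file, reads it line by line with -file read-, and returns the number of lines in r(lines).

```stata
capture program drop goograb2
program goograb2, rclass
    * goograb2 and urlbase() are hypothetical names, not part of the original
    syntax , Name(string) URLbase(string)
    * turn blanks into "+" so the name can go in a query string
    local name : subinstr local name " " "+", all
    tempfile page
    tempname fh
    * fetch the page into a temporary file instead of test.html
    copy "`urlbase'q=`name'" "`page'", text replace
    * walk through the file one line at a time
    local n = 0
    file open `fh' using "`page'", read text
    file read `fh' line
    while r(eof) == 0 {
        * the current line sits in local macro line; parsing would go here
        local n = `n' + 1
        file read `fh' line
    }
    file close `fh'
    return scalar lines = `n'
end
```

Called as -goograb2, name(blasnik michael) urlbase(...)- with the base address filled in, followed by -return list-, this should show r(lines). The temporary file is erased automatically when the program exits, so nothing is left in the working directory.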

Kit Baum, Boston College Economics

On May 23, 2006, at 2:33 AM, Michael wrote:

I'm not sure if any of these tools can actually solve the problem originally posed.

The example Kit gives shows accessing a static web page -- a page that
already exists "as is" and one you could also simply copy to your local
drive using Stata itself (copy http:/.../...) and then parse it as needed.
It's easy to download that data directly to Stata and I don't think that is
the problem.

I think what the original post asked for (and what I would be interested in
as well) is a way to access web pages that are only created when an action
is taken or selection is made on a different web page, so there is no
specific web address that holds the data you want. I have thought about
trying to use AutoIt or another scripting language to launch a browser,
make selections on a web page, and then capture the data that is spawned,
typically in a new window.

Do any of the tools mentioned by Kit or Phil actually do this?