
Re: st: collecting raw data from the web via browser automation

From	Phil Schumm <[email protected]>
To	[email protected]
Subject	Re: st: collecting raw data from the web via browser automation
Date	Tue, 23 May 2006 08:39:03 -0500

On May 22, 2006, at 10:54 PM, Michael Blasnik wrote:

I'm not sure if any of these tools can actually solve the problem originally posted.

Yes, they can. Both curl and wget support authentication, cookies, SSL, and the use of HTTP POST (in addition to GET) to submit a request. And with either Python or Perl, you can script an entire web session, including passing through multiple forms, with each subsequent request dependent on the result(s) returned from the last.
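As a rough illustration of the scripted-session idea, here is a minimal Python sketch using only the standard library. It keeps cookies between requests, submits a form via POST, and then makes a second request that depends on what the first one returned. The function names (make_session, form_post, run_session) and the next_url_from callback are made up for illustration; nothing here is specific to Google Scholar or to any particular site.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that keeps cookies between requests, plus its jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def form_post(url, fields):
    """Build a request carrying URL-encoded form fields.

    Supplying a body makes urllib send the request as POST rather than GET.
    """
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body)


def run_session(opener, login_url, credentials, next_url_from):
    """Submit a login form, then fetch a page chosen from the response.

    next_url_from is a hypothetical caller-supplied function that inspects
    the first page's text and returns the URL to request next.
    """
    with opener.open(form_post(login_url, credentials)) as resp:
        first_page = resp.read().decode("utf-8", errors="replace")
    # Cookies set by the first response are sent automatically here.
    with opener.open(next_url_from(first_page)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A call might look like run_session(opener, "https://example.org/login", {"user": "me", "password": "secret"}, pick_link), where the URL, the field names, and pick_link are all placeholders for whatever the real site requires.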

As a later post indicated, you can use Stata's -copy- to retrieve a page using GET (i.e., with the parameters encoded in the URL itself), and in this way initiate a search with Google Scholar. However, the URL in the original posting resulted from clicking on the "Import into..." link corresponding to a single item from the list of items returned by a search. I'm not sure how this selection would be made programmatically or, if the intention was to grab the information on all of the top n items, how those items would be collected (note that, depending on how large n is, they might be spread across multiple results pages because of the way results are batched). Moreover, the format of the data returned by the original URL depends on how your "Scholar Preferences" are set (i.e., which bibliographic format is selected), and these preferences are probably stored in a cookie. Finally, regardless of the export format chosen, you may still need to do some post-processing before reading the "data" into Stata. Thus, even though the initial search can be triggered with -copy-, one of the other suggested tools may well be necessary to complete the entire task (or at least to do so efficiently).
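To make the batching point concrete, here is a small Python sketch that builds one GET URL per results page for the top n items. The q parameter appears in the URL from the original posting; the start offset parameter and the batch size of ten are assumptions about how the results are paged, and result_page_urls is a made-up helper name.

```python
from urllib.parse import urlencode


def result_page_urls(base, query, n, page_size=10):
    """Return the GET URLs needed to cover the top n search results.

    Assumes (not confirmed here) that the server accepts a 'start'
    offset and returns page_size items per page.
    """
    urls = []
    for start in range(0, n, page_size):
        params = urlencode({"q": query, "start": start})
        urls.append(base + "?" + params)
    return urls
```

Asking for the top 25 items would therefore produce three URLs under these assumptions. Each could be fetched with -copy- or with any of the other tools, but selecting the individual export links on each page would still require parsing what comes back.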

On May 22, 2006, at 4:21 PM, Austin Nichols wrote:

Google Scholar has a nice way to set Preferences so that links to bibliographic info are generated in the search results, but I don't use BibTeX or EndNote or any of those things--I use Stata, and I want to automate the whole process of searching and saving those data (which look like http://scholar.google.com/scholar.bib?q=info:nmXVGJVxYjQJ:scholar.google.com/&output=citation by the way) and infiling them into Stata so I can have a nice database of articles made for me on any set of search terms I put in.

I meant to comment on this before, but forgot. As much as I love to see new uses for Stata, I would strongly urge you to look at one of the free programs available for managing BibTeX files (e.g., tkbibtex, BibTool, or Bibcursed (multi-platform); BibDesk (my personal favorite; OS X only); or BibEdit or BibDB (Windows only)). Also, many text editors provide tools for working with files in BibTeX format. As you know, you can export directly into BibTeX format from Scholar. Even if you don't actually use BibTeX when writing, these tools may permit you to accomplish what you need, and may suggest other things you hadn't thought of.
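If the exported records do still need to end up in Stata, the post-processing step could be as simple as flattening each BibTeX entry into a row of a tab-delimited file that Stata can read (for instance with -insheet-). The following Python sketch handles only simple entries whose field values contain no nested braces or embedded quotes; the function names are made up, and a dedicated BibTeX tool will be far more robust.

```python
import csv
import io
import re

# Matches the entry header, e.g. "@article{somekey,".
HEADER = re.compile(r"\s*@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches "name = {value}" or "name = "value"" without nested braces.
FIELD = re.compile(r'(\w+)\s*=\s*[{"]([^{}"]*)[}"]')


def bibtex_to_row(entry):
    """Turn one simple BibTeX entry into a dict of lower-cased fields."""
    head = HEADER.match(entry)
    if head is None:
        raise ValueError("not a BibTeX entry")
    row = {"type": head.group(1).lower(), "key": head.group(2)}
    for name, value in FIELD.findall(entry):
        # Collapse internal whitespace so each value stays on one line.
        row[name.lower()] = " ".join(value.split())
    return row


def rows_to_tsv(rows, columns):
    """Write the chosen columns as tab-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # drop fields not listed in columns
        restval="",             # leave missing fields empty
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Choosing the columns up front (say key, author, year, title) keeps the file rectangular even when entries carry different sets of fields.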

-- Phil


