Re: st: question : linking results of web queries to data

From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: question : linking results of web queries to data
Date   Fri, 17 Dec 2010 19:50:33 +0000


It looks like Laszlo is asking to grab the "short id" from this query tool and merge it onto a list of names or other information he already has in his database.  I'll leave the IRB issues that Peter mentions up to Laszlo; I assume he's just compiling a database of information from this publicly available website (though, if so, one approach you could always try is requesting the data from the website owner).

I don't know how you could drive this CGI search from within Stata (and Stata may not be the best tool for this), but there are a few options for using Stata to get the elements you need from these webpages (though, since I don't know exactly what information you need, I don't know which of these is best):

(1) if all you need is the "short id"s from this cgi query tool, you could just run 26 searches, one for each letter ("a", "b", ...), and then copy and paste the resulting lists of short ids to a local file

(2) you can get the same list of all the short id's linked to the authors' pages from this listing -- this avoids you having to use the cgi query tool repeatedly:

(3)  you can get the list of all the authors' webpages (which includes their short id) from:

For this option, you can automate extracting the information you need from these pages by using -copy- to get the file to your machine and -intext- (from SSC) to read it into Stata:

copy "" "index.txt", replace public

**you need -intext- from SSC**
cap which intext
if _rc ssc install intext, replace

**be patient, this can take a while-->
intext using "index.txt", g(v) length(100)
split v1, p(`"href=""')
split v12, p(`".html"')

**v121 should contain the short IDs of interest**
ds v121, not
drop `r(varlist)'
drop if mi(v121)

**get rid of extra cells with html tags**
foreach v in "<" ">" "/" {
	cap drop if index(v121, "`v'")
}

**now you've got a list of all the shortid's**
levelsof v121, loc(shortid)
foreach v in `shortid' {
	copy "`v'.html" "`v'.txt", replace public
	*< use -intext- and -split- to get the fields you need and clean them up>*
}

I'll leave the last steps up to you, but you should be able to follow the same process I used to get the list of short id's, and instead extract other fields from the authors' HTML pages (e.g., their firstname, lastname, webpage, email, citations, affiliations, etc).   Use -split- and other string functions (see -help string_functions-) to clean up your records.  Once you clean up each author's page, you can append them all together and then merge the appended file to your main dataset via the "short id."
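
To make the extract / append / merge steps a bit more concrete, here is a rough, self-contained sketch. Everything in it is made up for illustration: the short ids (pab1, pcd2, pef3), the variable -score-, and the "<h1>...</h1>" tag around the name are assumptions, not what the real author pages look like, so you would need to inspect the actual HTML to find the right markers. In the real workflow the -html- string would come from running -intext- on each downloaded page rather than being typed in.

```stata
* Hypothetical main dataset, keyed on the short id (all values made up)
clear
input str10 shortid double score
"pab1" 1.5
"pcd2" 2.5
"pef3" 3.5
end
tempfile main authors
save `main'

* Build one small dataset per author and stack them with -append-
local first = 1
foreach id in pab1 pcd2 {
	clear
	set obs 1
	gen str10 shortid = "`id'"
	* stand-in for a line read from the author's page (assumed layout)
	gen str60 html = "<h1>Author `id'</h1>"
	* keep the text between the opening and closing tag
	gen start = strpos(html, "<h1>") + 4
	gen stop  = strpos(html, "</h1>")
	gen name  = substr(html, start, stop - start)
	keep shortid name
	if !`first' {
		append using `authors'
	}
	save `authors', replace
	local first = 0
}

* Merge the stacked author records onto the main data via the short id
use `main', clear
merge 1:1 shortid using `authors'
tab _merge
```

Records in the main data with no matching author page (pef3 in this toy example) keep _merge==1, which is a handy check for names that the scrape missed.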

- Eric
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754

On Dec 17, 2010, at 10:50 AM, László Sándor wrote:

> Hi all,
> I need to query a website for some extra data that I would link to my
> existing one. I am using Stata 11.1 on Mac and Unix.
> My data has names of people, and I should query a site using CGI
> ( and collect a single
> string from the resulting pages into a new variable.
> I don't know enough about Perl (etc.) to simply write the right
> script, run it with -shell-, and get the data that way. I would
> appreciate any guidance (tools, examples) on how this could be done, I
> have not found this functionality in (and 'around') Stata so far.
> Thank you,
> Laszlo
> László Sándor
> PhD candidate in Economics
> Harvard University