Statalist The Stata Listserver


Re: st: import html

From   Alan Riley <>
Subject   Re: st: import html
Date   Wed, 14 Mar 2007 10:36:33 -0500

Rodrigo Martell is trying
to use Stata to process an HTML file to retrieve the links in it:
> I'm trying to get the list of links in this website as a list in a text file.
> I tried using -copy- to save it as a text file but, naturally, it saves it as messy HTML, and importing it into Stata fails because the strings are too long.
> I tried -insheet- with some delimiters but it doesn't seem to like this, giving an r(198) error.
> I found some HTML-related programs using -findit-, but none seem to relate to reading in HTML, just spitting out HTML.
> Does anyone know a clever way to do this in Stata? I'm running out of ideas; I might have to look into writing a console application that does it so that my program can shell out to it when it needs to fetch the list of links as a text file.

There are multiple ways to approach this in Stata.  Since Rodrigo has
a specific HTML file in mind with a specific format, it will be easier
to solve this problem than to try to write a general-purpose HTML
link finder (I would probably use Mata for that since it can handle
strings of any length).
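As a rough illustration of that Mata route, a general-purpose link finder might look something like the sketch below. This is hedged, untested code, not the solution given later in this post; `links.html` is a placeholder filename for a local copy of the page.

```stata
mata:
tag = `"<a href=""'
// cat() reads the file into a string column vector, one line per
// element; Mata strings are not limited the way str variables are.
lines = cat("links.html")                 // placeholder filename
for (i = 1; i <= rows(lines); i++) {
    s = lines[i]
    p = strpos(s, tag)
    while (p > 0) {                       // a line may hold several links
        s = substr(s, p + strlen(tag), .)
        q = strpos(s, `"""')              // closing quote of the href
        if (q == 0) break
        printf("%s\n", substr(s, 1, q - 1))
        p = strpos(s, tag)
    }
}
end
```

Because Mata loops over the raw text itself, this approach would not need the -filefilter- line-splitting step used below.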

The biggest problem with the file at the URL Rodrigo provided is
that most of it is in one very long line which is difficult for
Stata to manage.  The -filefilter- command, however, can be used
to pass through the file and give us some line breaks.  Then we
need to read each line into Stata, keep those with links, process
out extraneous characters on the line, and we should be left
with all the links in the file.

I include commented code below to do just that for this particular file.

version 9
set more off
drop _all

local toget ""
tempfile tmpone
tempfile tmptwo

// Get a copy of the file at the URL
copy "`toget'" "`tmpone'", replace

// Run the downloaded file through -filefilter-, changing every
// greater-than character in it to a greater-than character followed
// by a newline character.  This effectively makes sure that every
// HTML tag in the file is on a line by itself.
filefilter "`tmpone'" "`tmptwo'", from(">") to(">\n")

// We are lucky that with this specific URL, after adding the
// newlines above, no single line is greater than the limit
// on string length in Stata.  So, we can read each line as
// the observation of a string variable.
infix str line 1-244 using "`tmptwo'"

// Look for lines containing   <a href="  since these will be
// the lines with links.  Keep only those lines, then strip
// everything on the line up to and including  <a href="
// so that we are left with the link itself followed by the
// close of the 'a href' tag.
gen hrefpos = strpos(line,`"<a href=""')
keep if hrefpos > 0
replace hrefpos = hrefpos + length(`"<a href=""') 
replace line = substr(line,hrefpos,.)
drop hrefpos

// Get rid of the close of the 'a href' tag.
gen quotepos = strpos(line, `"">"')
replace line = substr(line,1,quotepos-1)
drop quotepos

// The specific URL we downloaded happened to have a link to
// its parent directory.  That likely isn't wanted, so drop it.
drop if line=="../"

// We are done: if we -list line- we should see the links that
// were in the URL.  Note that the code above does not use Stata's
// regular  expression functions as many people are not familiar
// with regular expressions, but they would have been another
// powerful and flexible way to approach this problem.
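For completeness, here is a hedged sketch of that regular-expression route. It assumes the lines have already been read into the string variable -line- as above; it is an alternative to the -strpos()-/-substr()- steps, not part of the original solution.

```stata
// Keep only the lines that contain a link, then pull out the
// quoted URL.  regexs(1) returns the first parenthesized
// subexpression of the pattern most recently matched by regexm().
keep if regexm(line, `"<a href="([^"]*)""')
gen link = regexs(1) if regexm(line, `"<a href="([^"]*)""')
list link
```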

