Statalist The Stata Listserver



RE: st: import html


From   "Rodrigo Martell" <rodrigo.martell@frontier-economics.com.au>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: import html
Date   Thu, 15 Mar 2007 09:28:55 +1100

Alan,

Thanks! That is fantastic! For some time I had been approaching this problem (I have lots of links with the same problem but different files listed in different formats) by writing console apps to try to do it. Doing it all in Stata is much easier and more elegant.

Thanks again,

Rodrigo 

Rodrigo Martell

Frontier Economics Pty. Ltd.
395 Collins Street
Melbourne VIC 3000
Australia
www.frontier-economics.com
switch: +61 (0)3 9620 4488
direct: +61 (0)3 9613 1518
fax:    +61 (0)3 8614 2711
mobile: +61 (0)407 909 811
email:  rodrigo.martell@frontier-economics.com





-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Alan Riley
Sent: Thursday, 15 March 2007 2:37 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: import html


Rodrigo Martell (rodrigo.martell@frontier-economics.com) is trying
to use Stata to process an HTML file to retrieve the links in it:
> I'm trying to get the list of links on this web page (http://www.nemweb.com.au/Reports/CURRENT/DispatchIS_Reports/) as a list in a text file.
> I tried using -copy- to save it as a text file but, naturally, that saves it as messy HTML, and importing it into Stata fails because the strings are too long.
> I tried -insheet- with some delimiters, but it doesn't seem to like this and gives an r(198) error.
> I found some HTML-related programs using -findit-, but none seem to read HTML in; they only write HTML out.
> 
> Does anyone know a clever way to do this in Stata? I'm running out of ideas. I might have to look into writing a console application that does it, so that my program can shell out to it when it needs to fetch the list of links as a text file.


There are multiple ways to approach this in Stata.  Since Rodrigo has
a specific HTML file in mind with a specific format, it will be easier
to solve this problem than to try to write a general-purpose HTML
link finder (I would probably use Mata for that since it can handle
strings of any length).

The biggest problem with the file at the URL Rodrigo provided is
that most of it is in one very long line which is difficult for
Stata to manage.  The -filefilter- command, however, can be used
to pass through the file and give us some line breaks.  Then we
need to read each line into Stata, keep those with links, process
out extraneous characters on the line, and we should be left
with all the links in the file.

I include commented code below to do just that for this particular
URL:

-------------------------------------------------------------------
version 9
set more off
drop _all

local toget "http://www.nemweb.com.au/Reports/CURRENT/DispatchIS_Reports/"
tempfile tmpone
tempfile tmptwo

// Get a copy of the file at the URL
copy "`toget'" "`tmpone'", replace

// Run the downloaded file through -filefilter-, changing every
// greater-than character in it to a greater-than character followed
// by a newline character.  This effectively makes sure that every
// HTML tag in the file is on a line by itself.
filefilter "`tmpone'" "`tmptwo'", from(">") to(">\n")

// We are lucky that with this specific URL, after adding the
// newlines above, no single line is longer than the limit
// on string length in Stata.  So, we can read each line as
// an observation of a string variable.
infix str line 1-244 using "`tmptwo'"

// Look for lines containing   <a href="  since these will be
// the lines with links.  Keep only those lines, then get rid
// of anything on the line up to and including  <a href="
// so that we are left with the link itself followed by the
// close of the 'a href' tag.
gen hrefpos = strpos(line,`"<a href=""')
keep if hrefpos > 0
replace hrefpos = hrefpos + length(`"<a href=""') 
replace line = substr(line,hrefpos,.)
drop hrefpos

// Get rid of the close of the 'a href' tag.
gen quotepos = strpos(line, `"">"')
replace line = substr(line,1,quotepos-1)
drop quotepos

// The specific URL we downloaded happened to have a link to
// its parent directory.  That likely isn't wanted, so drop it.
drop if line=="../"

// We are done: if we -list line- we should see the links that
// were in the URL.  Note that the code above does not use Stata's
// regular expression functions as many people are not familiar
// with regular expressions, but they would have been another
// powerful and flexible way to approach this problem.

-------------------------------------------------------------------
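
For readers curious about the regular-expression route mentioned in
the closing comment, a minimal sketch follows.  It is not part of the
posted code.  It assumes the data in memory are as they stand right
after the -infix- step (one HTML tag per observation in the string
variable line), and the variable name link is made up for
illustration.  It uses the -regexm()- and -regexs()- functions.

-------------------------------------------------------------------
// Assumes -line- holds one HTML tag per observation, as it does
// right after the -infix- step.  -link- is a hypothetical name.

// regexm() returns 1 when the pattern matches; regexs(1) then
// returns the text captured by the first parenthesized group,
// here everything between the double quotes of  <a href="..."
gen str244 link = regexs(1) if regexm(line, `"<a href="([^"]*)""')

// Observations without a link are left with an empty -link-.
keep if link != ""

// As before, drop the link to the parent directory.
drop if link == "../"

list link
-------------------------------------------------------------------

The capture group pulls out the link in a single step, so the
position arithmetic with -strpos()- and -substr()- is not needed.
Like the posted code, the pattern is case sensitive and matches only
a lowercase  <a href="  tag.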


Alan
(ariley@stata.com)
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



