Statalist



Re: st: Download and parse html files [and trouble with special characters]


From   "Gabi Huiber" <[email protected]>
To   [email protected]
Subject   Re: st: Download and parse html files [and trouble with special characters]
Date   Thu, 3 Apr 2008 01:55:38 -0400

I am doing something like that right now, except instead of html files
I am parsing a set of do-files saved at different points over the last
two years. I want to read each file line by line and write the lines
that start with a specific string to a text file next to the date when
that do-file was saved, to have a record of how this particular chunk
of code changed over time.

So this message has two parts: first I'll try to help out Sebastian.
Then I'll tell you where I ran aground.

1. I am doing this file reading and writing for the first time in
Stata (I normally use PHP for that, which I know is quite a
workaround, but that's another story). I find that the "file" section
of the Stata 10 manual (p.140 of the [P] book) has everything that
Sebastian needs. But one concrete suggestion would be this:

tempname fh_in fh_out
local myfilein "your html file here"
local myfileout "your text file here"
local linenum=0
* quote the file names so that paths containing spaces work;
* replace overwrites an existing output file
file open `fh_in' using "`myfilein'", read text
file open `fh_out' using "`myfileout'", write text replace
file read `fh_in' line
while r(eof)==0 {
    local linenum=`linenum'+1
    * compound quotes keep double quotes inside the html line from ending the string;
    * put the part you want in parentheses so that regexs() can return it
    if regexm(`"`macval(line)'"',"<your html tag of interest here>(.*)</tag over>") {
        * 1 = first parenthesized subexpression; see the URL below for the others
        local myline=regexs(1)
        di `"`myline'"'
        file write `fh_out' `"`myline'"' _n
    }
    file read `fh_in' line
}
file close `fh_in'
file close `fh_out'

For details on subexpression numbers 0-9 see here:
http://www.stata.com/support/faqs/data/regex.html.
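
To make the subexpression numbering concrete, here is a minimal sketch for a do-file. The <title> tag and the sample line are made up for illustration; they are not from Sebastian's pages. regexs(0) is the whole match and regexs(1) is whatever the first pair of parentheses captured.

```stata
* Hypothetical input line; the <title> tag is only an illustration
local line `"<title>My page</title>"'
if regexm(`"`macval(line)'"', "<title>(.*)</title>") {
    local whole = regexs(0)   // the entire matched text, tags included
    local inner = regexs(1)   // only the text between the tags
    display `"`whole'"'
    display `"`inner'"'
}
```

Note that .* is greedy: if the same tag appears twice on one line, the match runs through the last closing tag, so the pattern may need tightening for real pages.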

2. And here's what ails me. My do-files have some comment sections
like the one shown below:

   *Education Classification
   *------------------------
   *1. College Grad+ … where Master >150 and Bachelor >150
   *2. College … where Bachelor >105

If you move the cursor over the ...'s above, you will see that
these are not three separate periods. Each one is a single special
character, some kind of dot-dot-dot. This trips up the file read
command. Look:

 … where Grade School >150" invalid name
r(198);

Does anybody know how to get Stata to run through such characters? I
remember a long time ago I had a similar problem with some
double-quotes that I cut and pasted into the do-file editor. They came
from MS Word and were pretty (like so: “ ”) and Stata snorted on them.
It wanted them plain (like so: " "). At the time I just made a mental
note to always use a text editor for code, and that was that. But what
can I do now?

Thank you,
Gabi



On Thu, Apr 3, 2008 at 12:20 AM, Sebastian Bauhoff <[email protected]> wrote:
> Dear Statalisters,
>
> I need to download a large number of html files from the internet and parse their content.  The structure of the html pages is always the same, and I need to extract only a small part that is identified within the html code.  I would like to use Stata to download the files, extract the information I want, and save the result in a dataset.  Any suggestions or pointers much appreciated.
>
> Thanks,
> Sebastian
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



