Statalist



Re: st: Download and parse html files (and regex trouble)


From   "Gabi Huiber" <ghuiber@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Download and parse html files (and regex trouble)
Date   Thu, 3 Apr 2008 03:06:28 -0400

In my earlier reply in this thread I blamed the ... (dot-dot-dot)
special character for breaking my file read code. That was not the
cause.

The command file read `fh' line chokes on do-file lines that carry a
trailing comment, that is, a comment started with a double forward
slash (//) before the end of the line. I have no idea how to make that
go away. I tried enclosing my file read/file write routine within this
if-condition:

if !regexm("macval(`line')","[[a-zA-Z0-9][:punct:]]*\/\/"){
read line in this file
write line in that file
}

But that had no effect.
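
For reference, here is an untested sketch of how such a filter might be
laid out, assuming the goal is simply to skip lines that contain //. It
reads each line first and tests it afterwards, and it puts macval()
inside the macro quotes, wrapped in compound double quotes. The file
names source.do and filtered.do and the handle names src and dst are
placeholders, and strpos() stands in for the regular expression. I do
not know whether this gets around the problem described above.

```stata
* Sketch only: copy source.do to filtered.do, skipping any line that
* contains a double forward slash. File names are hypothetical.
tempname src dst
file open `src' using "source.do", read text
file open `dst' using "filtered.do", write text replace

file read `src' line
while r(eof) == 0 {
    * The line is already in local macro line; macval() keeps any
    * macro characters inside it from being expanded again.
    if strpos(`"`macval(line)'"', "//") == 0 {
        file write `dst' `"`macval(line)'"' _n
    }
    file read `src' line
}

file close `src'
file close `dst'
```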

Gabi

On Thu, Apr 3, 2008 at 12:20 AM, Sebastian Bauhoff <sbauhoff@gmail.com> wrote:
> Dear Statalisters,
>
> I need to download a large number of html files from the internet and parse
> their content.  The structure of the html pages is always the same, and I
> need to extract only a small part that is identified within the html code.
> I would like to use Stata to download the files, extract the information I
> want, and save the result in a dataset.  Any suggestions or pointers much
> appreciated.
>
> Thanks,
> Sebastian
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>


