Statalist


Re: st: Download and parse html files (and regex trouble)


From   "Austin Nichols" <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Download and parse html files (and regex trouble)
Date   Thu, 3 Apr 2008 08:23:30 -0400

Gabi Huiber <ghuiber@gmail.com> and Sebastian Bauhoff <sbauhoff@gmail.com>:
I suspect Gabi's file contains some extended ASCII characters that are
causing the trouble. They may not even be visible, and they will not
survive in plain text email to the list. Try analyzing the file with

 hexdump `file', analyze results

and then use -filefilter- to remove or replace potentially problematic
characters before further processing.
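
As a minimal sketch of that cleanup, assuming the culprit turns out to be
the Windows-1252 ellipsis (byte 133 decimal, 85 hex); the file names
mydo.do and clean.do are hypothetical:

```stata
 * mydo.do and clean.do are hypothetical file names
 * report the line-end format and counts of control/extended characters
 hexdump mydo.do, analyze results
 * assumption: the offending byte is decimal 133 (Windows-1252 ellipsis);
 * replace it with three plain periods and write the result to clean.do
 filefilter mydo.do clean.do, from(\133d) to("...") replace
```

The byte to filter should be read off the -hexdump- output rather than
assumed. -filefilter- takes one from()/to() pair per call, so removing
several different characters takes several passes.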

For Sebastian's problem, he may be able to do something like this:

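 copy http://whatever.com/some.html i.html, replace
 insheet using i.html, clear
 * flag the line that opens and the line that closes the block of interest
 g firstline=substr(v1,1,9)=="something"
 g lastline=substr(v1,1,14)=="something else"
 * assuming each marker occurs once, keep is 1 from the opening line
 * up to, but not including, the closing line
 g keep=sum(firstline)-sum(lastline)
 list if keep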

but -insheet- will also choke on extended ASCII characters, so
-hexdump- and -filefilter- may be required first.

On Thu, Apr 3, 2008 at 3:06 AM, Gabi Huiber <ghuiber@gmail.com> wrote:
> In an earlier response of mine to this post I blamed the ...
>  (dot-dot-dot) special character for breaking my file read code. That
>  was not the reason.
>
>  The command file read `fh' line chokes on do-file lines where a
>  comment is inserted before the end of the line with the double forward
>  slash syntax. I have no idea how to make that go away. I tried
>  enclosing my file read/file write routine within this if-condition:
>
>  if !regexm("macval(`line')","[[a-zA-Z0-9][:punct:]]*\/\/"){
>  read line in this file
>  write line in that file
>  }
>
>  But that had no effect.
>
>  Gabi
>
>  On Thu, Apr 3, 2008 at 12:20 AM, Sebastian Bauhoff <sbauhoff@gmail.com> wrote:
>  > Dear Statalisters,
>  >
>  > I need to download a large number of html files from the internet and parse
>  > their content.  The structure of the html pages is always the same, and I
>  > need to extract only a small part that is identified within the html code.
>  > I would like to use Stata to download the files, extract the information I
>  > want, and save the result in a dataset.  Any suggestions or pointers much
>  > appreciated.
>  >
>  > Thanks,
>  > Sebastian