Statalist



Re: st: Download and parse html files (and regex trouble)


From   "Gabi Huiber" <ghuiber@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Download and parse html files (and regex trouble)
Date   Thu, 3 Apr 2008 11:14:44 -0400

I figured it out. Compound double quotes came to the rescue. Here goes,
for documentation:

I am trying to get Stata to read a bunch of do-files line by line as
if they weren't do-files, but just ASCII gibberish.

For example, one file starts like this:

*THIS IS DO-FILE OF 03/08/2007


*	global pathlog 	"C:\myproject\log"		// log files
*	global pathsource "C:\myproject\source"		// sources

Now, say I am trying to just read this file line by line and output it
to the screen, for kicks, using the code below:

global bigroot "c:/myproject/"
global subsfrom "${bigroot}do/"
local snapshot "20070309"

tempname fh_in
local myfilein "${subsfrom}subs_`snapshot'.do"

file open `fh_in' using `myfilein', read

local linenum=0
file read `fh_in' line
while r(eof)==0 {
    local linenum=`linenum'+1
    di _asis "`macval(line)'"
    file read `fh_in' line
}

Doing so will produce this output:


*THIS IS DO-FILE OF 03/08/2007


*       global pathlog  C:\myproject\log"                      // log files" invalid name
r(198);

end of do-file
r(198);

Where the colors go as follows:

*       global pathlog  - is yellow, so all is well
C:\myproject\log"                      // log files" invalid name  - is all red

So it looks to me like Stata read everything up to the fourth line and
displayed it on screen as expected. Inside the fourth line it stopped
at the first double quote, after the word 'pathlog', because that quote
closed the string that di had opened. Then it tried to interpret
C:\myproject and everything after it as Stata syntax, and it had no
idea what to make of that. The solution, as shown somewhere in the
manual and in earlier posts here, is to use compound double quotes.
The di _asis line should be

di _asis `"`macval(line)'"'

This happened well before the spot where the previously reported
dot-dot-dot character was first encountered. Compound double quotes
took care of that too, so the earlier problem went away.
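
For documentation, the corrected loop in full might look like the
sketch below. It assumes the same paths and snapshot date as above.
Using tempname for the file handle, quoting the file name, and closing
the file at the end are tidying assumptions of mine, not part of the
fix itself.

global bigroot "c:/myproject/"
global subsfrom "${bigroot}do/"
local snapshot "20070309"

* tempname returns a temporary name that can serve as a file handle
tempname fh_in
local myfilein "${subsfrom}subs_`snapshot'.do"

file open `fh_in' using "`myfilein'", read

local linenum = 0
file read `fh_in' line
while r(eof)==0 {
    local linenum = `linenum' + 1
    * compound double quotes allow the line to hold its own double quotes
    * macval() keeps macro references inside the line from being expanded
    di _asis `"`macval(line)'"'
    file read `fh_in' line
}
file close `fh_in'

The only change that matters for the error above is the quoting on the
di line.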

But regarding that one: Dave's suggestion of opening the file with a
competent editor should go without saying. Emacs displays that
character as \205 (an octal escape, i.e. decimal 133) and shows it in
red. Weirdly, hexdump didn't seem to catch it, or I am misreading the
output:

hexdump `myfilein', analyze results

  Line-end characters                        Line length (tab=1)
    \r\n         (DOS)              1,806      minimum                        1
    \r by itself (Mac)                  0      maximum                      162
    \n by itself (Unix)                 0
  Space/separator characters                 Number of lines              1,806
    [blank]                         6,867      EOL at EOF?                  yes
    [tab]                           3,563
    [comma] (,)                       206    Length of first 5 lines
  Control characters                           Line 1                        32
    binary 0                            0      Line 2                         1
    CTL excl. \r, \n, \t                0      Line 3                         1
    DEL                                 0      Line 4                        53
    Extended (128-159,255)              9      Line 5                        54
  ASCII printable
    A-Z                             4,855
    a-z                            37,272    File format                 BINARY
    0-9                             3,935
    Special (!@#$ etc.)            12,221
    Extended (160-254)                  0
                          ---------------
  Total                            72,540

  Observed were:
     \t \n \r blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < =
     > ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b
     c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ E^E
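
A possible reading of the table: Emacs's \205 is octal for decimal 133,
which falls in the "Extended (128-159,255)" row, so the 9 characters
counted there may well be the dot-dot-dot character. If it ever needed
replacing before further processing, as Austin suggested, a sketch with
-filefilter- could look like this. The name of the cleaned file is made
up for illustration, and I am assuming that byte 133 is the only
extended character in the file.

local cleaned "${subsfrom}subs_`snapshot'_clean.do"

* \133d is filefilter notation for the byte with decimal code 133
filefilter "`myfilein'" "`cleaned'", from("\133d") to("...") replace

* check that no extended characters remain in the cleaned copy
hexdump "`cleaned'", analyze

If the assumption holds, the "Extended (128-159,255)" count for the
cleaned file should drop to 0.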

But all is well. Thank you, Austin and Dave, for your help.

Gabi

On Thu, Apr 3, 2008 at 9:45 AM, David Airey <david.airey@vanderbilt.edu> wrote:
>
> I am often not able to copy and paste code from the list and have it run as
> the author intended. Opening such text in BBedit or other text editor that
> can see text gremlins often tells me I have invisible characters that are
> the problem.
>
> -Dave
>
>
>
> On Apr 3, 2008, at 7:23 AM, Austin Nichols wrote:
> > Gabi Huiber <ghuiber@gmail.com> and Sebastian Bauhoff <sbauhoff@gmail.com>:
> > I suspect Gabi has some extended ASCII characters in there causing the
> > trouble that one might not even be able to see, and will not appear in
> > plain text email to the list. Try analyzing the file with
> >
> > hexdump `file', analyze results
> >
> > and then use -filefilter- to remove or replace potentially problematic
> > characters before further processing.
> >
> > For Sebastian's problem, he may be able to do something like this:
> >
> > copy http://whatever.com/some.html  i.html, replace
> > insheet using i.html
> > g firstline=substr(v1,1,9)=="something"
> > g lastline=substr(v1,1,14)=="something else"
> > g keep=sum(firstline)-sum(lastline)
> > list if keep
> >
> > but -insheet- will also choke on extended ASCII characters, so
> > -hexdump- and -filefilter- may be required first.
> >
> > On Thu, Apr 3, 2008 at 3:06 AM, Gabi Huiber <ghuiber@gmail.com> wrote:
> >
> > > In an earlier response of mine to this post I blamed the ...
> > > (dot-dot-dot) special character for breaking my file read code. That
> > > was not the reason.
> > >
> > > The command file read `fh' line chokes on do-file lines where a
> > > comment is inserted before the end of the line with the double forward
> > > slash syntax. I have no idea how to make that go away. I tried
> > > enclosing my file read/file write routine within this if-condition:
> > >
> > > if !regexm("macval(`line')","[[a-zA-Z0-9][:punct:]]*\/\/"){
> > > read line in this file
> > > write line in that file
> > > }
> > >
> > > But that had no effect.
> > >
> > > Gabi
> > >
> > > On Thu, Apr 3, 2008 at 12:20 AM, Sebastian Bauhoff <sbauhoff@gmail.com> wrote:
> > >
> > > > Dear Statalisters,
> > > >
> > > > I need to download a large number of html files from the internet and parse
> > > > their content.  The structure of the html pages is always the same, and I
> > > > need to extract only a small part that is identified within the html code.
> > > > I would like to use Stata to download the files, extract the information I
> > > > want, and save the result in a dataset.  Any suggestions or pointers much
> > > > appreciated.
> > > >
> > > > Thanks,
> > > > Sebastian
> > > >
> > >
> > *
> > *   For searches and help try:
> > *   http://www.stata.com/support/faqs/res/findit.html
> > *   http://www.stata.com/support/statalist/faq
> > *   http://www.ats.ucla.edu/stat/stata/
> >
>
> --
> David C. Airey, Ph.D.
> Pharmacology Research Assistant Professor
> Center for Human Genetics Research Member
>
> Department of Pharmacology
> School of Medicine
> Vanderbilt University
> Rm 8158A Bldg MR3
> 465 21st Avenue South
> Nashville, TN 37232-8548
>
> TEL   (615) 936-1510
> FAX   (615) 936-3747
> EMAIL david.airey@vanderbilt.edu
> URL   http://people.vanderbilt.edu/~david.c.airey/dca_cv.pdf
> URL   http://www.vanderbilt.edu/pharmacology
>


