Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Read HTML file with Stata


From   Friedrich Huebler <fhuebler@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Read HTML file with Stata
Date   Tue, 14 Jul 2009 09:42:52 -0400

Eric,

Thank you for the instructive example. The files I am trying to read
were created with SAS using PROC REPORT and they consist of rows and
columns with text and numbers. The HTML code contains formatting
information, for example for the column width and borders. The first
lines of all files are the same:

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40";>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 11">

Additional lines specify the page size, margins, font, and so on. A
typical data entry looks like this:

<tr height=17 style='height:12.75pt'>
<td class=xl24 width=70
style='border-top:none;border-left:none;width:53pt' x:num>191</td>
<td class=xl24 width=87
style='border-top:none;border-left:none;width:65pt' x:num>104</td>
</tr>

I will see what I can do with -filefilter-.

Friedrich

On Tue, Jul 14, 2009 at 12:26 AM, Eric A. Booth<ebooth@ppri.tamu.edu> wrote:
> Hello Friedrich:
>
> It might help if we get a bit more detail on what your HTML file looks like
> (e.g., is it a webpage with embedded tables of some sort; does it just
> contain data that is delimited or broken into sections with tags; etc.)--I
> think it would be difficult to create a add-on or program that simply
> converts any HTML file to .XLS.
>
> It's probably not the most elegant solution, but I was automating the
> extraction of some text from a table embedded in a webpage by using
> -filefilter- and some string functions to get rid of the HTML tags, etc.  --
> leaving only the data I needed.  I then used -intext- to bring the
> information into Stata before cleaning it up some more.
>
> For a rough example:
> Say I wanted to extract the elements from the small table embedded in this
> Stata webpage:   http://www.stata.com/support/updates/stata9.html .
>
> Given that there is a lot of extra information on the page, I would need to
> look at the source code of the HTML web page and get rid of the extra
> information & get at the elements in that small embedded table.
>  (Friedrich's issue may be much simpler in that his HTML file might only be
> formatted data with a couple of HTML tags/strings at the top & bottom of the
> file that can be filtered out using -filefilter-)
>
> Here's how I might extract the elements from this Stata web page table into
> a .dta file  (I would appreciate any other users' comments on how to make
> this process easier or more efficient) :
>
> **********************
> clear
> local sf "`pwd'"
>
>
> **get some text / information from any webpage using !curl (in linux/unix)
> or -copy- **
> copy http://www.stata.com/support/updates/stata9.html  "`sf'test.txt",
> replace
> **alternate**  !curl -o --anyauth --ignore-content-length  "`sf'test.txt"
>  http://www.stata.com/support/updates/stata9.html
>
>
> **clean it up a bit by filtering out the HTML tags I don't need (you could
> put all possible HTML tags in this list)**
>
> tokenize <tr> </tr> <b> </b> <i> </i> <td> </td> </a> </th>
> while "`1'" != "" {
>         filefilter "`sf'test.txt" "`sf'testnew.txt", from("`1'") to("")
> replace
>         filefilter "`sf'testnew.txt" "`sf'test.txt", from("\Q") to("%")
> replace
>        /*I find that it's useful to get rid of the quotes at this point */
>        mac shift
>    }
> intext using "`sf'test.txt", gen(html) length(90)
> save "`sf'test.dta", replace
>
>
> /*                            NOTE
> Assuming your data is very structured/consistent,
> at this point you could just -keep- the variables
> that contain your target information (so,
> here you could keep rows 194/217 )
> If you want to find certain fields or data
>  indicated by some tag, use the steps below:
> */
>
> **
>
> /*                            NOTE
> For this Stata9 web page table,
> we could use # as the indicator for the row/column headings
> and we could use "<a href=" as an indicator of the cell values
> */
>
> findval "#", substr gen(flag_headings)
> findval "<a href=", substr gen(flag_cells)
> drop if flag_headings==0 & flag_cells==0
> keep html1
> **
>
> **Finally,  split  html1 & create a -substr- to clean up**
>
> gen colheadings = strpos(html1, "#CFCFCF;%>")
> gen rowheadings = strpos(html1, "#EFEFEF;%>")
> gen cells_win = strpos(html1, "/win/%>")
> gen cells_mac = strpos(html1, "/mac/%>")
> drop if colheadings==0 & rowheadings==0 &  cells_win==0 &  cells_mac==0
>
> **
> foreach v in colheadings rowheadings cells_win cells_mac {
>    gen str20 `v'2 = ""
> }
> replace colheadings2 = substr(html1, colheadings+10,.) if colheadings>0
> replace rowheadings2 = substr(html1, rowheadings+10,.) if rowheadings>0
> replace cells_win2 = substr(html1, cells_win+7,.) if cells_win>0
> replace cells_mac2 = substr(html1, cells_mac+7,.) if cells_mac>0
> **
> drop colheadings rowheadings cells_win cells_mac html1
> list, noobs sep(1) div
> save "`sf'test_final.dta", replace
> ************************
>
>
> This leaves me with the following table, which you could then use to pluck
> out the elements you needed for values in a table or variable/value labels
> elsewhere:
>
>  +----------------------------------------------------------------------------+
>  | colhea~2 |      rowheadings2 |          cells_win2 |           cells_mac2
> |
>  |----------+-------------------+---------------------+-----------------------|
>  |   Update |                   |                     |
>   |
>  |----------+-------------------+---------------------+-----------------------|
>  | Platform |                   |                     |
>   |
>  |----------+-------------------+---------------------+-----------------------|
>  |               | Stata executables |                     |
>       |
>  |----------+-------------------+---------------------+-----------------------|
>  |               |                                  | Stata 9 for Windows |
>                       |
>  |----------+-------------------+---------------------+-----------------------|
>  |          |                   |                                       |
> Stata 9 for Macintosh |
>  |----------+-------------------+---------------------+-----------------------|
>  |                 |   Stata ado-files |                     |
>         |
>
>
> One of the many advantages of using !curl (or !url2file if you've got
> Windows), in lieu of Stata's -copy-,  is that you can send a pre-filled form
> to the website by utilizing the "--form <name=content>" option.  So, if you
> are downloading all the tables generated by a site that requests that you
> fill out a form or select some form attributes before you get each file, you
> could send the information that should go into those fields using !curl &
> automatically download the data (assuming there is no CAPTCHA).
>
> Again, I do too much of this type of thing, so I too would also be
> interested in any suggestions or other approaches to solving Friedrich's
> HTML conversion problem.
>
>
>
>
> Best,
>
> Eric
>
>
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> ebooth@ppri.tamu.edu
> Office: +979.845.6754
>
>
>
>
> On Jul 13, 2009, at 5:05 PM, Friedrich Huebler wrote:
>
>> I have a set of Excel files that I convert to Stata format with
>> Stat/Transfer, called from Stata with -stcmd- by Roger Newson. Some of
>> the original files are HTML files with an XLS extension that cannot be
>> converted by Stat/Transfer. I can open these files with Excel and save
>> them in native Excel format but would prefer a solution that does not
>> involve Excel. Can anyone recommend a method to read HTML files into
>> Stata? There are a number of add-ons that allow export to HTML format
>> but I found nothing that goes the other way, from HTML to Stata.
>>
>> Thanks,
>>
>> Friedrich

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index