
From: "Eric A. Booth" <ebooth@ppri.tamu.edu>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Read HTML file with Stata
Date: Mon, 13 Jul 2009 23:26:00 -0500

Hello Friedrich:

For a rough example:

**********************
clear
local sf "`pwd'"

tokenize <tr> </tr> <b> </b> <i> </i> <td> </td> </a> </th>
while "`1'" != "" {
    /* I find that it's useful to get rid of the quotes at this point */
    mac shift
}

intext using "`sf'test.txt", gen(html) length(90)
save "`sf'test.dta", replace

/* NOTE
   Assuming your data is very structured/consistent, at this point you
   could just -keep- the variables that contain your target information
   (so, here you could keep rows 194/217).
   If you want to find certain fields or data indicated by some tag,
   use the steps below: */
**
/* NOTE
   For this Stata9 web page table, we could use # as the indicator for
   the row/column headings and we could use "<a href=" as an indicator
   of the cell values */
findval "#", substr gen(flag_headings)
findval "<a href=", substr gen(flag_cells)
drop if flag_headings==0 & flag_cells==0
keep html1
**
**Finally, split html1 & create a -substr- to clean up**
gen colheadings = strpos(html1, "#CFCFCF;%>")
gen rowheadings = strpos(html1, "#EFEFEF;%>")
gen cells_win = strpos(html1, "/win/%>")
gen cells_mac = strpos(html1, "/mac/%>")
drop if colheadings==0 & rowheadings==0 & cells_win==0 & cells_mac==0
**
foreach v in colheadings rowheadings cells_win cells_mac {
    gen str20 `v'2 = ""
}
replace colheadings2 = substr(html1, colheadings+10,.) if colheadings>0
replace rowheadings2 = substr(html1, rowheadings+10,.) if rowheadings>0
replace cells_win2 = substr(html1, cells_win+7,.) if cells_win>0
replace cells_mac2 = substr(html1, cells_mac+7,.) if cells_mac>0
**
drop colheadings rowheadings cells_win cells_mac html1
list, noobs sep(1) div
save "`sf'test_final.dta", replace
************************
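The -while- loop in the example above steps through the list of HTML tags, but its body as shown contains only a comment and -mac shift-; the command that actually strips each tag does not appear. One possible way to fill in that step is with official Stata's -filefilter-, sketched below. Everything here is an assumption rather than the original code: the input file name test.html is hypothetical, the use of temporary files is only one way around the fact that -filefilter- cannot read and write the same file, and the replacement of double quotes with % is inferred from the "%>" patterns searched for later in the example.

```stata
* Sketch only: strip a list of HTML tags from a file with -filefilter-.
* "test.html" is a hypothetical input name; "test.txt" is the file that
* -intext- reads in the example above.
tempfile a b
copy "test.html" "`a'"

tokenize <tr> </tr> <b> </b> <i> </i> <td> </td> </a> </th>
while "`1'" != "" {
    * -filefilter- needs distinct input and output files, so write to
    * `b' and copy the result back to `a' before the next pass
    filefilter "`a'" "`b'", from("`1'") to("") replace
    copy "`b'" "`a'", replace
    mac shift
}

* Assumption: swap double quotes (\Q) for % so that later string
* matching does not have to deal with embedded quotes
filefilter "`a'" "test.txt", from("\Q") to("%") replace
```

Each pass removes one tag from the whole file, so the cost grows with the number of tags; for a short tag list like this one that is unlikely to matter.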

Best,
Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754

On Jul 13, 2009, at 5:05 PM, Friedrich Huebler wrote:

I have a set of Excel files that I convert to Stata format with Stat/Transfer, called from Stata with -stcmd- by Roger Newson. Some of the original files are HTML files with an XLS extension that cannot be converted by Stat/Transfer. I can open these files with Excel and save them in native Excel format but would prefer a solution that does not involve Excel.

Can anyone recommend a method to read HTML files into Stata? There are a number of add-ons that allow export to HTML format but I found nothing that goes the other way, from HTML to Stata.

Thanks,
Friedrich

