Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: import html , what is the proper way?
Date: Wed, 5 Feb 2014 13:50:08 +0000
The extra detail you now give on your problem is indeed crucial.

-insheet- is, as I understand it, written on the assumption that the file being read is a text file in a form suitable, notably, for direct import to a spreadsheet. Its working even roughly for a file with mark-up present is good fortune and not built in by design. Its source code is not visible to users so I can't comment on internal details.

I'd expect to have to write a program based on -file- to handle your problem. Others may have more optimistic news for you.

Nick
njcoxstata@gmail.com

On 5 February 2014 13:30, Lucas Ferreira Mation <lucasmation@gmail.com> wrote:
> Thank you Nick.
> Copy+Paste won't work. I did not explain in the original email, but
> the page below is just one of several other pages. I'm actually doing
> it recursively for hundreds of pages, sort of web scraping.
>
> After importing, for each page I extract the URLs of the projects, the
> project names and the project numbers, throwing away the HTML tags and
> everything else.
>
> The problem I'm having is importing the data.
> How does "insheet" (or the web browser for that matter) know what to
> interpret as a line break in an HTML file?
>
> On Wed, Feb 5, 2014 at 11:07 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> I don't think there can be a single proper way to import HTML files,
>> as HTML is a mark-up language, not a file format defining a
>> Stata-compatible data file.
>>
>> In the example you give there is just a list of projects. Is that the
>> data? If it is, copy-and-paste from what you see in the browser into
>> Stata's editor gives a good start, after which you just -drop-
>> unwanted lines. I don't see that you want to import the mark-up at
>> all.
>> Nick
>> njcoxstata@gmail.com
>>
>> On 5 February 2014 12:38, Lucas Ferreira Mation <lucasmation@gmail.com> wrote:
>>> Hello,
>>>
>>> I'm trying to import data from the web page. From a previous post, I saw
>>> there are two ways to import from HTML, "insheet" or "infile"
>>> (sometimes preceded by "copy" > "filefilter" to filter breaks and
>>> unwanted HTML tags). I tried both ways:
>>>
>>> . version 12.1 // stata12.1 running on a windows 7 machine
>>> . global url http://www.ipea.gov.br/portal/index.php?option=com_content&view=article&id=16643&catid=117&Itemid=5
>>> . insheet using "$url", clear
>>> . infile str244 text using "$url", clear
>>>
>>> Neither really works:
>>>
>>> infile: the imported file is all corrupt; it seems that every space was
>>> interpreted as a line break. Can I solve this with filefilter?
>>>
>>> insheet: line breaks seem to be fairly OK (although not perfect in all
>>> cases), but some rows were split into different columns (I suppose
>>> the lines that had a "," in them). Is there a "never occurring
>>> delimiter" that I could use so the variables are never split?
>>>
>>> More generally, is there a way to import from HTML so that the
>>> imported file looks just like the source code I see in the
>>> browser?
>>>
>>> tks
>>> Lucas

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
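
For readers following the thread, here is a minimal sketch of the kind of -file--based reading suggested in the reply above. It is an illustration under stated assumptions, not code from either poster. The local macro names (url, page, fh, line, n) and the variable name text are hypothetical choices, and the URL is the one quoted in the original question. The sketch downloads the page with -copy-, then reads it one line at a time, so that commas and spaces in the HTML are never treated as delimiters. In Stata 12 the longest string type is str244, so longer lines would not be stored in full.

```stata
* Sketch only: read an HTML page line by line with -file- (assumes Stata 12.1)
version 12.1
local url "http://www.ipea.gov.br/portal/index.php?option=com_content&view=article&id=16643&catid=117&Itemid=5"

tempfile page
copy "`url'" "`page'"                // save the raw HTML to a temporary file

clear
gen str244 text = ""                 // one observation per line of HTML
tempname fh
file open `fh' using "`page'", read text
file read `fh' line                  // read the first line into local macro line
local n = 0
while r(eof) == 0 {
    local ++n
    quietly set obs `n'
    * macval() stops Stata from expanding any ` or $ found in the HTML
    quietly replace text = `"`macval(line)'"' in `n'
    file read `fh' line              // read the next line; sets r(eof) at end of file
}
file close `fh'

* Hypothetical follow-up: keep only the lines that contain a link
keep if strpos(text, "href=") > 0
```

Wrapping these steps in a loop over a list of URLs would be one way to handle the hundreds of pages mentioned above; pulling the project names and numbers out of each kept line would still need string functions suited to the actual mark-up.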