
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: import html , what is the proper way?


From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: import html , what is the proper way?
Date   Wed, 5 Feb 2014 13:50:08 +0000

The extra detail you now give on your problem is indeed crucial.

-insheet- is, as I understand it, written on the assumption that the
file being read is a text file in a form suitable, notably, for direct
import to a spreadsheet. Its working even roughly for a file with
mark-up present is good fortune and not built in by design. Its source
code is not visible to users so I can't comment on internal details.

I'd expect to have to write a program based on -file- to handle your
problem. Others may have more optimistic news for you.
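
As a rough illustration of what a -file- based reader might look like, here is
a minimal sketch, not a tested solution. It assumes the page has first been
saved locally with -copy-; the file name page.html and the variable name text
are hypothetical, and the str244 width is simply carried over from the
-infile- attempt quoted below. Each line of the HTML source becomes one
observation, with no splitting on commas or spaces.

```stata
* Sketch only: read an HTML file line by line, one line per observation.
* Hypothetical names: page.html (local copy), text (string variable).
version 12.1
copy "$url" "page.html", replace     // $url as defined in the original post

clear
set obs 1
generate str244 text = ""

tempname fh
local n = 0
file open `fh' using "page.html", read text
file read `fh' line
while r(eof) == 0 {
    local ++n
    if `n' > _N {
        set obs `n'
    }
    * macval() stops $ and ` characters in the HTML from being expanded
    * as macros. In Stata 12 a string variable holds at most 244
    * characters, so longer lines are truncated.
    replace text = `"`macval(line)'"' in `n'
    file read `fh' line
}
file close `fh'
```

From there, URLs, project names and project numbers could be pulled out of
text with string functions such as strpos() and substr(), or with regexm()
and regexs(). Lines containing unusual quote sequences may still need
special handling.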
Nick
[email protected]


On 5 February 2014 13:30, Lucas Ferreira Mation <[email protected]> wrote:
> Thank you Nick.
> Copy and paste won't work. I did not explain this in the original email,
> but the page below is just one of many pages. I'm actually doing this
> in a loop over hundreds of pages, a sort of web scraping.
>
> After importing, for each page I extract the URLs of the projects, the
> project names and the project numbers, throwing away the HTML tags and
> everything else.
>
> The problem I'm having is importing the data.
> How does "insheet" (or the web browser, for that matter) know what to
> interpret as a line break in an HTML file?
>
>
>
>
>
> On Wed, Feb 5, 2014 at 11:07 AM, Nick Cox <[email protected]> wrote:
>> I don't think there can be a single proper way to import HTML files,
>> as HTML is a mark-up language, not a file format defining a
>> Stata-compatible data file.
>>
>> In the example you give there is just a list of projects. Is that the
>> data? If it is copy-and-paste from what you see in the browser into
>> Stata's editor gives a good start, after which you just -drop-
>> unwanted lines. I don't see that you want to import the mark-up at
>> all.
>> Nick
>> [email protected]
>>
>>
>> On 5 February 2014 12:38, Lucas Ferreira Mation <[email protected]> wrote:
>>> Hello,
>>>
>>> I'm trying to import data from a web page. From previous posts, I saw
>>> there are two ways to import from HTML, "insheet" or "infile"
>>> (sometimes preceded by "copy" and then "filefilter" to filter line
>>> breaks and unwanted HTML tags). I tried both ways:
>>>
>>> . version 12.1 // stata12.1 running on a windows 7 machine
>>> . global url http://www.ipea.gov.br/portal/index.php?option=com_content&view=article&id=16643&catid=117&Itemid=5
>>> . insheet using "$url", clear
>>> . infile str244 text using "$url", clear
>>>
>>> Neither really works:
>>>
>>> infile: the imported file is all corrupt; it seems that every space was
>>> interpreted as a line break. Can I solve this with filefilter?
>>>
>>> insheet: line breaks seem to be fairly OK (although not perfect in all
>>> cases), but some rows were split into different columns (I suppose
>>> the lines that had a "," in them). Is there a "never occurring
>>> delimiter" that I could use so that the lines are never split?
>>>
>>> More generally, is there a way to import from HTML so that the
>>> imported file looks just like the source code I see in the
>>> browser?
>>>
>>> tks
>>> Lucas
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>


