
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: import html , what is the proper way?

From	Lucas Ferreira Mation <[email protected]>
To	statalist <[email protected]>
Subject	Re: st: import html , what is the proper way?
Date	Wed, 5 Feb 2014 11:30:49 -0200

Thank you Nick.
Copy+Paste won't work. I did not explain in the original email, but
the page below is just one of several pages. I'm actually doing
it recursively for hundreds of pages, a sort of web scraping.

After importing, for each page I extract the URLs of the projects, the
project names and the project numbers, throwing away the HTML tags and
everything else.

The problem I'm having is importing the data.
How does "insheet" (or the web browser, for that matter) know what to
interpret as a line break in an HTML file?
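
[Editorial sketch, not part of the original message.] In the HTML source, a line break is simply the end-of-line character stored in the file. -insheet- additionally splits each line at tabs or commas, and free-format -infile- treats blanks as separators between values, which matches the symptoms described. One possible way around both, assuming Stata 12.1, is to save the page locally and read each physical line as a single string with -infix-. The file name page.html and the filter on "href=" are assumptions made for illustration, and lines longer than 244 characters would be truncated.

```stata
* Sketch only: page.html is a hypothetical local file name
version 12.1
global url "http://www.ipea.gov.br/portal/index.php?option=com_content&view=article&id=16643&catid=117&Itemid=5"

* Save the page source to disk so it can be inspected and re-read
copy "$url" "page.html", replace

* Read each line of the file as one string observation;
* commas and spaces inside a line do not split it
infix str line 1-244 using "page.html", clear

* Assumption: the project links sit on lines containing href=
keep if strpos(line, "href=") > 0
```

Because -infix- reads fixed columns rather than delimited values, no "never occurring delimiter" is needed. Whether one line per link holds depends on how the page source is laid out.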





On Wed, Feb 5, 2014 at 11:07 AM, Nick Cox <[email protected]> wrote:
> I don't think there can be a single proper way to import HTML files,
> as HTML is a mark-up language, not a file format defining a
> Stata-compatible data file.
>
> In the example you give there is just a list of projects. Is that the
> data? If it is, copy-and-paste from what you see in the browser into
> Stata's editor gives a good start, after which you just -drop-
> unwanted lines. I don't see that you want to import the mark-up at
> all.
> Nick
> [email protected]
>
>
> On 5 February 2014 12:38, Lucas Ferreira Mation <[email protected]> wrote:
>> Hello,
>>
>> I'm trying to import data from a web page. From previous posts, I saw
>> there are two ways to import from HTML, "insheet" or "infile"
>> (sometimes preceded by "copy" > "filefilter" to filter breaks and
>> unwanted HTML tags). I tried both ways:
>>
>> . version 12.1 // stata12.1 running on a windows 7 machine
>> . global url http://www.ipea.gov.br/portal/index.php?option=com_content&view=article&id=16643&catid=117&Itemid=5
>> . insheet using "$url", clear
>> . infile str244 text using "$url", clear
>>
>> Neither really works:
>>
>> infile: the imported file is all corrupt; it seems that every space was
>> interpreted as a line break. Can I solve this with filefilter?
>>
>> insheet: line breaks seem to be fairly OK (although not perfect in all
>> cases), but some rows were split into different columns (I suppose
>> the lines that had a "," in them). Is there a "never occurring
>> delimiter" that I could use so the variables are never split?
>>
>> More generally, is there a way to import from HTML so that the
>> imported file looks just like the source code I see in the
>> browser?
>>
>> tks
>> Lucas
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

