Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: import html , what is the proper way?

From	Sergiy Radyakin <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: import html , what is the proper way?
Date	Wed, 5 Feb 2014 14:58:04 -0500

Lucas,
there are a few good reasons to harvest data outside Stata:

1) the page might require cookies or authentication;
2) the data might not be in HTML, but e.g. behind a JavaScript call, or in Flash;
3) the page might be in Unicode (although the data you are after might
still be just numbers);
4) etc.

In general it is an arms race between the data providers and data
scrapers, dating back to the very first email harvesters.

If the site you are working with is free of these problems, there is
no problem working in Stata. As for the particular question you asked
- use -file open- and read the file directly. Then you have full
control, and
file read fh oneline
does exactly what you want - it reads one text line from the file into
the local macro oneline. The help file for -file- has an example of
looping over all lines in the file.
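
A minimal sketch of such a loop, along the lines of the example in the
help file. The file name page.html is an assumption: it stands for a
local copy of the page, saved beforehand (for instance with -copy-).

```stata
* Sketch only: page.html is a hypothetical local copy of the web page
tempname fh
local nlines = 0
file open `fh' using "page.html", read text
file read `fh' oneline
while r(eof) == 0 {
    * the local macro oneline now holds one line of the file,
    * without its end-of-line characters
    local ++nlines
    file read `fh' oneline
}
file close `fh'
display "lines read: `nlines'"
```

Spaces inside a line are left alone here; only the end-of-line
characters separate one read from the next.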

For HTML you might need to concatenate multiple lines into one line,
to be able to match regular expressions.
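
One possible way to do the concatenation, as a rough sketch under
several assumptions: page.html is again a hypothetical local copy, the
<title> tag is only an illustrative target, and the string functions
must accept long strings (Stata 13 or later; in Stata 12 string
expressions are limited to 244 characters, so this would only work on
short pieces of text).

```stata
* Sketch only: join all lines of a hypothetical page.html into one
* macro, then match a regular expression against the joined text
tempname fh
local page ""
file open `fh' using "page.html", read text
file read `fh' oneline
while r(eof) == 0 {
    * macval() stops Stata from expanding any $ or left quote
    * that occurs in the HTML itself
    local page `"`macval(page)' `macval(oneline)'"'
    file read `fh' oneline
}
file close `fh'
if regexm(`"`macval(page)'"', "<title>(.*)</title>") {
    display regexs(1)
}
```

Lines with unbalanced double quotes may still need extra care before
they can be handled this way.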

Best, Sergiy Radyakin

On Wed, Feb 5, 2014 at 1:01 PM, Lucas Ferreira Mation
<[email protected]> wrote:
> Specific to my case: I copied the HTML code of the page to disk,
> opened it in Notepad++, and asked for it to "show all characters", so
> I can see the hidden end-of-line characters. At the end of each line of
> the HTML there is a "LF" or "CR LF".
> However when I try to open it using:
>
> infile str244 text using "$url", clear
>
> the infile command is interpreting the spaces as line breaks. Any idea
> on how to solve this?
>
>
> General comment:
> In these references (thanks, Friedrich) and others I saw, there is a notion
> that web scraping should be done outside Stata, using programs in
> Python, Ruby, etc. At least for extracting data from static pages
> with URLs that make some sense, it should be straightforward from
> within Stata:
>
> 1) understand the structure of the URLs, and loop through them.
> 2) import each page to Stata
> 3) do string matches and use regular expressions to find the data needed.
>
> If you are only familiar with Stata, it is an additional cost to learn
> a new language syntax. And it seems the only thing missing is the
> ability to import from HTML without messing up the structure of the HTML
> file too much.
>
> On Wed, Feb 5, 2014 at 11:50 AM, Nick Cox <[email protected]> wrote:
>> The extra detail you now give on your problem is indeed crucial.
>>
>> -insheet- is, as I understand it, written on the assumption that the
>> file being read is a text file in a form suitable, notably, for direct
>> import to a spreadsheet. Its working even roughly for a file with
>> mark-up present is good fortune and not built in by design. Its source
>> code is not visible to users so I can't comment on internal details.
>>
>> I'd expect to have to write a program based on -file- to handle your
>> problem. Others may have more optimistic news for you.
>> Nick
>> [email protected]
>>
>>
>> On 5 February 2014 13:30, Lucas Ferreira Mation <[email protected]> wrote:
>>> Thank you Nick.
>>> Copy+Paste won't work. I did not explain in the original email, but
>>> the page below is just one of several other pages. I'm actually doing
>>> it recursively for hundreds of pages, sort of web scraping.
>>>
>>> After importing, for each page I extract the URLs of the projects, the
>>> project names and the project numbers, throwing away the HTML tags and
>>> everything else.
>>>
>>> The problem I'm having is importing the data.
>>> How does "insheet" (or the web browser for that matter) know what to
>>> interpret as a line break in an HTML file?
>>>
>>> On Wed, Feb 5, 2014 at 11:07 AM, Nick Cox <[email protected]> wrote:
>>>> I don't think there can be a single proper way to import HTML files,
>>>> as HTML is a mark-up language, not a file format defining a
>>>> Stata-compatible data file.
>>>>
>>>> In the example you give there is just a list of projects. Is that the
>>>> data? If it is, copy-and-paste from what you see in the browser into
>>>> Stata's editor gives a good start, after which you just -drop-
>>>> unwanted lines. I don't see that you want to import the mark-up at
>>>> all.
>>>> Nick
>>>> [email protected]
>>>>
>>>>
>>>> On 5 February 2014 12:38, Lucas Ferreira Mation <[email protected]> wrote:
>>>>> Hello,
>>>>>
>>>>> I'm trying to import data from a web page. From previous posts, I saw
>>>>> there are two ways to import from HTML, "insheet" or "infile"
>>>>> (sometimes preceded by "copy" > "filefilter" to filter breaks and
>>>>> unwanted HTML tags). I tried both ways:
>>>>>
>>>>> . version 12.1 // stata12.1 running on a windows 7 machine
>>>>> . global url http://www.ipea.gov.br/portal/index.php?option=com_content&view=article&id=16643&catid=117&Itemid=5
>>>>> . insheet using "$url", clear
>>>>> . infile str244 text using "$url", clear
>>>>>
>>>>> Neither really works:
>>>>>
>>>>> infile : imported file is all corrupt, it seems that every space is
>>>>> interpreted as a line break. Can I solve this with filefilter?
>>>>>
>>>>> insheet: line breaks seem to be fairly ok (although not perfect in all
>>>>> cases), but some rows were split into different columns ( I suppose
>>>>> the lines that had a "," in them). Is there a "never occurring
>>>>> delimiter" that I could use so the variables are never split?
>>>>>
>>>>> More generally, is there a way to import from HTML so that the
>>>>> imported file looks just like the source code I see in the
>>>>> browser?
>>>>>
>>>>> tks
>>>>> Lucas
>>>>> *
>>>>> *   For searches and help try:
>>>>> *   http://www.stata.com/help.cgi?search
>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>> *   http://www.ats.ucla.edu/stat/stata/
