Re: st: reading HTML source in Chinese but get a messy code

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: reading HTML source in Chinese but get a messy code
Date	Thu, 6 Jun 2013 17:58:33 +0100

If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
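
A minimal sketch of that correction, assuming the Chinese text really does sit in lines 32 to 73 of d:\qq.txt, as stated in the original post. The variable name line is only illustrative, and the end-of-file check is an added safeguard in case the downloaded file is shorter than expected:

```stata
mata:
        fh = fopen("d:\qq.txt", "r")
        // discard lines 1 to 31 only, so that line 32 is the next one read
        for (i=1; i<=31; i++) {
                junk = fget(fh)
        }
        // display lines 32 to 73; fget() returns J(0,0,"") at end of file
        for (i=32; i<=73; i++) {
                line = fget(fh)
                if (line == J(0, 0, "")) break
                line
        }
        fclose(fh)
end
```

This only changes which lines are read. It does not address how the Chinese characters themselves are encoded or displayed.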

A more fundamental point is that this is HTML:

1. Many if not all lines will necessarily include HTML markup. You
will need to strip that too, or interpret it.

2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
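
As a rough illustration of the stripping mentioned in point 1, here is a sketch in Mata. The function strip_tags() is a hypothetical helper, not a built-in, and the sample string is invented for the example. It assumes every tag opens and closes on the same line, which, as point 2 warns, need not hold for real pages:

```stata
mata:
// strip_tags() is a hypothetical helper, not part of Mata
string scalar strip_tags(string scalar s)
{
        string scalar t

        // work on a copy so that the caller's string is left unchanged
        t = s
        // regexr() replaces only the first match, so repeat until no tag is left
        while (regexm(t, "<[^>]*>")) {
                t = regexr(t, "<[^>]*>", "")
        }
        return(strtrim(t))
}

// made-up sample line, for illustration only
strip_tags("<td class='a'>some text</td>")
end
```

This removes anything between < and > on a single line and nothing else. Character entities such as &nbsp; are left as they are, and a tag split across two lines would not be matched.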

In this particular case, there are many references to yet other files,
perhaps not of concern to you.

I can't read Chinese, so that is as far as I can go.

Nick
[email protected]


On 6 June 2013 14:36, Li Chuntao (Tony) <[email protected]> wrote:
> Dear Listers,
>
>        I want to import the following HTML source files:
>
>         http://qq.ico.la/qq459322466.html
>
>         The source file contains some information in Chinese, which is
> located in lines 32 to 73.
>
>          I tried to import the information by using the following code:
>
> clear all
> set obs 500
> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>
> mata:
>         fh = fopen("d:\qq.txt", "r")
>         for(i=1; i<=34; i++) {
>         junk=fget(fh)
>         }
>         for(i=1; i<=20; i++) {
>         junk=fget(fh)
>         junk
>         }
>
> end
>
> but the resulting data in memory are just a mess.
>
> Similar code has been used for another webpage, thanks to Prof. Kit
> Baum, as shown below:
>
> clear all
> set obs 500
> local stkcd="000002"
> gen str20 date="2012.12.31"
> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" ///
>         d:\date.txt, replace
> mata:
>         fh = fopen("d:\date.txt", "r")
>         for(i=1; i<=444; i++) {
>         junk=fget(fh)
>         }
>
> Can someone familiar with Chinese encoding give me some hints?
>
> Best
>
> Chuntao
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: reading HTML source in Chinese but get a messy code
  - From: Sergiy Radyakin <[email protected]>

References:
- st: reading HTML source in Chinese but get a messy code
  - From: "Li Chuntao (Tony)" <[email protected]>