Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: reading HTML source in Chinese but get a messy code |
Date | Thu, 6 Jun 2013 17:58:33 +0100 |
If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!

A more fundamental point is that this is HTML:

1. Many if not all lines will necessarily include HTML markup. You will need to strip that too, or interpret it.

2. Markup won't necessarily be interpretable if you ignore previous lines.

In this particular case, there are many references to yet other files, perhaps not of concern to you.

I can't read Chinese, so that is as far as I go.

Nick
njcoxstata@gmail.com

On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
> Dear Listers,
>
> I want to import the following HTML source file:
>
> http://qq.ico.la/qq459322466.html
>
> The source file contains some information in Chinese, which is
> located in lines 32 to 73.
>
> I tried to import the information by using the following code:
>
> clear all
> set obs 500
> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>
> mata:
> fh = fopen("d:\qq.txt", "r")
> for(i=1; i<=34; i++) {
> junk=fget(fh)
> }
> for(i=; i<=20; i++) {
> junk=fget(fh)
> junk
> }
>
> end
>
> but the resulting data in memory is only a mess.
>
> Similar code has been used for other webpages, thanks to Prof. Kit
> Baum, as can be seen in the following:
>
> clear all
> set obs 500
> local stkcd="000002"
> gen str20 date="2012.12.31"
> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31"
> d:\date.txt, replace
> mata:
> fh = fopen("d:\date.txt", "r")
> for(i=1; i<=444; i++) {
> junk=fget(fh)
> }
>
> Can someone familiar with Chinese encoding give me some hints?
>
> Best
>
> Chuntao
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
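
The two points in the reply (skip only the junk lines, and deal with the markup) can be sketched in Mata roughly as follows. This is a minimal sketch under several assumptions, not a tested solution for this page. The function name read_html_lines() is hypothetical. The encoding "gb18030" is a guess, since the thread never states how the page is encoded. And ustrfrom() and ustrregexra() require Stata 14 or later, which postdates this thread.

```stata
mata:
// Hypothetical helper: return lines first..last of a text file,
// converted from an assumed source encoding to UTF-8, with HTML tags removed.
string colvector read_html_lines(string scalar path, real scalar first, real scalar last, string scalar enc)
{
    real scalar fh, i
    string matrix line          // matrix, so it can hold the 0 x 0 EOF value
    string colvector out

    out = J(0, 1, "")
    fh  = fopen(path, "r")
    for (i = 1; i <= last; i++) {
        line = fget(fh)
        if (line == J(0, 0, "")) break          // end of file reached early
        if (i >= first) {                       // lines before `first' are skipped
            line = ustrfrom(line, enc, 1)       // assumed encoding -> UTF-8
            line = ustrregexra(line, "<[^>]*>", "")   // drop tags within the line
            out  = out \ strtrim(line)
        }
    }
    fclose(fh)
    return(out)
}

// Lines 32 to 73 are where the poster says the Chinese text sits.
read_html_lines("d:\qq.txt", 32, 73, "gb18030")
end
```

The regular expression only removes tags that open and close on the same line, so a tag that spans lines would survive. That is one instance of the second point in the reply: markup cannot always be interpreted line by line.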