Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

From: "Li Chuntao (Tony)" <leechtcn@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: reading HTML source in Chinese but get a messy code

Date: Sat, 8 Jun 2013 22:31:09 +0800

Thank you, Sergiy and Nick. You helped me out!

Chuntao

On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
> Chuntao,
>
> adding to Nick's comments, you don't have to parse HTML code yourself
> as this is a pretty standard task. For your purposes the following
> should yield a pretty clean file:
> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>
> where you supply your URL as a parameter.
>
> Best, Sergiy Radyakin
>
>
> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>
>> A more fundamental point is that this is HTML:
>>
>> 1. So, lines will necessarily include HTML markup code in many if not
>> all lines. You will need to strip those too, or interpret them.
>>
>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>
>> In this particular case, there are many references to yet other files,
>> perhaps not of concern to you.
>>
>> I can't read Chinese, so that is as far as I go.
>>
>> Nick
>> njcoxstata@gmail.com
>>
>>
>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>> Dear Listers,
>>>
>>> I want to import the following HTML source file:
>>>
>>> http://qq.ico.la/qq459322466.html
>>>
>>> The source file contains some information in Chinese, which is
>>> located in lines 32 to 73.
>>>
>>> I tried to import the information by using the following code:
>>>
>>> clear all
>>> set obs 500
>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>
>>> mata:
>>> fh = fopen("d:\qq.txt", "r")
>>> for(i=1; i<=34; i++) {
>>> junk=fget(fh)
>>> }
>>> for(i=; i<=20; i++) {
>>> junk=fget(fh)
>>> junk
>>> }
>>>
>>> end
>>>
>>> but the resulting data in memory is only a mess.
>>>
>>> Similar code has been used for other webpages, thanks to Prof. Kit
>>> Baum, as can be seen in the following:
>>>
>>> clear all
>>> set obs 500
>>> local stkcd="000002"
>>> gen str20 date="2012.12.31"
>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>> mata:
>>> fh = fopen("d:\date.txt", "r")
>>> for(i=1; i<=444; i++) {
>>> junk=fget(fh)
>>> }
>>>
>>> Can someone familiar with Chinese encoding give me some hints?
>>>
>>> Best
>>>
>>> Chuntao

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/

**References**:

- **st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Sergiy Radyakin <serjradyakin@gmail.com>
