Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: "Li Chuntao (Tony)" <leechtcn@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: reading HTML source in Chinese but get a messy code

Date: Sat, 8 Jun 2013 23:04:49 +0800

Yes, of course I understand my own code. Here I just wanted to display the first two lines to show that the output is a mess, and to ask for help.

Thank you, Nick, for your always kind help.

Tony

On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> Your own code doesn't seem well matched to the input. In your first
> post you were looping over the lines of the file, reading them one by
> one and then processing them. You have abandoned that here. Do you
> understand what the original Mata code does?
> Nick
> njcoxstata@gmail.com
>
> On 8 June 2013 15:51, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>> Well, it still does not work, as can be seen from the output of the
>> following code:
>>
>> I mean, the output from http://html2text.theinfo.org seems quite
>> clean, but it turns into a mess when I try to read it into Stata,
>> whether by insheet using or by the Mata code that follows.
>>
>> Does anyone have such an experience?
>>
>> thanks
>>
>> Chuntao
>>
>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" d:\temp.txt, replace
>> mata:
>> fh = fopen("d:\temp.txt", "r")
>> junk=fget(fh)
>> junk
>> junk=fget(fh)
>> junk
>>
>> }
>>
>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>>> Chuntao,
>>>
>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>> as this is a pretty standard task. For your purposes the following
>>> should yield a pretty clean file:
>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>
>>> where you supply your URL as a parameter.
>>>
>>> Best, Sergiy Radyakin
>>>
>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>
>>>> A more fundamental point is that this is HTML:
>>>>
>>>> 1. So, lines will necessarily include HTML markup code in many if not
>>>> all lines. You will need to strip those too, or interpret them.
>>>>
>>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>>
>>>> In this particular case, there are many references to yet other files,
>>>> perhaps not of concern to you.
>>>>
>>>> I can't read Chinese, so that is as far as I go.
>>>>
>>>> Nick
>>>> njcoxstata@gmail.com
>>>>
>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>> Dear Listers,
>>>>>
>>>>> I want to import the following HTML source file:
>>>>>
>>>>> http://qq.ico.la/qq459322466.html
>>>>>
>>>>> The source file contains some information in Chinese, which is
>>>>> located in lines 32 to 73.
>>>>>
>>>>> I tried to import the information by using the following code:
>>>>>
>>>>> clear all
>>>>> set obs 500
>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>
>>>>> mata:
>>>>> fh = fopen("d:\qq.txt", "r")
>>>>> for(i=1; i<=34; i++) {
>>>>> junk=fget(fh)
>>>>> }
>>>>> for(i=; i<=20; i++) {
>>>>> junk=fget(fh)
>>>>> junk
>>>>> }
>>>>>
>>>>> end
>>>>>
>>>>> but the resulting data in memory is only a mess.
>>>>>
>>>>> Similar code has been used for another webpage, thanks to Prof. Kit
>>>>> Baum, as can be seen in the following:
>>>>>
>>>>> clear all
>>>>> set obs 500
>>>>> local stkcd="000002"
>>>>> gen str20 date="2012.12.31"
>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>>>> mata:
>>>>> fh = fopen("d:\date.txt", "r")
>>>>> for(i=1; i<=444; i++) {
>>>>> junk=fget(fh)
>>>>> }
>>>>>
>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>
>>>>> Best
>>>>>
>>>>> Chuntao

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
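
The line-by-line approach discussed in this thread (read each line in turn, then strip the HTML markup from it) might be sketched in Mata roughly as follows. This is only an illustration under stated assumptions: the path `d:\temp.txt` is the one used in the thread, and the pattern `<[^>]*>` is a simplified stand-in for "an HTML tag" that will miss tags spanning several lines, which is the difficulty raised in point 2 above. It does nothing about character encoding, so Chinese text may still display as garbage if Stata does not understand the file's encoding.

```stata
mata:
// Open the downloaded file for reading (path taken from the thread)
fh = fopen("d:\temp.txt", "r")

// fget() returns J(0,0,"") at end of file, so loop until then
while ((line = fget(fh)) != J(0, 0, "")) {
    // regexr() replaces only the first match, so repeat
    // until no tag-like substring remains in this line
    while (regexm(line, "<[^>]*>")) {
        line = regexr(line, "<[^>]*>", "")
    }
    // show only lines that still contain something after stripping
    if (strtrim(line) != "") printf("%s\n", line)
}

fclose(fh)
end
```

Reading to end of file rather than for a fixed number of lines avoids having to count how many lines of junk precede the content.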

**Follow-Ups**:

- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Nick Cox <njcoxstata@gmail.com>

**References**:

- **st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Sergiy Radyakin <serjradyakin@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Nick Cox <njcoxstata@gmail.com>
