Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: "Li Chuntao (Tony)" <leechtcn@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: reading HTML source in Chinese but get a messy code
Date: Sat, 8 Jun 2013 23:20:42 +0800

Dear Prof. Nick,

Lines 2 to 9 are what I want, from the page
http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html

Thanks,
Tony

On Sat, Jun 8, 2013 at 11:08 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> OK.
>
> Looking at the file in a text editor shows that alternate lines are
> blank. I don't know which lines are data for you.
> Nick
> njcoxstata@gmail.com
>
>
> On 8 June 2013 16:04, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>> Yes, of course I understand my own code. Here I just want to display
>> the first two lines to show that the output is a mess, and to ask
>> for help.
>>
>> Thank you, Nick, for your always kind help.
>>
>> Tony
>>
>>
>>
>> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> Your own code doesn't seem well matched to the input. In your first
>>> post you were looping over the lines of the file, reading them one by
>>> one and then processing them. You have abandoned that here. Do you
>>> understand what the original Mata code does?
>>> Nick
>>> njcoxstata@gmail.com
>>>
>>>
>>> On 8 June 2013 15:51, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>> Well, it still does not work, as can be seen from the output of the
>>>> following code.
>>>>
>>>> I mean, the output from http://html2text.theinfo.org seems quite
>>>> clean, but it turns into a mess when I try to read it into Stata,
>>>> whether by insheet using or by the Mata code that follows.
>>>>
>>>> Does anyone have experience with this?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Chuntao
>>>>
>>>>
>>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" d:\temp.txt, replace
>>>> mata:
>>>> fh = fopen("d:\temp.txt", "r")
>>>> junk=fget(fh)
>>>> junk
>>>> junk=fget(fh)
>>>> junk
>>>> fclose(fh)
>>>> end
>>>>
>>>>
>>>>
>>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>>>>> Chuntao,
>>>>>
>>>>> Adding to Nick's comments, you don't have to parse HTML code yourself,
>>>>> as this is a pretty standard task. For your purposes the following
>>>>> should yield a pretty clean file:
>>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>>
>>>>> where you supply your URL as a parameter.
>>>>>
>>>>> Best, Sergiy Radyakin
>>>>>
>>>>>
>>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>>
>>>>>> A more fundamental point is that this is HTML:
>>>>>>
>>>>>> 1. So, many if not all lines will necessarily include HTML markup
>>>>>> code. You will need to strip that too, or interpret it.
>>>>>>
>>>>>> 2. Markup code won't necessarily be interpretable if you ignore previous lines.
>>>>>>
>>>>>> In this particular case, there are many references to yet other files,
>>>>>> perhaps not of concern to you.
>>>>>>
>>>>>> I can't read Chinese, so that is as far as I go.
>>>>>>
>>>>>> Nick
>>>>>> njcoxstata@gmail.com
>>>>>>
>>>>>>
>>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>>> Dear Listers,
>>>>>>>
>>>>>>> I want to import the following HTML source file:
>>>>>>>
>>>>>>> http://qq.ico.la/qq459322466.html
>>>>>>>
>>>>>>> The source file contains some information in Chinese, which is
>>>>>>> located in lines 32 to 73.
>>>>>>>
>>>>>>> I tried to import the information using the following code:
>>>>>>>
>>>>>>> clear all
>>>>>>> set obs 500
>>>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>>>
>>>>>>> mata:
>>>>>>> fh = fopen("d:\qq.txt", "r")
>>>>>>> for(i=1; i<=34; i++) {
>>>>>>> junk=fget(fh)
>>>>>>> }
>>>>>>> for(i=1; i<=20; i++) {
>>>>>>> junk=fget(fh)
>>>>>>> junk
>>>>>>> }
>>>>>>>
>>>>>>> end
>>>>>>>
>>>>>>> but the resulting data in memory is just a mess.
>>>>>>>
>>>>>>> Similar code has been used for another webpage, thanks to Prof. Kit
>>>>>>> Baum, as can be seen in the following:
>>>>>>>
>>>>>>> clear all
>>>>>>> set obs 500
>>>>>>> local stkcd="000002"
>>>>>>> gen str20 date="2012.12.31"
>>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>>>>>> mata:
>>>>>>> fh = fopen("d:\date.txt", "r")
>>>>>>> for(i=1; i<=444; i++) {
>>>>>>> junk=fget(fh)
>>>>>>> }
>>>>>>>
>>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Chuntao
>>>>>>> *
>>>>>>> * For searches and help try:
>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>> * http://www.ats.ucla.edu/stat/stata/
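
The line-by-line approach Nick refers to in the thread might look roughly like the following. This is only a minimal sketch, assuming the -copy- step quoted in the thread has already written d:\temp.txt, and that the wanted text sits in lines 2 to 9 as Tony says; the variable names n and line are illustrative. It skips the blank alternate lines, but it does not convert the character encoding, so Chinese text may still display incorrectly if the file's encoding differs from the one Stata expects.

```stata
mata:
// Assumption: d:\temp.txt already exists (written by the -copy- command in the thread).
fh = fopen("d:\temp.txt", "r")
n = 0
// fget() returns J(0,0,"") at end of file, so read until that happens.
while ((line = fget(fh)) != J(0, 0, "")) {
    n++
    // Show only nonblank lines in the range 2 to 9 (the range stated in the thread).
    if (n >= 2 & n <= 9 & strtrim(line) != "") {
        printf("%g: %s\n", n, line)
    }
}
fclose(fh)
end
```

Printing the line number next to each line makes it easier to check which physical lines of the file actually hold the data before deciding how many to skip.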

