Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Sergiy Radyakin <serjradyakin@gmail.com>

To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>

Subject: Re: st: reading HTML source in Chinese but get a messy code

Date: Mon, 10 Jun 2013 01:06:37 -0400

Tony, if your choice of package is based solely on whether it supports Unicode, I would probably recommend Microsoft's Excel or OpenOffice's Calc. However, since you are in this forum, you probably intend to do some statistical processing of that information. In that case, what is that analysis? In many cases you don't actually need to see the text, but can just rely on the package to handle it. If you don't want to specify, consider SPSS or SAS, which (according to the manufacturers) both support Unicode. I have also asked whether the new Stata 13 supports Unicode and hope for the best.

If you want to harvest information about the user profiles, you will need to check with the site owner whether this is permitted; if you have valid scientific needs, the owner might simply pass that information to you in an organized way. From what appears on this page you can pull some information like ID, age, gender, phone, and province; the rest (name, address) is hardly of any value for statistical processing.

Best, Sergiy

On Sat, Jun 8, 2013 at 9:53 PM, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
> Dear Sergiy,
>
> Thank you for your advice. Actually I need the whole lines of
> information from lines 2 to 9. Maybe Stata just cannot handle it because
> of the Unicode problem. If you know of any package that can do it, please
> advise.
>
> Thanks again
>
> Tony
>
>
> On Sun, Jun 9, 2013 at 12:09 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>> Tony, after visiting the link I see in lines 2-9 characters in
>> Chinese. Stata will not show you these characters because Stata does
>> not work with Unicode. To see the file through Stata's eyes, go to the
>> link you posted in Firefox, then go to the menu View-->Character
>> Encoding-->More Encodings-->West European-->Western (Windows-1252).
>> This is what you can import into Stata and, yes, it does look messy.
>> This is the best you can get with it.
>> The good thing is that if you
>> process your data in Stata and then output the same messy text you
>> will end up with a very readable text, but readable elsewhere (e.g. in
>> Notepad or a browser). To cut it short, if your analysis requires e.g.
>> a search for a substring in a text, you might do it by searching for
>> byte sequences, and those sequences would not look intuitive at all.
>> But if it is something more involved then you might want to rethink
>> the choice of a package to do it. Perhaps if you describe the broad
>> goal of what you are doing it would be easier to advise.
>> Best, Sergiy
>>
>> On Sat, Jun 8, 2013 at 11:20 AM, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>> Dear Prof. Nick,
>>>
>>> Lines 2 to 9 are what I want, from the page
>>> http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html
>>>
>>> Thanks
>>>
>>> Tony
>>>
>>>
>>> On Sat, Jun 8, 2013 at 11:08 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> OK.
>>>>
>>>> Looking at the file in a text editor shows that alternate lines are
>>>> blank. I don't know which lines are data for you.
>>>> Nick
>>>> njcoxstata@gmail.com
>>>>
>>>>
>>>> On 8 June 2013 16:04, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>> Yes, of course I understand my own code. Here I just want to display
>>>>> the first two lines to show that the output is a mess, and to seek
>>>>> help.
>>>>>
>>>>> Thank you, Nick, for your always kind help.
>>>>>
>>>>> Tony
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>> Your own code doesn't seem well matched to the input. In your first
>>>>>> post you were looping over the lines of the file, reading them one by
>>>>>> one and then processing them. You have abandoned that here. Do you
>>>>>> understand what the original Mata code does?
>>>>>> Nick
>>>>>> njcoxstata@gmail.com
>>>>>>
>>>>>>
>>>>>> On 8 June 2013 15:51, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>>> Well, it still does not work, as can be seen from the output of the
>>>>>>> following code.
>>>>>>>
>>>>>>> I mean, the output from http://html2text.theinfo.org seems quite
>>>>>>> clean, but it turns into a mess when I try to read it into Stata,
>>>>>>> whether by insheet using or by the Mata code that follows.
>>>>>>>
>>>>>>> Does anyone have experience with this?
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Chuntao
>>>>>>>
>>>>>>>
>>>>>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html"
>>>>>>> d:\temp.txt, replace
>>>>>>> mata:
>>>>>>> fh = fopen("d:\temp.txt", "r")
>>>>>>> junk=fget(fh)
>>>>>>> junk
>>>>>>> junk=fget(fh)
>>>>>>> junk
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>>>>>>>> Chuntao,
>>>>>>>>
>>>>>>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>>>>>>> as this is a pretty standard task. For your purposes the following
>>>>>>>> should yield a pretty clean file:
>>>>>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>>>>>
>>>>>>>> where you supply your URL as a parameter.
>>>>>>>>
>>>>>>>> Best, Sergiy Radyakin
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>>>>>
>>>>>>>>> A more fundamental point is that this is HTML:
>>>>>>>>>
>>>>>>>>> 1. So, lines will necessarily include HTML markup code in many if not
>>>>>>>>> all lines. You will need to strip those too, or interpret them.
>>>>>>>>>
>>>>>>>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>>>>>>>
>>>>>>>>> In this particular case, there are many references to yet other files,
>>>>>>>>> perhaps not of concern to you.
>>>>>>>>>
>>>>>>>>> I can't read Chinese, so that is as far as I go.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>> njcoxstata@gmail.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>>>>>> Dear Listers,
>>>>>>>>>>
>>>>>>>>>> I want to import the following HTML source file:
>>>>>>>>>>
>>>>>>>>>> http://qq.ico.la/qq459322466.html
>>>>>>>>>>
>>>>>>>>>> The source file contains some information in Chinese, which is
>>>>>>>>>> located in lines 32 to 73.
>>>>>>>>>>
>>>>>>>>>> I tried to import the information by using the following code:
>>>>>>>>>>
>>>>>>>>>> clear all
>>>>>>>>>> set obs 500
>>>>>>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>>>>>>
>>>>>>>>>> mata:
>>>>>>>>>> fh = fopen("d:\qq.txt", "r")
>>>>>>>>>> for(i=1; i<=34; i++) {
>>>>>>>>>> junk=fget(fh)
>>>>>>>>>> }
>>>>>>>>>> for(i=; i<=20; i++) {
>>>>>>>>>> junk=fget(fh)
>>>>>>>>>> junk
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> end
>>>>>>>>>>
>>>>>>>>>> but the resulting data in memory is only a mess.
>>>>>>>>>>
>>>>>>>>>> Similar code has been used for another webpage, thanks to Prof. Kit
>>>>>>>>>> Baum, as can be seen in the following:
>>>>>>>>>>
>>>>>>>>>> clear all
>>>>>>>>>> set obs 500
>>>>>>>>>> local stkcd="000002"
>>>>>>>>>> gen str20 date="2012.12.31"
>>>>>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31"
>>>>>>>>>> d:\date.txt, replace
>>>>>>>>>> mata:
>>>>>>>>>> fh = fopen("d:\date.txt", "r")
>>>>>>>>>> for(i=1; i<=444; i++) {
>>>>>>>>>> junk=fget(fh)
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>>
>>>>>>>>>> Chuntao
>>>>>>>>>> *
>>>>>>>>>> * For searches and help try:
>>>>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>> * http://www.ats.ucla.edu/stat/stata/
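The byte-level behaviour Sergiy describes in the thread (messy text on screen, readable text once the same bytes are written back out, and substring search by byte sequence) can be illustrated outside Stata. What follows is only a minimal Python sketch under stated assumptions: the thread does not say which encoding the page uses, so UTF-8 is assumed; the sample string is hypothetical and is not taken from the page; and `latin-1` stands in for Windows-1252 because it maps every byte to a character, whereas Python's `cp1252` codec rejects a few byte values.

```python
# Hypothetical sample text ("gender: male"); NOT taken from the page in the thread.
original = "性别: 男"

# Assumption: the file on disk is UTF-8. These are the bytes a
# byte-oriented package would actually read.
raw = original.encode("utf-8")

# Viewing those bytes one character per byte gives the "messy" text.
# latin-1 is used as a stand-in for Windows-1252 (see the note above).
messy = raw.decode("latin-1")

# Writing the messy text back out byte for byte restores the original,
# which is why the output is readable in an editor or browser.
round_trip = messy.encode("latin-1").decode("utf-8")

# Searching for a substring means searching for its byte sequence,
# which looks unintuitive in the one-byte-per-character view.
needle = "男".encode("utf-8").decode("latin-1")
found_at = messy.find(needle)

print(repr(messy))
print(round_trip)
print(found_at)
```

The design point is that nothing is lost as long as the bytes pass through unchanged; only the display is wrong. Any step that alters individual bytes (for example, case conversion or trimming by character count) could break the round trip.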

**Follow-Ups**:

- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Sergiy Radyakin <serjradyakin@gmail.com>

**References**:

- **st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Sergiy Radyakin <serjradyakin@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* Sergiy Radyakin <serjradyakin@gmail.com>
- **Re: st: reading HTML source in Chinese but get a messy code** *From:* "Li Chuntao (Tony)" <leechtcn@gmail.com>
