Sergiy Radyakin <serjradyakin@gmail.com>

"statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>

Re: st: reading HTML source in Chinese but get a messy code

Mon, 10 Jun 2013 02:14:04 -0400

following statement in the documentation: "str# variables require # bytes per observation".

Sergiy.

On Mon, Jun 10, 2013 at 1:06 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
> Tony,
>
> if your choice of package is based solely on whether it supports
> unicode or not, I would probably recommend Microsoft's Excel or
> OpenOffice's Calc. However since you are in this forum, you probably
> intend to do some statistical processing of that information. In that
> case what is that analysis? In many cases you actually don't need to
> see the text, but just rely on the package to handle it. If you don't
> want to specify, consider SPSS or SAS which (according to the
> manufacturers) both support unicode. I have also asked if new Stata 13
> supports unicode and hope for the best. If you want to harvest
> information about the user profiles, you will need to check with the
> site owner whether this would be permitted, and if you have valid
> scientific needs to do it, perhaps, the owner might simply pass that
> information to you in an organized way. From what it appears on this
> page you can pull some information like ID, age, gender, phone, and
> province, and the rest (name, address) is hardly of any value for
> statistical processing.
>
> Best, Sergiy
>
> On Sat, Jun 8, 2013 at 9:53 PM, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>> Dear Sergiy,
>>
>> Thank you for your advice. Actually I need the whole lines of
>> information from Line 2~9. Maybe Stata just cannot handle it because
>> of the unicode problem. If you know any package that can do it, please
>> advise.
>>
>> thanks again
>>
>> Tony
>>
>>
>> On Sun, Jun 9, 2013 at 12:09 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>>> Tony, after visiting the link I see in lines 2-9 characters in
>>> Chinese. Stata will not show you these characters because Stata does
>>> not work with unicode.
>>> To see the file through Stata's eyes, go to the
>>> link you posted in FireFox, then go to the menu View-->Character
>>> Encoding-->More Encodings-->West European-->Western(Windows-1252).
>>> This is what you can import into Stata and, yes, it does look messy.
>>> This is the best you can get with it. The good thing is that if you
>>> process your data in Stata and then output the same messy text you
>>> will end up with a very readable text, but readable elsewhere (e.g. in
>>> notepad or a browser). To cut it short, if your analysis requires e.g.
>>> search of a substring in a text - you might do it by searching for
>>> byte sequences, and those sequences would not look intuitive at all.
>>> But if it is something more involved then you might want to rethink
>>> the choice of a package to do it. Perhaps if you describe the broad
>>> goal of what you are doing it would be easier to advise.
>>> Best, Sergiy
>>>
>>> On Sat, Jun 8, 2013 at 11:20 AM, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>> Dear Prof. Nick,
>>>>
>>>> Lines 2 to 9 are what I want, from the page of
>>>> http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html
>>>>
>>>> thanks
>>>>
>>>> Tony
>>>>
>>>>
>>>> On Sat, Jun 8, 2013 at 11:08 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>> OK.
>>>>>
>>>>> Looking at the file in a text editor shows that alternate lines are
>>>>> blank. I don't know which lines are data for you.
>>>>> Nick
>>>>> njcoxstata@gmail.com
>>>>>
>>>>>
>>>>> On 8 June 2013 16:04, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>> Yes, of course I understand my own code. Here I just want to display
>>>>>> the first two lines to show that the output is messy, and to ask
>>>>>> for help.
>>>>>>
>>>>>> Thank you, Nick, for your always kind helpfulness
>>>>>>
>>>>>> Tony
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>>> Your own code doesn't seem well matched to the input.
>>>>>>> In your first
>>>>>>> post you were looping over the lines of the file, reading them one by
>>>>>>> one and then processing them. You have abandoned that here. Do you
>>>>>>> understand what the original Mata code does?
>>>>>>> Nick
>>>>>>> njcoxstata@gmail.com
>>>>>>>
>>>>>>>
>>>>>>> On 8 June 2013 15:51, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>>>> Well, it still does not work, as can be seen from the output of the
>>>>>>>> following code:
>>>>>>>>
>>>>>>>> I mean, the output from http://html2text.theinfo.org seems quite
>>>>>>>> clean, but it turns into a mess when I try to read it into Stata,
>>>>>>>> whether by insheet using or by the Mata code that follows.
>>>>>>>>
>>>>>>>> Does anyone have such an experience?
>>>>>>>>
>>>>>>>>
>>>>>>>> thanks
>>>>>>>>
>>>>>>>> Chuntao
>>>>>>>>
>>>>>>>>
>>>>>>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" d:\temp.txt, replace
>>>>>>>> mata:
>>>>>>>> fh = fopen("d:\temp.txt", "r")
>>>>>>>> junk=fget(fh)
>>>>>>>> junk
>>>>>>>> junk=fget(fh)
>>>>>>>> junk
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>>>>>>>>> Chuntao,
>>>>>>>>>
>>>>>>>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>>>>>>>> as this is a pretty standard task. For your purposes the following
>>>>>>>>> should yield a pretty clean file:
>>>>>>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>>>>>>
>>>>>>>>> where you supply your URL as a parameter.
>>>>>>>>>
>>>>>>>>> Best, Sergiy Radyakin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>>>>>>
>>>>>>>>>> A more fundamental point is that this is HTML:
>>>>>>>>>>
>>>>>>>>>> 1. So, lines will necessarily include HTML markup code in many if not
>>>>>>>>>> all lines.
>>>>>>>>>> You will need to strip those too, or interpret them.
>>>>>>>>>>
>>>>>>>>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>>>>>>>>
>>>>>>>>>> In this particular case, there are many references to yet other files,
>>>>>>>>>> perhaps not of concern to you.
>>>>>>>>>>
>>>>>>>>>> I can't read Chinese, so that is as far as I go.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> njcoxstata@gmail.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>>>>>>> Dear Listers,
>>>>>>>>>>>
>>>>>>>>>>> I want to import the following HTML source file:
>>>>>>>>>>>
>>>>>>>>>>> http://qq.ico.la/qq459322466.html
>>>>>>>>>>>
>>>>>>>>>>> The source file contains some information in Chinese, which is
>>>>>>>>>>> located in lines 32 to 73.
>>>>>>>>>>>
>>>>>>>>>>> I tried to import the information by using the following code:
>>>>>>>>>>>
>>>>>>>>>>> clear all
>>>>>>>>>>> set obs 500
>>>>>>>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>>>>>>>
>>>>>>>>>>> mata:
>>>>>>>>>>> fh = fopen("d:\qq.txt", "r")
>>>>>>>>>>> for(i=1; i<=34; i++) {
>>>>>>>>>>> junk=fget(fh)
>>>>>>>>>>> }
>>>>>>>>>>> for(i=; i<=20; i++) {
>>>>>>>>>>> junk=fget(fh)
>>>>>>>>>>> junk
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> end
>>>>>>>>>>>
>>>>>>>>>>> but the resulting data in memory is only a mess.
>>>>>>>>>>>
>>>>>>>>>>> Similar code has been used for another webpage, thanks to Prof. Kit
>>>>>>>>>>> Baum, as can be seen in the following:
>>>>>>>>>>>
>>>>>>>>>>> clear all
>>>>>>>>>>> set obs 500
>>>>>>>>>>> local stkcd="000002"
>>>>>>>>>>> gen str20 date="2012.12.31"
>>>>>>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>>>>>>>>>> mata:
>>>>>>>>>>> fh = fopen("d:\date.txt", "r")
>>>>>>>>>>> for(i=1; i<=444; i++) {
>>>>>>>>>>> junk=fget(fh)
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>>
>>>>>>>>>>> Chuntao
>>>>>>>>>>> *
>>>>>>>>>>> * For searches and help try:
>>>>>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>>> * http://www.ats.ucla.edu/stat/stata/
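
The round trip described in the thread (multi-byte text looks messy when each byte is shown as one character, yet becomes readable again when the same bytes are written back out) and the byte-sequence search can be sketched outside Stata. The following is only an illustration in Python, under stated assumptions: the page is taken to be UTF-8 encoded (the thread only says "unicode"), the sample string is made up and not taken from the page, and latin-1 stands in for Windows-1252 because Python's cp1252 codec rejects a few byte values that latin-1 accepts. The function names are hypothetical.

```python
# Sketch only. Assumptions: UTF-8 source text; latin-1 as a stand-in for
# Windows-1252; sample data invented for illustration.

def as_single_byte(text: str, encoding: str = "utf-8") -> str:
    """Show text the way a one-byte-per-character package would see it."""
    return text.encode(encoding).decode("latin-1")


def restore(messy: str, encoding: str = "utf-8") -> str:
    """Write the messy characters back out as bytes and decode them properly."""
    return messy.encode("latin-1").decode(encoding)


def find_bytes(messy_line: str, needle: str, encoding: str = "utf-8") -> int:
    """Search for needle by its byte sequence; 0-based position, -1 if absent."""
    return messy_line.find(as_single_byte(needle, encoding))


sample = "省份: 北京"              # hypothetical line ("province: Beijing")
messy = as_single_byte(sample)

print(ascii(messy))                # the messy form, escaped for safe printing
print(restore(messy) == sample)    # the round trip loses nothing
print(find_bytes(messy, "北京"))    # position of the substring, counted in bytes
```

The position returned by the search counts bytes rather than characters, which matches the documentation statement quoted at the top of this message that str# variables require # bytes per observation.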

