Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Buzz Burhans" <buzzb3@earthlink.net> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | st: RE: chunky errors |
Date | Sat, 20 Nov 2010 16:12:53 -0700 |
Dimitriy,

I can't help you with the -chunky- error, but I might have an alternative. I recently had to split a large csv file with 1.7 million records; I successfully used a freeware utility (standalone, not inside Stata) called "gsplit". It allowed me to split by line, so I could keep records together, and it allowed me to add a header to each split subfile, which made importing into Stata easier.

Gsplit is available for download at:
http://www.gdgsoft.com/gsplit/

I also looked at, but ended up not using, File Splitter, available as a download on CNET:
http://download.cnet.com/File-Splitter/3000-2248_4-10405033.html

Good luck,

Buzz Burhans, Ph.D.
Dairy-Tech Group
So. Albany, VT / Twin Falls ID
Cell: 208-320-0829
ID Fax: 208-735-1289
VT Fax: 802-755-6842
Email: buzzb3@earthlink.net

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Dimitriy V. Masterov
Sent: Saturday, November 20, 2010 3:41 PM
To: Statalist
Subject: st: chunky errors

I am running Stata/MP 11.1 on a 64-bit Windows 7 laptop with 8 gigs of RAM. I am trying to read in a 10.9 GB csv file by breaking it up using chunky. I am getting an error message that I can't interpret.

The file has 5,320,745 rows and 387 columns. It's got some weird ascii characters at the beginning of the first line (ï,»,¿), but I can't open the file in any text editor to edit them out (vi crashes even when I try to edit the first line with ex). I don't know if this is what's causing the problem. The first chunk file rpd_1.txt is created, but it is empty.

Here's what I am doing:

> set mem 6g;

Current memory allocation

                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.947M
    set memory         6144M    max. data space              6,144.000M
    set matsize         400     max. RHS vars in models          1.254M
                                                            -----------
                                                             6,147.201M

.
chunky using raw_price_data.csv, peek(2); Peeking at the first 2 lines of raw_price_data.csv 1003,2008-03-23,0,Disc,NULL,7.99,7.87678139240359,8.35,8.56885514506366,1 0.7974394069005,NULL,10.2566666666667,6.99,6.99,7.99,7.77333333333333,10.36 > 5,NULL,NULL,5.99,5.99,5.99,NULL,9.29,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL ,NULL,6.32666666666667,6.96559727916511,8.04833333333333,8.70765897152336,1 > 0.4373620377794,NULL,NULL,6.99,6.47333333333333,6.99,6.80999783830523,10.071 0149156939,6.19008481948673,7.61715989563921,NULL,NULL,NULL,NULL,NULL,7.99, > NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,7,7.99115044247788,NULL,11.19,N ULL,NULL,5,5,5.675,NULL,NULL,NULL,NULL,5.99,5.99,NULL,NULL,NULL,NULL,NULL,N > ULL,NULL,NULL,NULL,NULL,NULL,NULL,7.99,5.096875,NULL,NULL,7.99,NULL,NULL,NUL L,4.52849423193685,NULL,7.26876442015786,NULL,NULL,NULL,NULL,NULL,NULL,NULL > ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2.52905297740543,2.280611 20807602,2.29737689297299,NULL,4.29484076024002,NULL,NULL,4.99,NULL,0.69866 > 8055269689,1.70842964884469,NULL,1003,2008-03-23,0,Disc,NULL,2.07,2.34533333 333333,2.42321376851818,2.68166666666667,2.92333333333333,NULL,2.7948205128 > 2051,0.99,1.22441441441441,1.58435261707989,1.9752380952381,2.295,NULL,NULL, 1.056,1.415,1.305,NULL,2.07,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2. > 06415557325445,2.26902502939606,2.59914202498692,2.87889430894309,3.16111111 111111,NULL,NULL,1.11586088873419,1.29769942196532,1.38903947368421,1.65,1. 
> 645,1.57740131578947,1.53326226012793,NULL,NULL,NULL,NULL,NULL,1.56,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2.4115,2.55,NULL,2.385,NULL,NULL,0.975 > ,1.27613965826475,1.2675,NULL,NULL,NULL,NULL,1.10231092436975,1.282689075630 25,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2.0681468218 > 4423,2.44601506503719,NULL,NULL,2.41783348254252,NULL,NULL,NULL,1.2828851963 7462,NULL,1.84778700906344,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL > L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,0.749815048945067,0.646367713004484,0.5 15949492992582,NULL,2.4335277819483,NULL,NULL,2.6,NULL,0.570616883116883,1. > 11963210197913,NULL,1003,2008-03-23,0,Disc,NULL,2,15,10,6,3,NULL,3,3,3,3,3,2 ,NULL,NULL,5,2,4,NULL,1,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,3,6,12 > ,4,10,NULL,NULL,3,3,5,1,2,2,1,NULL,NULL,NULL,NULL,NULL,1,NULL,NULL,NULL,NULL ,NULL,NULL,NULL,NULL,NULL,2,1,NULL,1,NULL,NULL,2,16,4,NULL,NULL,NULL,NULL,1 > ,1,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1,5,NULL,NULL ,1,NULL,NULL,NULL,5,NULL,2,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL > L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,11,3,10,NULL,6,NULL,NULL,1,NULL,3,9,NUL L0d0a 1003,2008-03-23,0,Free,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,0,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL > ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL, NULL,0,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N > ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU > LL,NULL,NULL,NULL,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,0,NULL,0,NULL,0,NULL,NULL,NULL, > NULL,NULL,NULL,NULL,1003,2008-03-23,0,Free,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1.1,1.28,NULL,NULL,NULL,NULL,NULL > 
,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL, NULL,NULL,NULL,NULL,NULL,NULL,1.08457627118644,1.28542372881356,NULL,NULL,N > ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU > LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,0.9 9,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL > L,NULL,NULL,NULL,NULL,NULL,0.749580536912752,NULL,0.475,NULL,2.2225167785234 9,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1003,2008-03-23,0,Free,NULL,NULL,NULL, > NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1,1,NULL,NULL,NU LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU > LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1,1,NULL,NULL,NULL,NULL,NULL,NULL ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL > ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL, NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2,NULL,NULL,NULL,NUL > L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL ,NULL,6,NULL,2,NULL,1,NULL,NULL,NULL,NULL,NULL,NULL,NULL0d0a

(for reference: End of line characters odoa (CRLF) indicate Windows, oa (LF) Unix and od (CR) Mac. 09 is the TAB character.)

. chunky using raw_price_data.csv, analyze;

Analyzing raw_price_data.csv for chunking

EXTENDED ASCII is the file type
File has 5320745 lines of average length 2214 bytes
Composition is 52% letters, 28% numbers and 20% other characters

Extended characters are present and may cause problems.
Extended characters found:

+-----------------+
|ASCII |  count   |
|------+----------|
|187   |         1|
|191   |         1|
|239   |         1|
+-----------------+

Approximate chunk sizes and memory requirements for -insheet- or -infile- commands

+-----------------------------------------------------------+
|Chunksize (mb)|  Number of   |   ~Number    | Stata size*  |
|    option    |    Chunks    |  obs/chunk   | (megabytes)  |
|--------------+--------------+--------------+--------------|
|           10 |         1179 |         4513 |         10.2 |
|           30 |          393 |        13539 |         30.5 |
|          100 |          118 |        45091 |        101.5 |
|          300 |           40 |       133019 |        299.5 |
|         1000 |           12 |       443395 |998.3000000000001 |
|         3000 |            4 |      1330186 |       2994.8 |
+-----------------------------------------------------------+
* Stata file size is very approximate and depends on datatypes of variables

Further detail available by running hexdump `"raw_price_data.csv"', analyze results

. hexdump raw_price_data.csv, analyze results;

  Line-end characters                        Line length (tab=1)
    \r\n         (Windows)       5,320,745     minimum                     1,777
    \r by itself (Mac)                   0     maximum                     3,547
    \n by itself (Unix)                  0
  Space/separator characters                 Number of lines           5,320,745
    [blank]                              0   EOL at EOF?                     yes
    [tab]                                0
    [comma] (,)              2,053,807,570   Length of first 5 lines
  Control characters                           Line 1                      2,494
    binary 0                             0     Line 2                      1,945
    CTL excl. \r, \n, \t                 0     Line 3                      2,039
    DEL                                  0     Line 4                      2,620
    Extended (128-159,255)               0     Line 5                      1,920
  ASCII printable
    A-Z                      6,071,073,584
    a-z                         47,886,705   File format          EXTENDED ASCII
    0-9                      3,264,399,987
    Special (!@#$ etc.)        334,050,065
    Extended (160-254)                   3
                           ---------------
  Total                        11781859404

  Observed were:
     \n \r , - . 0 1 2 3 4 5 6 7 8 9 D E F L M N U c e i n r s u » ¿ ï

. chunky using raw_price_data.csv, chunksize(3000m) stub(rpd_) replace;

Chunking using the following settings:

  Chunksize: 3,000,000,000
  Memory:    6,442,450,944
  Bites:     1
  Bitesize:  3,000,000,000

                 fread():  3300  argument out of range
             chunkfile():     -  function returned error [46]
                 <istmt>:     -  function returned error [1]
r(3300);

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
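
For anyone who would rather script the split-by-line approach described in the reply than install a utility, here is a minimal Python sketch. It is not what gsplit or -chunky- do internally, and it does not diagnose the fread() error. It streams the file in binary mode, so only one line is in memory at a time, writes a fixed number of lines to each chunk, and optionally puts a header line at the top of every chunk. It also drops a leading UTF-8 byte-order mark if one is present; the three extended characters reported above (239, 187, 191) match that mark. The function name split_csv, its parameters, and the CRLF ending on the header are assumptions made for illustration.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # bytes 239, 187, 191; shown as "ï»¿" in Latin-1


def split_csv(src, stub, lines_per_chunk, header=None):
    """Split src into stub1.txt, stub2.txt, ... with at most
    lines_per_chunk data lines each, keeping every record whole.

    header (optional, bytes without a line ending) is written as the
    first line of every chunk. Returns the list of chunk paths.
    """
    paths = []
    out = None
    count = 0
    try:
        with open(src, "rb") as f:
            # Iterating a binary file yields one line at a time,
            # including its original \r\n or \n ending.
            for i, line in enumerate(f):
                if i == 0 and line.startswith(UTF8_BOM):
                    line = line[len(UTF8_BOM):]  # drop the byte-order mark
                if out is None or count == lines_per_chunk:
                    if out is not None:
                        out.close()
                    path = f"{stub}{len(paths) + 1}.txt"
                    out = open(path, "wb")
                    paths.append(path)
                    count = 0
                    if header is not None:
                        # Assumption: chunks should use Windows line ends.
                        out.write(header + b"\r\n")
                out.write(line)
                count += 1
    finally:
        if out is not None:
            out.close()
    return paths
```

A call such as split_csv("raw_price_data.csv", "rpd_", 1000000) would produce rpd_1.txt, rpd_2.txt, and so on; the chunk size of one million lines is only an example and would need to be chosen to fit the memory available to Stata.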