Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: chunky errors


From   "Dimitriy V. Masterov" <[email protected]>
To   [email protected]
Subject   Re: st: RE: chunky errors
Date   Sat, 20 Nov 2010 19:07:51 -0500

Thanks for your suggestion, Buzz. I'll give gsplit a whirl tomorrow morning.

DVM

P.S. That's a lot of cows you've got to keep track of!

On Sat, Nov 20, 2010 at 6:12 PM, Buzz Burhans <[email protected]> wrote:
> Dimitriy,
>
> I can't help you with the -chunky- error, but might have an alternative.  I
> recently had to split a large csv file with 1.7 million records; I
> successfully used a freeware utility (standalone, not inside  Stata) called
> "gsplit".  It allowed me to split by line, so I could keep records together,
> and it allowed me to add a header to each split subfile, which made
> importing into Stata easier.
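
For anyone who would rather script the same idea than install a utility, here is a minimal Python sketch of a line-based split that optionally writes a header at the top of every part. It is not how gsplit works internally; the function name `split_by_lines` and its parameters are made up for illustration, and the `\r\n` after the header assumes Windows line endings.

```python
def split_by_lines(src_path, lines_per_part, stub, header=None):
    """Split src_path into files of at most lines_per_part lines each.

    Records are never cut in half because the file is read line by line.
    If header (bytes, without a line ending) is given, it is written as the
    first line of every part. Returns the list of part file names.
    """
    parts = []
    out = None
    count = 0
    # Binary mode: bytes are copied untouched, so existing line endings
    # survive and only one line is held in memory at a time.
    with open(src_path, "rb") as src:
        for line in src:
            if out is None or count == lines_per_part:
                if out is not None:
                    out.close()
                name = f"{stub}{len(parts) + 1}.txt"
                out = open(name, "wb")
                parts.append(name)
                count = 0
                if header is not None:
                    # Assumption: the file uses Windows (CRLF) line endings.
                    out.write(header + b"\r\n")
            out.write(line)
            count += 1
    if out is not None:
        out.close()
    return parts
```

A call such as `split_by_lines("raw_price_data.csv", 1000000, "rpd_")` would write rpd_1.txt, rpd_2.txt and so on; the line count per part is only an example value.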
>
> Gsplit is available for download at:
>
> http://www.gdgsoft.com/gsplit/
>
> I also looked at, but ended up not using, file splitter, available as a
> download on CNET:
>
> http://download.cnet.com/File-Splitter/3000-2248_4-10405033.html
>
> Good luck
>
>
> Buzz Burhans, Ph.D.
>
> Dairy-Tech Group
> So. Albany, VT / Twin Falls ID
>
> Cell: 208-320-0829
> ID Fax: 208-735-1289
> VT Fax: 802-755-6842
>
> Email: [email protected]
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Dimitriy V. Masterov
> Sent: Saturday, November 20, 2010 3:41 PM
> To: Statalist
> Subject: st: chunky errors
>
> I am running Stata/MP 11.1 on a 64-bit Windows 7 laptop with 8 gigs of
> RAM.  I am trying to read in a 10.9 GB csv file by breaking it up using
> chunky. I am getting an error message that I can't interpret. The file
> has 5,320,745 rows and 387 columns. It's got some weird ascii
> characters at the beginning of the first line (ï,»,¿), but I can't open
> the file in any text editor to edit them out (vi crashes even when I
> try to edit the first line with ex). I don't know if this is what's
> causing the problem.
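
The three characters ï, », ¿ are how the bytes EF BB BF display in a Latin-1 code page. They match the extended codes 239, 187 and 191 that chunky reports further down, and together they form the UTF-8 byte order mark (BOM). Whether the BOM has anything to do with the chunky error is not established in this thread, but it can be removed without an editor. The following is only a sketch in Python (the function name `strip_utf8_bom` is made up); it streams the file, so it needs free disk space for a second copy but very little memory.

```python
import codecs
import shutil


def strip_utf8_bom(src_path, dst_path):
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM if present.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(len(codecs.BOM_UTF8))  # first 3 bytes of the file
        had_bom = head == codecs.BOM_UTF8      # b"\xef\xbb\xbf"
        if not had_bom:
            dst.write(head)  # not a BOM: keep these bytes
        # Stream the rest in 1 MB blocks, so memory use stays small
        # even for a multi-gigabyte file.
        shutil.copyfileobj(src, dst, 1024 * 1024)
    return had_bom
```

For example, `strip_utf8_bom("raw_price_data.csv", "raw_price_data_nobom.csv")`, where the output file name is just a placeholder.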
>
> The first chunk file rpd_1.txt is created, but it is empty.
>
> Here's what I am doing:
>
> . set mem 6g;
>
> Current memory allocation
>
>                    current                                 memory usage
>    settable          value     description                 (1M = 1024k)
>    --------------------------------------------------------------------
>    set maxvar         5000     max. variables allowed           1.947M
>    set memory         6144M    max. data space              6,144.000M
>    set matsize         400     max. RHS vars in models          1.254M
>                                                            -----------
>                                                             6,147.201M
>
> . chunky using raw_price_data.csv, peek(2);
>
> Peeking at the first 2 lines of raw_price_data.csv
>
>
> 1003,2008-03-23,0,Disc,NULL,7.99,7.87678139240359,8.35,8.56885514506366,1
> 0.7974394069005,NULL,10.2566666666667,6.99,6.99,7.99,7.77333333333333,10.36
>>
> 5,NULL,NULL,5.99,5.99,5.99,NULL,9.29,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
> ,NULL,6.32666666666667,6.96559727916511,8.04833333333333,8.70765897152336,1
>>
> 0.4373620377794,NULL,NULL,6.99,6.47333333333333,6.99,6.80999783830523,10.071
> 0149156939,6.19008481948673,7.61715989563921,NULL,NULL,NULL,NULL,NULL,7.99,
>>
> NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,7,7.99115044247788,NULL,11.19,N
> ULL,NULL,5,5,5.675,NULL,NULL,NULL,NULL,5.99,5.99,NULL,NULL,NULL,NULL,NULL,N
>>
> ULL,NULL,NULL,NULL,NULL,NULL,NULL,7.99,5.096875,NULL,NULL,7.99,NULL,NULL,NUL
> L,4.52849423193685,NULL,7.26876442015786,NULL,NULL,NULL,NULL,NULL,NULL,NULL
>>
> ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2.52905297740543,2.280611
> 20807602,2.29737689297299,NULL,4.29484076024002,NULL,NULL,4.99,NULL,0.69866
>>
> 8055269689,1.70842964884469,NULL,1003,2008-03-23,0,Disc,NULL,2.07,2.34533333
> 333333,2.42321376851818,2.68166666666667,2.92333333333333,NULL,2.7948205128
>>
> 2051,0.99,1.22441441441441,1.58435261707989,1.9752380952381,2.295,NULL,NULL,
> 1.056,1.415,1.305,NULL,2.07,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2.
>>
> 06415557325445,2.26902502939606,2.59914202498692,2.87889430894309,3.16111111
> 111111,NULL,NULL,1.11586088873419,1.29769942196532,1.38903947368421,1.65,1.
>>
> 645,1.57740131578947,1.53326226012793,NULL,NULL,NULL,NULL,NULL,1.56,NULL,NUL
> L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2.4115,2.55,NULL,2.385,NULL,NULL,0.975
>>
> ,1.27613965826475,1.2675,NULL,NULL,NULL,NULL,1.10231092436975,1.282689075630
> 25,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2.0681468218
>>
> 4423,2.44601506503719,NULL,NULL,2.41783348254252,NULL,NULL,NULL,1.2828851963
> 7462,NULL,1.84778700906344,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
>>
> L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,0.749815048945067,0.646367713004484,0.5
> 15949492992582,NULL,2.4335277819483,NULL,NULL,2.6,NULL,0.570616883116883,1.
>>
> 11963210197913,NULL,1003,2008-03-23,0,Disc,NULL,2,15,10,6,3,NULL,3,3,3,3,3,2
> ,NULL,NULL,5,2,4,NULL,1,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,3,6,12
>>
> ,4,10,NULL,NULL,3,3,5,1,2,2,1,NULL,NULL,NULL,NULL,NULL,1,NULL,NULL,NULL,NULL
> ,NULL,NULL,NULL,NULL,NULL,2,1,NULL,1,NULL,NULL,2,16,4,NULL,NULL,NULL,NULL,1
>>
> ,1,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1,5,NULL,NULL
> ,1,NULL,NULL,NULL,5,NULL,2,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
>>
> L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,11,3,10,NULL,6,NULL,NULL,1,NULL,3,9,NUL
> L0d0a
>
> 1003,2008-03-23,0,Free,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
> L,NULL,NULL,NULL,NULL,0,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
>>
> ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
> NULL,0,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N
>>
> ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
> LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
>>
> LL,NULL,NULL,NULL,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N
> ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,0,NULL,0,NULL,0,NULL,NULL,NULL,
>>
> NULL,NULL,NULL,NULL,1003,2008-03-23,0,Free,NULL,NULL,NULL,NULL,NULL,NULL,NUL
> L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1.1,1.28,NULL,NULL,NULL,NULL,NULL
>>
> ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
> NULL,NULL,NULL,NULL,NULL,NULL,1.08457627118644,1.28542372881356,NULL,NULL,N
>>
> ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
> LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
>>
> LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,0.9
> 9,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
>>
> L,NULL,NULL,NULL,NULL,NULL,0.749580536912752,NULL,0.475,NULL,2.2225167785234
> 9,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1003,2008-03-23,0,Free,NULL,NULL,NULL,
>>
> NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1,1,NULL,NULL,NU
> LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
>>
> LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,1,1,NULL,NULL,NULL,NULL,NULL,NULL
> ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
>>
> ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
> NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,2,NULL,NULL,NULL,NUL
>>
> L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
> ,NULL,6,NULL,2,NULL,1,NULL,NULL,NULL,NULL,NULL,NULL,NULL0d0a
>
> (for reference: End of line characters odoa (CRLF) indicate Windows,
> oa (LF) Unix and od (CR) Mac.
> 09 is the TAB character.)
>
>
> .  chunky using raw_price_data.csv, analyze;
>
> Analyzing raw_price_data.csv for chunking
>
> EXTENDED ASCII is the file type
> File has 5320745 lines of average length 2214 bytes
> Composition is 52% letters, 28% numbers and 20% other characters
> Extended characters are present and may cause problems.
>
> Extended characters found:
> +-----------------+
> |ASCII |  count   |
> |------+----------|
> |187   |         1|
> |191   |         1|
> |239   |         1|
> +-----------------+
>
> Approximate chunk sizes and memory requirements
> for -insheet- or -infile- commands
> +-----------------------------------------------------------+
> |Chunksize (mb)|  Number of   |   ~Number    | Stata size*  |
> |    option    |    Chunks    |  obs/chunk   | (megabytes)  |
> |--------------+--------------+--------------+--------------|
> |          10  |        1179  |        4513  |        10.2  |
> |          30  |         393  |       13539  |        30.5  |
> |         100  |         118  |       45091  |       101.5  |
> |         300  |          40  |      133019  |       299.5  |
> |        1000  |          12  |      443395  |       998.3  |
> |        3000  |           4  |     1330186  |      2994.8  |
> +-----------------------------------------------------------+
> * Stata file size is very approximate and depends on datatypes of variables
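
As a sanity check on the "Number of Chunks" column: the counts are consistent with dividing the total byte count that hexdump reports for this file (11,781,859,404) by the chunk size and rounding up, taking 1 MB as 1,000,000 bytes. That unit is an assumption suggested by chunksize(3000m) being echoed as 3,000,000,000; how chunky actually computes the column is not shown in this thread. The helper name `n_chunks` below is made up for illustration.

```python
TOTAL_BYTES = 11_781_859_404  # "Total" reported by hexdump for this file


def n_chunks(total_bytes, chunk_mb):
    """Chunks needed to cover total_bytes, assuming 1 MB = 1,000,000 bytes."""
    chunk_bytes = chunk_mb * 1_000_000
    # Integer ceiling division: a final partial chunk still counts as one.
    return -(-total_bytes // chunk_bytes)


for mb in (10, 30, 100, 300, 1000, 3000):
    print(mb, n_chunks(TOTAL_BYTES, mb))
# prints 1179, 393, 118, 40, 12 and 4 chunks respectively
```

So chunksize(3000m) would yield four files of up to 3,000,000,000 bytes each.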
>
> Further detail available by running hexdump `"raw_price_data.csv"', analyze results
>
> . hexdump raw_price_data.csv, analyze results;
>
>  Line-end characters                        Line length (tab=1)
>    \r\n         (Windows)      5,320,745      minimum                    1,777
>    \r by itself (Mac)                  0      maximum                    3,547
>    \n by itself (Unix)                 0
>  Space/separator characters                 Number of lines          5,320,745
>    [blank]                             0      EOL at EOF?                  yes
>    [tab]                               0
>    [comma] (,)             2,053,807,570    Length of first 5 lines
>  Control characters                           Line 1                     2,494
>    binary 0                            0      Line 2                     1,945
>    CTL excl. \r, \n, \t                0      Line 3                     2,039
>    DEL                                 0      Line 4                     2,620
>    Extended (128-159,255)              0      Line 5                     1,920
>  ASCII printable
>    A-Z                     6,071,073,584
>    a-z                        47,886,705    File format         EXTENDED ASCII
>    0-9                     3,264,399,987
>    Special (!@#$ etc.)       334,050,065
>    Extended (160-254)                  3
>                          ---------------
>  Total                       11781859404
>
>  Observed were:
>     \n \r , - . 0 1 2 3 4 5 6 7 8 9 D E F L M N U c e i n r s u » ¿ ï
>
> . chunky using raw_price_data.csv, chunksize(3000m) stub(rpd_) replace;
>
> Chunking using the following settings:
>
> Chunksize:    3,000,000,000
> Memory:       6,442,450,944
> Bites:                    1
> Bitesize:     3,000,000,000
>
>                 fread():  3300  argument out of range
>             chunkfile():     -  function returned error [46]
>                 <istmt>:     -  function returned error [1]
> r(3300);
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
>

