



Re: st: Importing subset of a pipe delimited textfile


From: Maarten Buis <maartenlbuis@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Importing subset of a pipe delimited textfile
Date: Wed, 17 Oct 2012 14:04:17 +0200

I noticed that you did not want to use -insheet- because your data are
too big to read at once and -insheet- does not allow the -in-
qualifier. You can still handle big data in chunks with -insheet- by
specifying variables and, when necessary, merging the sub-files later,
after making them smaller by getting rid of the strings with -encode-
followed by -compress-.
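
For chunking by rows rather than by variables, here is a minimal sketch along the lines of the first pseudocode idea quoted further down. It uses -file- to copy the text file line by line into pieces of 10000 lines each, and then reads one piece with -insheet-. The file names myfile.txt and myfilepos#.txt and the chunk size come from that pseudocode; the handle names src and dst are made up for illustration. It assumes the file has no header line with variable names, and it has to be run as one do-file so that the local macros stay defined. Reading a 4Gb file this way is likely to be slow.

```stata
* Sketch only: split myfile.txt into myfilepos1.txt, myfilepos2.txt, ...
* File names and chunk size are taken from the pseudocode in the thread.
local chunksize 10000
local counter 1
local n 0

* src and dst are hypothetical handle names
tempname src dst
file open `src' using "myfile.txt", read text
file open `dst' using "myfilepos`counter'.txt", write text replace

file read `src' line
while r(eof) == 0 {
    * Current piece is full: close it and start the next one
    if `n' == `chunksize' {
        file close `dst'
        local ++counter
        local n 0
        file open `dst' using "myfilepos`counter'.txt", write text replace
    }
    * Copy the line unchanged, followed by a newline
    file write `dst' `"`macval(line)'"' _n
    local ++n
    file read `src' line
}
file close `dst'
file close `src'

* Read the first piece; the pipe is given as the delimiter
insheet using "myfilepos1.txt", delimiter("|") clear
compress
```

Each piece can then be shrunk with -encode- and -compress- and saved before the next one is read.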

--Maarten

On Wed, Oct 17, 2012 at 1:50 PM, Maarten Buis <maartenlbuis@gmail.com> wrote:
> To give a concrete example: I stored Rob's example dataset in foo.raw
>
> I then typed in Stata:
>
> filefilter foo.raw foo2.raw, from("|") to(\t) replace
>
> insheet using foo2.raw
>
> The first line replaced all pipes in the file foo.raw with a tab and
> stored the resulting tab-delimited file in foo2.raw, and the second
> line read this tab-delimited file foo2.raw into Stata.
>
> Hope this helps,
> Maarten
>
> On Wed, Oct 17, 2012 at 1:37 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> Why is varying length of line a problem? So long as the same variables
>> are represented on each line, I can see no problem.
>>
>> Also, -filefilter- has a tacit loop; you don't need to set it up for yourself.
>>
>> Nick
>>
>> On Wed, Oct 17, 2012 at 12:33 PM, Rob Shaw <rob.shaw.uk@gmail.com> wrote:
>>> Nick
>>>
>>> Thanks. Yes that would work but the problem is the varying length of
>>> each line. So I need to get filefilter or another command to do one
>>> of:
>>>
>>> x=0
>>> counter=1
>>> with "myfile.txt" {
>>>  y = position of 10000th EOL in `i'
>>>  save `i' from position x to y in "myfilepos"+counter+".txt"
>>>  x =y
>>> }
>>>
>>> This would create files called myfilepos1, myfilepos2 etc each with
>>> 10000 lines that I could then -insheet- with a delimiter(|) option.
>>> But I don't know how to correctly specify the bit in the loop.
>>>
>>> OR
>>>
>>> for each line in "myfile.txt" {
>>>  find | and replace with a number of spaces depending on position in row
>>> }
>>>
>>> This would make each line the same length so I could use -infile-
>>>
>>> Is there a way to use -filefilter- to achieve this?
>>>
>>> File sample:
>>>
>>> 1|ABCD|23|XYZ
>>> 10|BCED|1|YZX
>>> 30|DCHS|234|YBH
>>> ....
>>>
>>> Thanks
>>> Rob
>>>
>>>
>>>>I'd use -filefilter- to change the pipes to something that -infile- can handle.
>>>
>>>>(Strictly, -in- is a qualifier, not an option.)
>>>
>>>>Nick
>>>
>>>>On Wed, Oct 17, 2012 at 9:13 AM, Rob Shaw <rob.shaw.uk@gmail.com> wrote:
>>>
>>>> I have a very large (around 4Gb) text file that has been pipe
>>>> delimited. It won't all fit in memory so I want to process it in
>>>> parts.
>>>>
>>>> For fixed datasets I would use infile with the in 1/10000000 option
>>>> then 10000001/20000000 etc. However, this dataset has been pipe
>>>> delimited so I would need to use insheet, but insheet doesn't seem to
>>>> permit the "in" option.
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> ---------------------------------
> Maarten L. Buis
> WZB
> Reichpietschufer 50
> 10785 Berlin
> Germany
>
> http://www.maartenbuis.nl
> ---------------------------------




