
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Importing subset of a pipe delimited textfile


From   Maarten Buis <[email protected]>
To   [email protected]
Subject   Re: st: Importing subset of a pipe delimited textfile
Date   Wed, 17 Oct 2012 14:27:39 +0200

I noticed that too late, having already sent my answer. To address
that concern: rather than splitting your file up horizontally, you
could split it up vertically, i.e. import the variables (or groups of
variables) separately, change the way in which they are stored so that
they use less memory, and then merge the pieces back together.
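
A minimal sketch of that vertical split, under some assumptions: it
uses -import delimited-, which was only added in Stata 13 (after this
thread), the file is called myfile.txt, it has four columns and no
header line as in Rob's sample, and part1 and the _a/_b suffixes are
placeholder names. A row counter serves as the merge key, which relies
on both passes reading the lines in the same order.

```stata
* Sketch only: file names, column ranges and suffixes are assumptions.

* Columns 1-2: read, shrink storage types, add a row key, save
import delimited using myfile.txt, delimiters("|") varnames(nonames) colrange(1:2) clear
compress
rename * *_a
generate long id = _n
save part1, replace

* Columns 3-4: same steps, then join to the first part on the row key
import delimited using myfile.txt, delimiters("|") varnames(nonames) colrange(3:4) clear
compress
rename * *_b
generate long id = _n
merge 1:1 id using part1, nogenerate
```

The suffixes are there only so that variable names from the two passes
cannot clash in the merge; -compress- is one way to change the storage
types, and string variables with few distinct values could also be
turned into labelled numeric variables with -encode-.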

Hope that helps,
Maarten

On Wed, Oct 17, 2012 at 2:10 PM, Rob Shaw <[email protected]> wrote:
> Maarten
>
> The problem is not the pipes as such (otherwise I could just use the
> delimiter options in -insheet-), it's that the file is too large to
> use -insheet-
>
> So I need to use -infile- to import my file in separate parts, but
> infile will only accept fixed format files (as far as I understand).
> Therefore, if I import my file using:
>
> infile str2 var1 _skip(1) str4 var2 _skip(1) str3 var3 _skip(1) str4
> var4  using myfile in 1/1000000
>
> I get nonsense because the first record then gets filled with [1|,
> BCD|, 3|X, YZ]
>
> Rob
>
> Maarten wrote:
>
> To give a concrete example: I stored Rob's example dataset in foo.raw
>
> I then typed in Stata:
>
> filefilter foo.raw foo2.raw, from("|") to(\t) replace
>
> insheet using foo2.raw
>
> The first line replaced all pipes in the file foo.raw with a tab and
> stored the resulting tab-delimited file in foo2.raw, and the second
> line read this tab-delimited file foo2.raw into Stata.
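
If the converted file is still too large for -insheet-, the same
-filefilter- step can feed free-format -infile-, which does accept the
-in- qualifier. The following is only a sketch: the variable names and
types are guesses from the four-column sample in this thread, chunk1
and chunk2 are placeholder names, and it assumes that every line
carries all four fields and that no field contains a comma or a blank
(free-format -infile- reads values in sequence and does not treat line
ends specially).

```stata
* Sketch only: names, types and chunk sizes are assumptions.
* Turn the pipes into commas, which free-format -infile- treats as separators
filefilter foo.raw foo2.raw, from("|") to(",") replace

* First million observations
infile int var1 str4 var2 int var3 str3 var4 using foo2.raw in 1/1000000, clear
compress
save chunk1, replace

* Second million observations
infile int var1 str4 var2 int var3 str3 var4 using foo2.raw in 1000001/2000000, clear
compress
save chunk2, replace
```

Commas are used here rather than tabs so that an empty field shows up
as two adjacent separators instead of being swallowed as white space.
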
>
> Hope this helps,
> Maarten
>
> On Wed, Oct 17, 2012 at 1:37 PM, Nick Cox <[email protected]> wrote:
>> Why is varying length of line a problem? So long as the same variables
>> are represented on each line, I can see no problem.
>>
>> Also, -filefilter- has a tacit loop; you don't need to set it up for yourself.
>>
>> Nick
>>
>> On Wed, Oct 17, 2012 at 12:33 PM, Rob Shaw <[email protected]> wrote:
>>> Nick
>>>
>>> Thanks. Yes that would work but the problem is the varying length of
>>> each line. So I need to get filefilter or another command to do one
>>> of:
>>>
>>> x=0
>>> counter=1
>>> with "myfile.txt" {
>>>  y = position of 10000th EOL in `i'
>>>  save `i' from position x to y in "myfilepos"+counter+".txt"
>>>  x =y
>>> }
>>>
>>> This would create files called myfilepos1, myfilepos2 etc each with
>>> 10000 lines that I could then -insheet- with a delimiter(|) option.
>>> But I don't know how to correctly specify the bit in the loop.
>>>
>>> OR
>>>
>>> for each line in "myfile.txt" {
>>>  find | and replace with a number of spaces depending on position in row
>>> }
>>>
>>> This would make each line the same length so I could use -infile-
>>>
>>> Is there a way to use -filefilter- to achieve this?
>>>
>>> File sample:
>>>
>>> 1|ABCD|23|XYZ
>>> 10|BCED|1|YZX
>>> 30|DCHS|234|YBH
>>> ....
>>>
>>> Thanks
>>> Rob
>>>
>>>
>>>>I'd use -filefilter- to change the pipes to something that -infile- can handle.
>>>
>>>>(Strictly, -in- is a qualifier, not an option.)
>>>
>>>>Nick
>>>
>>>>On Wed, Oct 17, 2012 at 9:13 AM, Rob Shaw <[email protected]> wrote:
>>>
>>>> I have a very large (around 4Gb) text file that has been pipe
>>>> delimited. It won't all fit in memory so I want to process it in
>>>> parts.
>>>>
>>>> For fixed datasets I would use infile with the in 1/10000000 option
>>>> then 10000001/20000000 etc. However, this dataset has been pipe
>>>> delimited so I would need to use insheet, but insheet doesn't seem to
>>>> permit the "in" option.
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> ---------------------------------
> Maarten L. Buis
> WZB
> Reichpietschufer 50
> 10785 Berlin
> Germany



-- 
---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

http://www.maartenbuis.nl
---------------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

