Statalist



RE: st:How to input a portion of a file


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st:How to input a portion of a file
Date   Thu, 21 Feb 2008 16:47:56 -0000

As far as I am concerned, -insheet- is making a guess. You might object
to the word, but I would be happy if the word were replaced by another.
If you want to be formal, and talk about algorithm or rules, fine. 

I say guess, because -insheet- must jump one way or the other if
evidence is contradictory. 

As two very simple examples, test.txt that is 

some 
header 
material
1 2 3 

and -insheet using test.txt, nonames clear- yields one string variable
with 4 observations, while the same file with contents 

some header material in six words 
1 2 3

yields one string variable with 2 observations. In neither of these
examples is the file consistent, and -insheet- makes assumptions about
what is intended. 
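
The two cases above can be reproduced with a short do-file. This is only a
minimal sketch: it writes each version of test.txt with -file write- and then
runs the same -insheet- call. The handle name fh is arbitrary, and the
trailing blanks in the lines as posted are assumed to be incidental. Results
may differ across Stata versions, so -describe- and -list- are used to inspect
what -insheet- decided rather than to assert it.

```stata
* First example: three one-word header lines, then one line of numbers
tempname fh
file open `fh' using test.txt, write text replace
file write `fh' "some" _n "header" _n "material" _n "1 2 3" _n
file close `fh'

insheet using test.txt, nonames clear
describe    // how many variables, and of what type?
list

* Second example: one six-word header line, then the same line of numbers
file open `fh' using test.txt, write text replace
file write `fh' "some header material in six words" _n "1 2 3" _n
file close `fh'

insheet using test.txt, nonames clear
describe
list
```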


I can't comment on how many passes -insheet- makes. What you say here is
presumably based on documentation I have not read, or documentation I
have read but forgotten, or something you have worked out yourself. No
doubt various people would be interested if you explained which. 

Nick
n.j.cox@durham.ac.uk 

Sergiy Radyakin

Stata does not "guess". Stata determines the variable types using a
"scientific method of looking". Stata looks not "at the early bit of
the file" - it looks at the whole file. This is called "the first
pass". After the variable types are determined - the file can be read
in - that is called "the second pass". Users of StatTransfer will be
familiar with this technique - StatTransfer will do two passes over
your data and is very explicit at showing its progress.

It remains unclear, however, what the third pass in Stata's
-insheet- procedure is for. It could be a simple inefficiency of code,
or it could be something else, which I don't see at the moment, which
necessitates the third pass (and this is more probable, since even the
most recent version does so).

The fact is however, that Stata will fully read the file 3 times when
importing from text format. If the file is already in dta format, one
pass is enough, and here Stata is very fast.

On 2/21/08, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> That's an instructive example.
>
> As I understand it, -insheet- peeks at the early bit of the file,
> makes a guess at the number and type of variables, and assigns
> accordingly. Whether guessing will also reliably give a workable
> answer with Joseph Wagner's files, I can't say.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Friedrich Huebler
>
> Assume we have a file "test.txt" that contains the following text
> (without the Start and End lines). We are only interested in the
> numbers.
>
> === Start of file ===
> I am not clear how that this will help, as the header text and
> the remainder of the file will give -insheet- quite different
> ideas about what variables there are.
> mpg trunk turn
> 22 11 40
> 17 11 40
> 22 12 35
> 20 16 40
> === End of file ===
>
> Let's import the data with -insheet-.
>
> . insheet using test.txt, nonames delimiter(" ")
> (14 vars, 8 obs)
> . drop if _n < 5
> (4 observations deleted)
> . drop v4 - v14
> . list
>
>     +--------------+
>     | v1   v2   v3 |
>     |--------------|
>  1. | 22   11   40 |
>  2. | 17   11   40 |
>  3. | 22   12   35 |
>  4. | 20   16   40 |
>     +--------------+
>
> Friedrich
>
> On Wed, Feb 20, 2008 at 6:35 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> > I am not clear how that this will help, as the header text and the
> >  remainder of the file will give -insheet- quite different ideas
> >  about what variables there are.
> >
> >
> >  Nick
> >  n.j.cox@durham.ac.uk
> >
> >  Friedrich Huebler
> >
> >
> >  You wrote that -insheet- with subsequent deletion of unwanted data
> >  is "sloppy". That approach might still be the easiest if all files
> >  have the same structure and your data always appear in the same
> >  columns.
> >
> >  . insheet using filename, nonames
> >  . drop if _n < 30 | _n > 129
> >  . drop v1 - v20 v25 - v30
> >
> >
> >
> > On Feb 18, 2008 9:26 AM, Joseph Wagner <joseph.wagner@wright.edu> wrote:
> >  > I have data I wish to input a portion of into STATA.  Data is
> >  > collected on patients by a machine that measures their gait as they
> >  > walk.  A text file is output for each patient with columns
> >  > representing variables (each about 130 lines long) but the multiple
> >  > observation data doesn't start until line 29.  The first 28 lines
> >  > are taken up with short lines of data describing the patient.
> >  > Unfortunately, I also need a couple of those lines in 'header'
> >  > area.  The 29th line has the variables names but they do not line
> >  > up directly with the columns of data so I figured I could just
> >  > label the data later.  The data I need starts 30 lines down at
> >  > column 115 and includes the next 4 columns and goes down 100 lines.
> >  >
> >  > I realize there are easier ways to do this but I have data on about
> >  > 300 patients (and so one file for each person) and wanted to
> >  > automate this input (followed by successive merging of files to get
> >  > my final dataset).
> >  >
> >  > I wanted to use the -infix- command but have never used this
> >  > command before and my attempts so far have failed.  I also tried
> >  > using -infile- with the _first(30) option and the _line(30) option
> >  > but those didn't seem to work either.
> >  >
> >  > Here is a dictionary I attempted with just one of the variables:
> >  >
> >  > dictionary using "c:\data\gait\SBS00001_20050607_1.nrm" {
> >  >        _line(30)
> >  >        _column(115) r_grf_vrt_frc %5f
> >  > }
> >  >
> >  > infile using SBS00001_20050607_1.dct
> >  >
> >  > unexpected end of file
> >  > (5 observations read)
> >  >
> >  > The other problem is that it didn't seem to pull the data
> >  > corresponding to that column.  I thought perhaps there was a
> >  > problem with the data not being in a fixed format but if I try
> >  > -insheet- all the data imports and the correct data lines up in the
> >  > individual columns.  Of course I could write some programming
> >  > whereby I delete the unneeded variables and line but that's kind of
> >  > sloppy.
> >  >
> >  > I am using STATA ver. 8.2
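
On the dictionary quoted above: in an -infile- dictionary, _line(#) moves to
line # within each observation, not to line # of the file. So _line(30)
appears to tell -infile- that every observation spans at least 30 lines, which
would be consistent with "unexpected end of file" and "5 observations read"
from a file of about 130 lines. A minimal sketch of an alternative follows. It
assumes that the file really is fixed-width from line 30 onward and that the
value sits in character columns 115-119; neither is established in this
thread, and the report that -insheet- lines the columns up suggests the file
may be delimited instead. _first(30) marks the line of the file at which the
data begin. The original poster reports that _first(30) did not seem to work
in his attempts, and its behaviour in Stata 8.2 is not confirmed here.

```stata
dictionary using "c:\data\gait\SBS00001_20050607_1.nrm" {
    * data begin on line 30 of the file (line 29 holds the variable names)
    _first(30)
    * one line per observation: read 5 characters starting at column 115
    _column(115) r_grf_vrt_frc %5f
}
```

With that saved as the .dct file, the 100 wanted lines can be requested with
an -in- range:

. infile using SBS00001_20050607_1.dct in 1/100, clear

For about 300 files, one possibility is to drop the using clause from the
dictionary and name the data file with the using() option of -infile-, so that
the same dictionary can be reused in a loop. Here gait.dct is a hypothetical
name for such a dictionary; forward slashes are used in the path because a
backslash directly before a macro reference would stop the macro from
expanding.

. infile using gait.dct in 1/100, using("c:/data/gait/SBS00001_20050607_1.nrm") clear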



