
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: insheet and dropping cases


From   Sergiy Radyakin <serjradyakin@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: insheet and dropping cases
Date   Thu, 20 Feb 2014 15:21:02 -0500

Hello Ben,
the report is helpful, and it is safe to post it as it is Stata's
output, which contains nothing unprintable. Note how Stata writes the
escape sequences \n and \r for the unprintable characters 10 and 13.
For a description of the unprintable ASCII characters and their role
in controlling text, see e.g. the following page:
http://www.juniper.net/techpubs/en_US/idp5.1/topics/reference/general/intrusion-detection-prevention-custom-attack-object-extended-ascii.html
or google them (plenty of links); most of them are archaic.

We focus on the 0-31 range. You have only 10 and 13, which is the
typical end-of-line pattern \r\n. There are no gremlins to zap, so to
speak. Also, \r and \n have the same frequency, which means they are
likely properly paired at the end of each line.

There is also nothing in the upper (non-ASCII) range, characters 128-255.

To be sure that the report itself is correct, verify that the total
file length as reported by the OS equals the sum of the frequencies of
all characters (394,625).
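As an aside, the same tabulate-and-total check can be sketched outside
Stata. This is a minimal Python illustration (the function names are my
own, not part of any Stata tool), mimicking what -hexdump, analyze
tabulate- reports:

```python
import os
from collections import Counter

def tabulate_bytes(path):
    """Count the frequency of every byte value in a file,
    the same information -hexdump, analyze tabulate- reports."""
    with open(path, "rb") as f:
        return Counter(f.read())

def check_total(path):
    """Verify the tabulation is complete: the per-byte frequencies
    must sum to the file length reported by the OS."""
    counts = tabulate_bytes(path)
    return sum(counts.values()) == os.path.getsize(path)
```

For a well-formed Windows text file, bytes 13 (\r) and 10 (\n) should
come out with equal counts, as they do in your tabulation.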

I note the use of single and double quotes to denote minutes and
seconds in the coordinates. Perhaps this is what confuses Stata. In
some records you posted I see "MIN" as a word; in some it is a "
character. When it encounters a quote, even in a delimiter-separated
file, Stata seeks ahead to the matching closing quote, which could be
a long way from where the quote opened. If you expect double quotes to
denote seconds and single quotes minutes, run -filefilter- to replace
them in advance, and retry.
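To illustrate the failure mode, here is a sketch in Python's csv
module rather than Stata's parser (the details of Stata's quote
handling differ, but the generic quoting rule of delimited-file
readers is the same): a stray double quote opening a field swallows
delimiters and line endings until the next quote, merging records,
which is exactly how observations go missing.

```python
import csv
import io

# Two pipe-delimited records; the stray " after the first delimiter
# is meant as a seconds mark, not as a string quote:
raw = 'A|"30|B\nC|40"|D\n'

# With quoting enabled, the parser treats " as opening a quoted field
# and reads across delimiters and the newline: one record, not two.
bad = list(csv.reader(io.StringIO(raw), delimiter="|"))

# Replacing the quote characters up front (what -filefilter- would do)
# restores one field per delimiter and one record per line:
filtered = raw.replace('"', ' SEC')
good = list(csv.reader(io.StringIO(filtered), delimiter="|",
                       quoting=csv.QUOTE_NONE))
```

Here `bad` holds a single merged record, while `good` holds the two
intended records of three fields each.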

Hope this helps, Sergiy Radyakin





On Thu, Feb 20, 2014 at 2:19 PM, Ben Hoen <bhoen@lbl.gov> wrote:
> Hi Sergiy,
>
> I am pasting in the tabulation from hexdump (not knowing how to provide a
> link to those files as you suggest):
>
> Tabulation (character not listed if unobserved):
> Dec Hex  Char        Frequency
> ------------------------------
> 010  0a  \n                364
> 013  0d  \r                364
> 032  20  blank           9,621
> 033  21  !                   9
> 034  22  "                   5
> 035  23  #                  21
> 038  26  &                 202
> 039  27  '                 135
> 040  28  (                  30
> 041  29  )                  29
> 042  2a  *                   4
> 043  2b  +                   7
> 044  2c  ,                 112
> 045  2d  -               3,378
> 046  2e  .                 282
> 047  2f  /                 337
> 048  30  0             157,131
> 049  31  1              18,056
> 050  32  2              13,187
> 051  33  3               8,837
> 052  34  4               8,087
> 053  35  5               6,803
> 054  36  6               7,456
> 055  37  7               6,283
> 056  38  8               6,322
> 057  39  9               6,333
> 058  3a  :                  98
> 059  3b  ;                  24
> 064  40  @                   1
> 065  41  A               5,418
> 066  42  B               1,197
> 067  43  C               3,167
> 068  44  D               2,399
> 069  45  E               5,718
> 070  46  F               1,067
> 071  47  G               1,612
> 072  48  H               1,597
> 073  49  I               4,112
> 074  4a  J                 300
> 075  4b  K                 873
> 076  4c  L               4,877
> 077  4d  M               1,693
> 078  4e  N               4,099
> 079  4f  O               4,254
> 080  50  P               1,343
> 081  51  Q                 149
> 082  52  R               4,634
> 083  53  S               3,272
> 084  54  T               3,756
> 085  55  U               1,162
> 086  56  V                 865
> 087  57  W               1,488
> 088  58  X                 151
> 089  59  Y               1,369
> 090  5a  Z                  67
> 095  5f  _                 726
> 097  61  a                 817
> 098  62  b                  73
> 099  63  c                 147
> 100  64  d                 323
> 101  65  e                 887
> 102  66  f                  65
> 103  67  g                 199
> 104  68  h                 189
> 105  69  i                 498
> 107  6b  k                 233
> 108  6c  l                 419
> 109  6d  m                 111
> 110  6e  n                 616
> 111  6f  o                 872
> 112  70  p                 107
> 113  71  q                   4
> 114  72  r                 581
> 115  73  s                 252
> 116  74  t                 390
> 117  75  u                 132
> 118  76  v                 172
> 119  77  w                  74
> 120  78  x                  13
> 121  79  y                  90
> 122  7a  z                   7
> 124  7c  |             72,436
> 125  7d  }                  35
> ------------------------------
> Total                  394,625
>
> It is not clear to me what the problem characters are - unprintable/special
> or not - but I tried replacing the "}" character (and the comma previously)
> to no avail.
>
> Separately I think I isolated the fields that contain the problems.  Is
> there a way to ignore/remove individual fields in a txt file from within
> Stata?
>
> Thank you for your efforts in helping me with this issue.
>
> Ben
>
> Ben Hoen
> LBNL
> Office: 845-758-1896
> Cell: 718-812-7589
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sergiy Radyakin
> Sent: Thursday, February 20, 2014 1:28 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: insheet and dropping cases
>
> Ben,
>
> -- the problem is likely caused by the presence of unprintable characters
> in the file that are tolerated by Stat/Transfer but not by Stata;
>
> -- character with ASCII code 255 is a usual suspect;
>
> -- pasting raw data to Statalist is likely not to reveal the problem,
> since the special characters might not survive massaging through email;
>
> -- isolating the problem in a text editor into a new file could help
> (keep the last record read in correctly and the one immediately after),
> then make the file available through a link, to retain its binary
> structure; not all text editors retain special characters on save;
>
> -- use -hexdump "file", analyze tabulate- to see unprintable
> characters, then search for them in the file or use -filefilter-;
>
> -- see "zap gremlins" for a relevant tactic.
>
> On the bright side: you are lucky you have 363 cases. Last time I had
> this problem, only 16 GB out of a 40 GB file were read in. Try opening
> that file in Notepad :)
>
> Hope this helps.
>
> Best, Sergiy Radyakin
>
>
> On Thu, Feb 20, 2014 at 12:34 PM, Radwin, David <dradwin@rti.org> wrote:
>> One other possibility is to use -inputst-, a Stata program that calls
>> Stat/Transfer (part of -stcmd- by Roger Newson and available at SSC).
>>
>> This workaround is probably less computationally efficient than the
>> suggestions from others, but since you already know that Stat/Transfer
>> works, this approach might be faster and easier than trying to figure out
>> the problem with your text files and -insheet- or -import delimited-.
>>
>> David
>> --
>> David Radwin, Senior Research Associate
>> Education and Workforce Development
>> RTI International
>> 2150 Shattuck Ave. Suite 800, Berkeley, CA 94704
>> Phone: 510-665-8274
>>
>> www.rti.org/education
>>
>>
>>> -----Original Message-----
>>> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-
>>> statalist@hsphsun2.harvard.edu] On Behalf Of Phil Schumm
>>> Sent: Thursday, February 20, 2014 6:38 AM
>>> To: Statalist Statalist
>>> Subject: Re: st: insheet and dropping cases
>>>
>>> On Feb 20, 2014, at 8:28 AM, Ben Hoen <bhoen@lbl.gov> wrote:
>>> > Hexdump I had never used.  This is what it returned:
>>>
>>> <snip>
>>>
>>> > Do you see anything suspicious here?  (I replaced all the commas with
>>> "_", using filefilter - another great suggestion -  wondering if that was
>>> causing any issues and insheet still returned 184 observations.)
>>>
>>>
>>> I don't see anything obvious -- you'll need to look at the file directly.
>>> Is Stata reading the first 184 observations, or are the 184 observations
>>> from different places in the file?  Check that first, and if you are
>>> getting the first 184 observations, then look at lines 184-6 (depending
>>> on whether the file has a header line).  Something has to be going on
>>> there.
>>>
>>>
>>> -- Phil
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

