
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: insheet and dropping cases


From   "Ben Hoen" <[email protected]>
To   <[email protected]>
Subject   RE: st: insheet and dropping cases
Date   Thu, 20 Feb 2014 14:19:09 -0500

Hi Sergiy,

I am pasting in the tabulate from hexdump (not knowing how to provide a link
to those files as you suggest):

Tabulation (character not listed if unobserved):
Dec Hex  Char        Frequency
------------------------------
010  0a  \n                364
013  0d  \r                364
032  20  blank           9,621
033  21  !                   9
034  22  "                   5
035  23  #                  21
038  26  &                 202
039  27  '                 135
040  28  (                  30
041  29  )                  29
042  2a  *                   4
043  2b  +                   7
044  2c  ,                 112
045  2d  -               3,378
046  2e  .                 282
047  2f  /                 337
048  30  0             157,131
049  31  1              18,056
050  32  2              13,187
051  33  3               8,837
052  34  4               8,087
053  35  5               6,803
054  36  6               7,456
055  37  7               6,283
056  38  8               6,322
057  39  9               6,333
058  3a  :                  98
059  3b  ;                  24
064  40  @                   1
065  41  A               5,418
066  42  B               1,197
067  43  C               3,167
068  44  D               2,399
069  45  E               5,718
070  46  F               1,067
071  47  G               1,612
072  48  H               1,597
073  49  I               4,112
074  4a  J                 300
075  4b  K                 873
076  4c  L               4,877
077  4d  M               1,693
078  4e  N               4,099
079  4f  O               4,254
080  50  P               1,343
081  51  Q                 149
082  52  R               4,634
083  53  S               3,272
084  54  T               3,756
085  55  U               1,162
086  56  V                 865
087  57  W               1,488
088  58  X                 151
089  59  Y               1,369
090  5a  Z                  67
095  5f  _                 726
097  61  a                 817
098  62  b                  73
099  63  c                 147
100  64  d                 323
101  65  e                 887
102  66  f                  65
103  67  g                 199
104  68  h                 189
105  69  i                 498
107  6b  k                 233
108  6c  l                 419
109  6d  m                 111
110  6e  n                 616
111  6f  o                 872
112  70  p                 107
113  71  q                   4
114  72  r                 581
115  73  s                 252
116  74  t                 390
117  75  u                 132
118  76  v                 172
119  77  w                  74
120  78  x                  13
121  79  y                  90
122  7a  z                   7
124  7c  |             72,436
125  7d  }                  35
------------------------------
Total                  394,625

It is not clear to me what the problem characters are, unprintable/special
or not, but I tried replacing the "}" character (and, previously, the comma)
to no avail.
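
Two things in the tabulation above can be checked line by line: 72,436 pipes
over 364 line endings is exactly 199 per line, which would fit 200
pipe-delimited fields per record, and the double quote appears 5 times, an
odd count. A minimal Python sketch (run outside Stata) that flags lines whose
pipe count differs from the first line's or that hold an odd number of double
quotes. The pipe delimiter and the name `suspicious_lines` are assumptions,
not something confirmed in this thread:

```python
def suspicious_lines(lines, delimiter="|"):
    """Return (line_number, delimiter_count, quote_count) for lines that look odd.

    A line is flagged when its delimiter count differs from the first line's,
    or when it contains an odd number of double quotes.
    """
    flagged = []
    expected = None
    for number, line in enumerate(lines, start=1):
        n_delims = line.count(delimiter)
        n_quotes = line.count('"')
        if expected is None:
            expected = n_delims  # first line (header or record) sets the baseline
        if n_delims != expected or n_quotes % 2 == 1:
            flagged.append((number, n_delims, n_quotes))
    return flagged


if __name__ == "__main__":
    # Made-up sample data, not taken from the real file.
    sample = ['id|name|note', '1|Ann|ok', '2|Bob "Jr|ok', '3|Cy|a|b']
    for row in suspicious_lines(sample):
        print(row)
```

For a real file, the lines could be read with
open(path, encoding="latin-1", newline="").read().splitlines(), where path is
whatever the text file is called; latin-1 is used only because it accepts
every byte value.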

Separately I think I isolated the fields that contain the problems.  Is
there a way to ignore/remove individual fields in a txt file from within
Stata?
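
If preprocessing outside Stata is acceptable, one option is to drop the
suspect columns before importing. A minimal Python sketch, assuming a
pipe-delimited file in which no field contains the delimiter; `drop_fields`,
`drop_fields_in_file` and the zero-based positions are hypothetical names and
conventions chosen for illustration:

```python
def drop_fields(line, positions, delimiter="|"):
    """Remove the fields at the given zero-based positions from one line.

    Splits on the raw delimiter only; it does not interpret quotes.
    """
    unwanted = set(positions)
    fields = line.split(delimiter)
    kept = [field for index, field in enumerate(fields) if index not in unwanted]
    return delimiter.join(kept)


def drop_fields_in_file(src, dst, positions, delimiter="|", encoding="latin-1"):
    """Copy src to dst line by line, omitting the unwanted fields."""
    # newline="" leaves the original line endings (e.g. \r\n) untranslated.
    with open(src, encoding=encoding, newline="") as fin, \
         open(dst, "w", encoding=encoding, newline="") as fout:
        for line in fin:
            body = line.rstrip("\r\n")
            ending = line[len(body):]
            fout.write(drop_fields(body, positions, delimiter) + ending)
```

The reduced file can then be read with -insheet- or -import delimited- as
before, which also helps confirm whether the dropped fields were the cause.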

Thank you for your efforts in helping me with this issue.

Ben

Ben Hoen
LBNL
Office: 845-758-1896
Cell: 718-812-7589


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Sergiy Radyakin
Sent: Thursday, February 20, 2014 1:28 PM
To: [email protected]
Subject: Re: st: insheet and dropping cases

Ben,

-- the problem is likely caused by the presence of unprintable characters
in the file, which are tolerated by Stat/Transfer but not by Stata;

-- character with ASCII code 255 is a usual suspect;

-- pasting raw data to statalist is not likely to reveal the problem,
since the special characters might not survive being passed through email;

-- isolating the problem in the text editor into a new file could help
(keep the last record read in correctly and the one immediately after);
then make the file available through a link, to retain its binary
structure. Note that not all text editors will retain special chars on save;

-- use -hexdump "file", analyze tabulate- to see unprintable
characters, then search for them in the file or use -filefilter-;

-- see "zap gremlins" for relevant tactic.
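
The same checks can be sketched outside Stata. The Python below counts bytes
that fall outside printable ASCII (tab, LF and CR are treated as acceptable)
and returns a copy with them removed, roughly what a "zap gremlins" pass does.
The function names and the choice of which bytes to keep are assumptions, not
Stata or text-editor behavior:

```python
from collections import Counter

# Bytes treated as harmless: printable ASCII (32-126) plus tab, LF and CR.
ALLOWED = set(range(32, 127)) | {9, 10, 13}


def count_gremlins(data: bytes) -> Counter:
    """Tabulate the byte values in data that fall outside ALLOWED."""
    return Counter(b for b in data if b not in ALLOWED)


def zap_gremlins(data: bytes) -> bytes:
    """Return data with every byte outside ALLOWED removed."""
    return bytes(b for b in data if b in ALLOWED)


if __name__ == "__main__":
    # Made-up sample containing byte 255 and a NUL byte.
    raw = b"1|abc\xff|2\r\n3|d\x00ef|4\r\n"
    print(dict(count_gremlins(raw)))
    print(zap_gremlins(raw))
```

Reading and writing the file in binary mode (open(path, "rb") and
open(path, "wb")) keeps every other byte exactly as it was, so the cleaned
copy differs from the original only in the removed characters.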

On the bright side: you are lucky you have 363 cases. Last time I had
this problem, only 16gb out of 40gb were read in. Try to open that
file in the notepad :)

Hope this helps.

Best, Sergiy Radyakin


On Thu, Feb 20, 2014 at 12:34 PM, Radwin, David <[email protected]> wrote:
> One other possibility is to use -inputst-, a Stata program that calls
> Stat/Transfer (part of -stcmd- by Roger Newson and available at SSC).
>
> This workaround is probably less computationally efficient than the
> suggestions from others, but since you already know that Stat/Transfer
> works, this approach might be faster and easier than trying to figure out
> the problem with your text files and -insheet- or -import delimited-.
>
> David
> --
> David Radwin, Senior Research Associate
> Education and Workforce Development
> RTI International
> 2150 Shattuck Ave. Suite 800, Berkeley, CA 94704
> Phone: 510-665-8274
>
> www.rti.org/education
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:owner-
>> [email protected]] On Behalf Of Phil Schumm
>> Sent: Thursday, February 20, 2014 6:38 AM
>> To: Statalist Statalist
>> Subject: Re: st: insheet and dropping cases
>>
>> On Feb 20, 2014, at 8:28 AM, Ben Hoen <[email protected]> wrote:
>> > Hexdump I had never used.  This is what it returned:
>>
>> <snip>
>>
>> > Do you see anything suspicious here?  (I replaced all the commas with
>> > "_", using filefilter - another great suggestion - wondering if that was
>> > causing any issues, and insheet still returned 184 observations.)
>>
>>
>> I don't see anything obvious -- you'll need to look at the file directly.
>> Is Stata reading the first 184 observations, or are the 184 observations
>> from different places in the file?  Check that first, and if you are
>> getting the first 184 observations, then look at lines 184-6 (depending on
>> whether the file has a header line).  Something has to be going on there.
>>
>>
>> -- Phil
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/


