
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: insheet and dropping cases


From   Sergiy Radyakin <serjradyakin@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: insheet and dropping cases
Date   Thu, 20 Feb 2014 21:28:38 -0500

Dear Ben,

With such a replacement, the offending text (quoted earlier)
N 89 DEG 46'47" E
will turn into
N 89 DEG 46'47' E
which might cause problems later if you ever need to recover the exact
coordinates from the imported data.

Other records that you posted were already in DEG-MIN-SEC notation:
N 00 DEG 50 MIN 54 SEC W 431. 78 FT, N 47 DEG 30 MIN W 522.2 6 FT

Given that, I would probably first convert the data, replacing double
quotes with something bright and shiny like #@#@#@ (literally these
symbols, don't get me wrong), then assert that this sequence
appears only in the variable holding the coordinates, then go back
to the original files and apply -filefilter- to replace double quotes
with SEC, to match the alternative coordinate notation already present
in the file. That way you get a cleaner file and solve the importing
problem.
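
The steps above can be sketched roughly as follows. This is only an
illustration under assumptions: IL.txt is the file name from Ben's
message, IL_marked.txt and IL_sec.txt are hypothetical output names, the
pipe is taken to be the delimiter, and -import delimited- (Stata 13) is
used for the trial import. It has not been run against the actual file.

```stata
* Step 1: write a copy in which every double quote (\Q) becomes the marker.
filefilter IL.txt IL_marked.txt, from(\Q) to("#@#@#@") replace

* Step 2: import the marked copy, reading every column as a string.
import delimited using IL_marked.txt, delimiter("|") stringcols(_all) clear

* Step 3: report each variable that contains the marker; ideally only the
* variable holding the coordinates shows up here.
foreach v of varlist _all {
    quietly count if strpos(`v', "#@#@#@") > 0
    if r(N) > 0 display "`v': " r(N) " observation(s) contain the marker"
}

* Step 4: go back to the ORIGINAL file and replace each double quote with
* " SEC", so that 46'47" E becomes 46'47 SEC E.
filefilter IL.txt IL_sec.txt, from(\Q) to(" SEC") replace
display "double quotes replaced: " r(occurrences)
```

The marker pass is only a check; the file you finally import would be
IL_sec.txt, produced from the untouched original.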

Alternatively, the standalone converter tab2dta.exe (shameless
self-promotion) is tolerant of unbalanced quotes within string values,
but requires a tab character as the separator:
http://radyakin.org/transfer/tab2dta/tab2dta.htm

Replacing pipes with tabs should be straightforward with -filefilter-.
With this approach you distort the separators and leave the content
intact, while with the first approach you distort the content and
leave the separators intact. Once the data are read into Stata,
separators don't exist anymore, so nobody is hurt if they were
transformed in the process. Content alterations might matter.
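
A minimal sketch of the separator swap, assuming the input is IL.txt and
using IL_tab.txt as a hypothetical output name. The pipe is written as
its decimal code (124 in the hexdump tabulation quoted below), and that
tabulation lists no character 9, so the original file should not already
contain tabs that the new separators could collide with.

```stata
* Replace every pipe (decimal 124) with a tab (\t); the content, including
* the quotes inside the coordinates, is left untouched.
filefilter IL.txt IL_tab.txt, from(\124d) to(\t) replace

* r(occurrences) holds the number of replacements; it can be compared with
* the frequency that hexdump reports for the pipe character.
display "pipes replaced: " r(occurrences)
```

The resulting tab-separated file is what would then be passed to the
converter.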

Best, Sergiy Radyakin






On Thu, Feb 20, 2014 at 3:48 PM, Ben Hoen <bhoen@lbl.gov> wrote:
> I found the workaround for changing the double quote to single:
>
> filefilter IL.txt IL2.txt, from(\Q) to(\RQ) replace
>
> Thank you all for helping me through this frustrating problem today.
>
> As always, I really do not know what I would do without the brilliance and
> helpfulness of this online community.
>
> Cheers,
>
> Ben
>
> Ben Hoen
> LBNL
> Office: 845-758-1896
> Cell: 718-812-7589
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sergiy Radyakin
> Sent: Thursday, February 20, 2014 3:21 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: insheet and dropping cases
>
> Hello Ben,
> the report is helpful, and it is safe to post because it is Stata's
> output, which doesn't contain anything unprintable. Note how Stata
> writes the escape sequences \n and \r for the unprintable characters
> 10 and 13. For a description of the unprintable ASCII characters and
> their role in controlling text, see e.g. the following page:
> http://www.juniper.net/techpubs/en_US/idp5.1/topics/reference/general/intrusion-detection-prevention-custom-attack-object-extended-ascii.html
> or google them (plenty of links); most of them are archaic.
>
> We focus on the 0-31 range. You only have 10 and 13, which form the
> typical end-of-line pattern \r\n. There are no gremlins to zap, so to
> speak. Also, \r and \n have the same frequency, which means
> that they are likely to be properly paired at the end of each
> line.
>
> There is also nothing in the upper (non-ASCII) range, characters 128-255.
>
> To be sure that the report itself is correct, verify that the total
> file length as reported by the OS is the sum of the frequencies of all
> characters (394,625).
>
> I note the use of single and double quotes to denote minutes and
> seconds in the coordinates. Perhaps this confuses Stata. In some
> records you posted I see "MIN" as a word; in others it is a quote
> character. On seeing a double quote, even in a something-separated
> file, Stata would seek the end of the string, which could be a long
> way from where the quote opened. If you expect double quotes to denote
> seconds and single quotes minutes, run filefilter on them in advance,
> and retry.
>
> Hope this helps, Sergiy Radyakin
>
>
>
>
>
> On Thu, Feb 20, 2014 at 2:19 PM, Ben Hoen <bhoen@lbl.gov> wrote:
>> Hi Sergiy,
>>
>> I am pasting in the tabulate from hexdump (not knowing how to provide a
>> link to those files as you suggest):
>>
>> Tabulation (character not listed if     unobserved):
>> Dec Hex  Char        Frequency
>> ------------------------------
>> 010  0a  \n                364
>> 013  0d  \r                364
>> 032  20  blank           9,621
>> 033  21  !                   9
>> 034  22  "                   5
>> 035  23  #                  21
>> 038  26  &                 202
>> 039  27  '                 135
>> 040  28  (                  30
>> 041  29  )                  29
>> 042  2a  *                   4
>> 043  2b  +                   7
>> 044  2c  ,                 112
>> 045  2d  -               3,378
>> 046  2e  .                 282
>> 047  2f  /                 337
>> 048  30  0             157,131
>> 049  31  1              18,056
>> 050  32  2              13,187
>> 051  33  3               8,837
>> 052  34  4               8,087
>> 053  35  5               6,803
>> 054  36  6               7,456
>> 055  37  7               6,283
>> 056  38  8               6,322
>> 057  39  9               6,333
>> 058  3a  :                  98
>> 059  3b  ;                  24
>> 064  40  @                   1
>> 065  41  A               5,418
>> 066  42  B               1,197
>> 067  43  C               3,167
>> 068  44  D               2,399
>> 069  45  E               5,718
>> 070  46  F               1,067
>> 071  47  G               1,612
>> 072  48  H               1,597
>> 073  49  I               4,112
>> 074  4a  J                 300
>> 075  4b  K                 873
>> 076  4c  L               4,877
>> 077  4d  M               1,693
>> 078  4e  N               4,099
>> 079  4f  O               4,254
>> 080  50  P               1,343
>> 081  51  Q                 149
>> 082  52  R               4,634
>> 083  53  S               3,272
>> 084  54  T               3,756
>> 085  55  U               1,162
>> 086  56  V                 865
>> 087  57  W               1,488
>> 088  58  X                 151
>> 089  59  Y               1,369
>> 090  5a  Z                  67
>> 095  5f  _                 726
>> 097  61  a                 817
>> 098  62  b                  73
>> 099  63  c                 147
>> 100  64  d                 323
>> 101  65  e                 887
>> 102  66  f                  65
>> 103  67  g                 199
>> 104  68  h                 189
>> 105  69  i                 498
>> 107  6b  k                 233
>> 108  6c  l                 419
>> 109  6d  m                 111
>> 110  6e  n                 616
>> 111  6f  o                 872
>> 112  70  p                 107
>> 113  71  q                   4
>> 114  72  r                 581
>> 115  73  s                 252
>> 116  74  t                 390
>> 117  75  u                 132
>> 118  76  v                 172
>> 119  77  w                  74
>> 120  78  x                  13
>> 121  79  y                  90
>> 122  7a  z                   7
>> 124  7c  |             72,436
>> 125  7d  }                  35
>> ------------------------------
>> Total                  394,625
>>
>> It is not clear to me what the problem characters are - unprintable/special
>> or not - but I tried replacing the "}" character (and the comma previously)
>> to no avail.
>>
>> Separately I think I isolated the fields that contain the problems.  Is
>> there a way to ignore/remove individual fields in a txt file from within
>> Stata?
>>
>> Thank you for your efforts in helping me with this issue.
>>
>> Ben
>>
>> Ben Hoen
>> LBNL
>> Office: 845-758-1896
>> Cell: 718-812-7589
>>
>>
>> -----Original Message-----
>> From: owner-statalist@hsphsun2.harvard.edu
>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sergiy Radyakin
>> Sent: Thursday, February 20, 2014 1:28 PM
>> To: statalist@hsphsun2.harvard.edu
>> Subject: Re: st: insheet and dropping cases
>>
>> Ben,
>>
>> -- the problem is likely caused by the presence of unprintable characters
>> in the file, which are tolerated by Stat/Transfer but not by Stata;
>>
>> -- character with ASCII code 255 is a usual suspect;
>>
>> -- pasting raw data to statalist is not likely to reveal the problem,
>> since the special characters might not survive massaging through emails;
>>
>> -- isolating the problem in the text editor into a new file could help
>> (keep the last record read in correctly and one immediately after),
>> then make the file available through a link, to retain its binary
>> structure, not all text editors will retain special chars on save;
>>
>> -- use hexdump "file" , analyze tabulate to see unprintable
>> characters, then search for them in the file or use filefilter;
>>
>> -- see "zap gremlins" for relevant tactic.
>>
>> On the bright side: you are lucky you have 363 cases. Last time I had
>> this problem, only 16 GB out of 40 GB were read in. Try to open that
>> file in Notepad :)
>>
>> Hope this helps.
>>
>> Best, Sergiy Radyakin
>>
>>
>> On Thu, Feb 20, 2014 at 12:34 PM, Radwin, David <dradwin@rti.org> wrote:
>>> One other possibility is to use -inputst-, a Stata program that calls
>>> Stat/Transfer (part of -stcmd- by Roger Newson and available at SSC).
>>>
>>> This workaround is probably less computationally efficient than the
>>> suggestions from others, but since you already know that Stat/Transfer
>>> works, this approach might be faster and easier than trying to figure out
>>> the problem with your text files and -insheet- or -import delimited-.
>>>
>>> David
>>> --
>>> David Radwin, Senior Research Associate
>>> Education and Workforce Development
>>> RTI International
>>> 2150 Shattuck Ave. Suite 800, Berkeley, CA 94704
>>> Phone: 510-665-8274
>>>
>>> www.rti.org/education
>>>
>>>
>>>> -----Original Message-----
>>>> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-
>>>> statalist@hsphsun2.harvard.edu] On Behalf Of Phil Schumm
>>>> Sent: Thursday, February 20, 2014 6:38 AM
>>>> To: Statalist Statalist
>>>> Subject: Re: st: insheet and dropping cases
>>>>
>>>> On Feb 20, 2014, at 8:28 AM, Ben Hoen <bhoen@lbl.gov> wrote:
>>>> > Hexdump I had never used.  This is what it returned:
>>>>
>>>> <snip>
>>>>
>>>> > Do you see anything suspicious here?  (I replaced all the commas with
>>>> > "_", using filefilter - another great suggestion -  wondering if that
>>>> > was causing any issues and insheet still returned 184 observations.)
>>>>
>>>>
>>>> I don't see anything obvious -- you'll need to look at the file
>>>> directly. Is Stata reading the first 184 observations, or are the 184
>>>> observations from different places in the file?  Check that first, and
>>>> if you are getting the first 184 observations, then look at lines 184-6
>>>> (depending on whether the file has a header line).  Something has to be
>>>> going on there.
>>>>
>>>>
>>>> -- Phil
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>>

