
RE: st: problem in uploading data into Stata - data "changes"

From   "Nick Cox" <>
To   <>
Subject   RE: st: problem in uploading data into Stata - data "changes"
Date   Tue, 1 Jul 2008 19:00:39 +0100

You would be much better off reading in your identifiers either as
string variables or as doubles. Stata can't hold 14 digit variables
exactly in floats. This is documented in several places: -search
precision- for some. 
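As a small illustration of the precision point (a sketch only; the identifier value is borrowed from the sample data quoted below): a float has a 24-bit significand, so it holds every integer exactly only up to 16,777,216 (2^24), which is far short of 14 digits. Stata's float() function rounds a value to float precision, so it can be used to see what storage in a float variable would do.

```stata
* Stata evaluates expressions in double precision, which holds
* integers exactly up to 2^53 (about 9.0e15), enough for 14 digits.
display %16.0f 10101001000501

* float() rounds its argument to float precision, mimicking what
* happens when the identifier is stored in a float variable.
display %16.0f float(10101001000501)

* Floats hold every integer exactly only up to this value.
display %16.0f 2^24
```

The second -display- will not reproduce the original digits, because neighbouring floats of that magnitude are about a million apart.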

If you input the identifiers as string, I can see no reason why you
should also want to -encode- them. 
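Should a compact numeric key still be wanted for some reason, one possible route (an assumption added here, not advice given in this thread) is -egen, group()-, which numbers the distinct values without creating value labels and so does not run into the limit that -encode- hits. The names pid and obsid are hypothetical.

```stata
* Assumes var1 and var2 are already string variables in memory.
* long is requested because there may be over a million groups.
egen long pid = group(var1)
egen long obsid = group(var1 var2)
```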

-format-ting after input will never put back precision that was lost on
input. That is shutting the stable door after the horse has bolted. 
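A minimal sketch of keeping the precision at input time, assuming the fixed-column layout described in the quoted message below (var1 in columns 1-14, var2 in columns 15-22) and the placeholder file name "filename":

```stata
* Option 1: read both identifiers as strings. Nothing is rounded
* and any leading zeros are kept.
infix str var1 1-14 str var2 15-22 using "filename", clear

* isid exits with an error if var1 and var2 together do not
* uniquely identify the observations.
isid var1 var2

* Option 2: read them as numbers wide enough to be exact.
* double holds 14-digit integers exactly; long holds 8-digit ones.
* Leading zeros are not stored in numeric variables.
infix double var1 1-14 long var2 15-22 using "filename", clear
format %14.0f var1
format %8.0f var2
```

The two options are alternatives; each -infix- replaces the data in memory because of the clear option.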


Gisella Young

Thank you for the replies. I have been unable to resolve the problem, so
am copying more details below as requested.

The data in the original text dataset looks as follows
1010100100050101112101  var3 var4 ...
1010100100050101112102  var3 var4 ...
1010100100050101112104  var3 var4 ...
1010100100050101112303  var3 var4 ...
1010100100050101113101  var3 var4 ...

The number in the first column actually holds the first two variables:
var1 is 14 digits and var2 is 8 digits. In the text dataset there is no
space between them. Neither var1 nor var2 is supposed to be unique on its
own, but the combination of them is (and is in the original data). They
do, however, need to be analysed separately: var1 is the person
identifier and var2 is the activity.

I am now using Stat/Transfer to convert the file (specifying the option
ASCII - Delimited). When I look at the data in the "view" option in
Stat/Transfer it looks fine. One relevant point might be that in the
'variables' window of Stat/Transfer, the first variable (which is
actually var1 and var2, treated as one) is listed as string while the
others are floats.

The good news is that I can now make the transfer, and the col1 variable
that comes up in Stata (22 digits, combining var1 and var2) is unique.
One problem, however, is that when I try to encode this variable 'col1',
it does not work: I get error message 134 (that I have tried to encode
too many values). There are just under 1.5 million of them.

I then tried specifying 'col1' in Stat/Transfer as either a float or a
long variable, but neither of these works: with long all the values
come up in Stata as 0, and with float they are no longer unique (no
matter how many digits I allow for when formatting the variable).

I guess one option would be to convert them using Stattransfer in the
original string format, and then find a way of encoding the variables
(despite the problem of too many observations) and then somehow
splitting the 'col1' variable into the 2 variables var1 (first 14
digits) and var2 (next 8 digits).
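The split described here can be sketched as follows, assuming col1 arrived in Stata as a 22-character string variable (var1 and var2 are the names used above; no encoding step is involved):

```stata
* substr(s, start, length) extracts characters from a string.
generate str14 var1 = substr(col1, 1, 14)
generate str8  var2 = substr(col1, 15, 8)

* Confirm that the pair still identifies each observation.
isid var1 var2
```

Kept as strings, the two parts can still be used for sorting, merging and by-group work.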

When I try using infix, my command is:
. infix var1 1-14 var2 15-22 using "filename"

I then format the variables to give them enough places (format %16.0g
var1 var2). When I sort by var1 var2, my first 3 observations are as
follows - clearly the combination of var1 and var2 is not unique:

var1	var2
10101000765440	1111101
10101000765440	1111101
10101000765440	1111101

Any suggestions would be highly appreciated.


--- On Tue, 7/1/08, Steven Samuels <> wrote:

> From: Steven Samuels <>
> Subject: Re: st: problem in uploading data into Stata - data "changes"
> To:
> Date: Tuesday, July 1, 2008, 3:18 PM
> Gisella,
> Show us an example of a data line and your -infix- statements. Also,
> what are the item separators in your text file (commas, tabs, ...)?
> If Excel can figure out the variable columns, then StatTransfer can
> also (see ASCII input options); there is no need to go through Excel.
> -Steve
> On Jul 1, 2008, at 11:05 AM, Gisella Young wrote:
> > Dear all,
> >
> > I am trying to load a datafile in text format into Stata. I am using
> > the infix command. The problem is that one column of data (the firm
> > column, which is the unique identification number for each
> > observation) is different when I open it in Stata from what I can
> > see in the original text file. In fact I have several such text
> > files for various years, and in every case the problem is the same:
> > all variables upload correctly except for the first one. Not only is
> > that number different, but it is no longer unique to each
> > observation. It does, however, have the same number of digits as the
> > original. I have checked that the infix command is specified
> > correctly (e.g. correct number of digits).
> >
> > I have also tried saving the text file into Excel (and applying
> > text-to-columns) and then converting it into a Stata file using
> > Stat-transfer. When I do this all the variables upload correctly
> > into Stata. The problem is that I cannot do this for the entire
> > files because of their size (the limits of Excel mean that only a
> > small fraction of each file can be accommodated), so this is not a
> > solution.
> >
> > I realise that it may be difficult for someone to suggest an
> > explanation/solution without seeing the actual data, but I wonder
> > whether there are any suggestions as to what the problem might
> > potentially be, and how to get around it?
> >
> > Many thanks,
> > Gisella