st: Re: Stata corrupting data?
Deborah Garvey (DGarvey@scu.edu) is concerned that her data was corrupted:
> A worrisome event occurred yesterday while reading in a 1990 Census 5%
> PUMS CA sample abstracted from the IPUMS web site. The data set is 312m
> in size with N = 1.46 m and 84 vars.
> I set memory at 375 m. After I read in the data to STATA, ran
> descriptive stats, and saved the data, I then did a quick cross-tab of 2
> variables that should've yielded a 2x2 matrix. Instead, values were
> changed for a couple of observations, and I ended up with a 3x3 table.
> I not so calmly exited STATA, restarted my computer and checked the
> data. They seem to be fine.
> The data were seriously corrupted when I initially attempted to read in
> 1990 and 2000 5% PUMS abstracts for CA simultaneously. I verified with
> IPUMS that the problem was on my end, and not with their source data.
It is hard to figure out what might have gone wrong without seeing
the commands that were run and knowing the entire history of how
the data got to Deborah's computer.
I do not believe that the problem is in Stata. The part of Stata
dealing with data storage is essentially unchanged in Stata 8 from
Stata 7, and we have had no reports of data corruption in either
version that we were not able to trace to some external cause.
The amount of memory Deborah is allocating to hold this data is
fine. If Stata can work with her dataset with that memory allocation,
there is no need to allocate more to Stata.
The most common reason for data corruption that we see is when
data is copied over an FTP connection in ASCII mode rather than in
binary mode. ASCII mode treats a file like a text file and maps
what it thinks are ends of lines from, say, Unix to DOS. This does
not make sense for a binary file and will almost certainly corrupt it.
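The newline mangling that ASCII mode performs can be sketched in a few
lines of Python. This is an illustration of the mechanism, not FTP code:
any 0x0A byte inside a binary payload is indistinguishable from a Unix
line ending, so an ASCII-mode Unix-to-DOS transfer rewrites it.

```python
# A small binary payload that happens to contain LF (0x0A) bytes.
binary_payload = bytes([0x00, 0x0A, 0xFF, 0x0D, 0x0A, 0x42])

def ascii_mode_unix_to_dos(data: bytes) -> bytes:
    """Simulate an ASCII-mode transfer: map each LF to CRLF."""
    return data.replace(b"\n", b"\r\n")

corrupted = ascii_mode_unix_to_dos(binary_payload)
# Bytes were inserted, so the file is no longer the same size,
# and the payload no longer matches what was sent.
print(len(binary_payload), len(corrupted))
```

A binary-mode transfer copies the bytes untouched, which is why Stata
datasets must always be transferred in binary mode.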
Another possibility is that the Stata dataset was created by some
external source that violated some property of an official Stata
dataset. For Stata 7, we have seen datasets created by some external
source that contain strings with 81 characters instead of the limit
of 80 in that version of Stata. Stata tries to read this data, but
the extra information in it can lead to corruption of memory.
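A guard against that particular problem is to check string lengths before
the data ever reaches a format with a hard limit. The helper below is a
hypothetical illustration in Python, not part of Stata or its file layout:

```python
MAX_STR = 80  # Stata 7's limit on the length of a str variable

def over_limit(values, limit=MAX_STR):
    """Return the values whose byte length exceeds the limit."""
    return [v for v in values if len(v.encode("ascii", "replace")) > limit]

# An 81-character value, legal in some external tools, is flagged here.
names = ["ok value", "x" * 81]
print(over_limit(names))
```

Running a check like this on externally produced data before conversion
is cheaper than diagnosing memory corruption afterward.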
We have also seen flaky memory on a computer lead to data corruption.
A computer that is running too hot can cause this as well. More rare,
but still possible, is a flaky hard disk controller. We have seen
this just a couple of times. Small files could be copied and read
fine from a hard drive, but when large amounts of data were read
from the drive, the controller would make mistakes and introduce errors.
The best way for us to help Deborah will be if she can send the
dataset to our Technical Services department, perhaps burned onto a CD
since it is so large. Deborah should also let Technical Services
(firstname.lastname@example.org) know the exact commands she is using and
whether the data seems to be corrupt from the very beginning when she
reads it in or whether it occurs only after working with it for a while.
If the corruption only occurs after working with the data for a while,
it would be interesting to know if it always becomes corrupt after
running a certain set of commands or if it just appears to happen at
random.