Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: wrong number of observations after append


From   Karyen Chu <k_chu@uclink.berkeley.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: wrong number of observations after append
Date   Thu, 05 Sep 2002 00:21:28 -0700


Thank you to William Gould and Nick Cox for your helpful suggestions, particularly for the suggested methodical process for finding the problem.

It turns out that despite having been advertised as datasets for a single subject, there were observations mixed in from some of the other subjects into the problem datasets. I had simply read each subject's dataset into Stata and ran a describe but had not checked the claim that each dataset contained only one subject id.

Thank you again.


Karyen Chu




At 11:11 AM 9/3/2002 -0500, you wrote:

Karyen Chu <k_chu@uclink.berkeley.edu> reports,

> I have 50 individual-specific datasets ([...] one subject ID per dataset)
> [..]
>
> I then appended all 50 datasets into 1 very large dataset and discovered
> that some of the subjects now have the wrong number of observations! Some
> have too many observations and some have too few observations although the
> total number of observations for all 50 subjects is correct.

Karyen provided -table- output along with annotated notes about what the
number ought to be. One is tempted just to dimiss this except that
in Karyen's posting is was obvious that she had already done a lot of work
and she provided enough information so that I, too, must admit that I am
at a loss.

I took Karyen's table, deleted all the rows where numbers were as Karyen
excepted, and then added my own annotation:

---------------------- My annotation
subjid | Freq. Karyen's annoation freq-exp
----------+----------- ------------------------ -------
28722 | 19,296 (should have 19,299 obs) -3
50910 | 23,971 (should have 23,972 obs) -1
119669 | 59,245 (should have 59,244 obs) +1
209171 | 7,991 (should have 7,989 obs) +2
323652 | 2,267 (should have 2,269 obs) -2
459852 | 14,024 (should have 14,023 obs) +1
666481 | 16,679 (should have 16,678 obs) +1
836796 | 11,495 (should have 11,494 obs) +1
---------------------- ----------
0

I also checked all the subjid numbers to see if there could be a float rounding
issue. The answer is no, all numbers fit into a float just fine and, even if
they did not, all the commands Karyen used are advertised as being robust to
such problems.

So I, like Karyen, am at a loss. I do, however, have advice on how to figure
out what is going on.


1. Trust nothing and no one
----------------------------

In tracing this down, do not trust Stata, do not trust what you have been
told, do not even trust yourself. Verify everything.

My suspicion is that Karyen's assumptions are wrong and that the datasets with
which Karyen started have the number of observations Stata reported in the
final -table-. I would not even have bothered to respond to Karyen's posting
had she not written down the number she expected next to each group. Where, I
asked myself, did she get those numbers if not from already examining her
assumptions?

Even so, I ask that Karyen once again examine her assumptions. -use- the
datasets user_usage.X1.28722.dta, user_usage.X1.50910.dta, ..., -describe-
each, and verify that the number of observations is as you expect.

If you find that they are, ask yourself whether you have more than one
copy of the datasets, and whether you perhaps used another copy when you
ran your program.


2. Mechanize
-------------

Karyen claims to have run the code


---------------------------------------------------------------------------
use user_usage.X1.28722.dta, clear
describe

foreach subj of numlist 50910 54476 87734 119669 123614 /*
*/ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /*
*/ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /*
*/ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /*
*/ 666481 677056 717037 /*
*/ 751384 763586 788300 828191 836796 876142 /*
*/ 917942 929316 943493 955867 968002 {

append using user_usage.X1.`subj'.dta

capture noisily save user_usage.X1.merge.dta, replace
}
compress

sort subjid startedate start_hr start_min start_sec /*
*/ endedate end_hr end_min end_sec

save user_usage.X1.merge.dta, replace

describe

table subjid

---------------------------------------------------------------------------

Verify this is true. Put the code in a do-file (if it is not already), turn
on a log, and and run the do-file. Close the log and keep it. Do you still
have the problem?

If so, rename this do-file problem1.do. Anytime you want to recreate the
problem, all you have do is type "do problem1".


2. Simplify
------------

We are now going to make problem2.do, problem3.do, ..., at each step
simplifying the problem a little until (1) it becomes obvious to us what the
problem is or (2) the problem vanishes. If (2), we will have to explore what
is happening beteen problem{K}.do and problem{K+1}.do, perhaps by breaking it
into more steps.

The first thing I am going to suggest is silly, but it will set the standard
by which I want Karyen to operate. In each step, we make a small change.

Why, Karyen, the -capture noisily- in front of -save user_usage.X1.merge.dta,
replace-? Well, I don't care about the answer, and I cannot imagine that
-capture noisily- is causing any problem, but -capture- has the ability to
hide problems, so remove the -capture noisily-. Call that new do-file
problem2.do. Run it. Still have the problem?

In problem3.do, let's consider the idea -table- is wrong. Add

tabulate subid, missing

before or after the -table-. Run it. Output should match. Does it?

In problem4.do, let's focus on just the ids that have the problem. Karyen's
code currently begins

use user_usage.X1.28722.dta, clear
describe

foreach subj of numlist 50910 54476 87734 119669 123614 /*
*/ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /*
*/ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /*
*/ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /*
*/ 666481 677056 717037 /*
*/ 751384 763586 788300 828191 836796 876142 /*
*/ 917942 929316 943493 955867 968002 {

change the -foreach- to include just 28722, 50910, 119669, 209171, 323652,
459852, 666481, and 836796. Run the do-file. Still have the problem?

In problem5.do, let's try just the pair 50910 and 119669:

use user_usage.X1.50910.dta, clear
describe

foreach subj of numlist 119669 {

Run it. Still have the problem?

Keep going like this. At some point, let's introduce into the code the
proof that the input datasets are as you claim:

use user_usage.X1...., clear
describe
tabulate subjid, missing /* <- new */
foreach subj of numlist ... {
preserve
display "dataset for `subj'":
use user_usage.X1.`subj'.dta, clear
tabulate subjid, missing
restore
append using user_usage.X1.`subj'.dta
display "result:"
tabulate subjid, missing
save user_usage.X1.merge.dta, replace
}
...

Follow these steps and eventually, Karyen, you will find the problem.

-- Bill
wgould@stata.com
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index