[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
Karyen Chu <k_chu@uclink.berkeley.edu> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: wrong number of observations after append |

Date |
Thu, 05 Sep 2002 00:21:28 -0700 |

Thank you to William Gould and Nick Cox for your helpful suggestions, particularly for the suggested methodical process for finding the problem.

It turns out that despite having been advertised as datasets for a single subject, there were observations mixed in from some of the other subjects into the problem datasets. I had simply read each subject's dataset into Stata and ran a describe but had not checked the claim that each dataset contained only one subject id.

Thank you again.

Karyen Chu

At 11:11 AM 9/3/2002 -0500, you wrote:

Karyen Chu <k_chu@uclink.berkeley.edu> reports,

> I have 50 individual-specific datasets ([...] one subject ID per dataset)

> [..]

>

> I then appended all 50 datasets into 1 very large dataset and discovered

> that some of the subjects now have the wrong number of observations! Some

> have too many observations and some have too few observations although the

> total number of observations for all 50 subjects is correct.

Karyen provided -table- output along with annotated notes about what the

number ought to be. One is tempted just to dimiss this except that

in Karyen's posting is was obvious that she had already done a lot of work

and she provided enough information so that I, too, must admit that I am

at a loss.

I took Karyen's table, deleted all the rows where numbers were as Karyen

excepted, and then added my own annotation:

---------------------- My annotation

subjid | Freq. Karyen's annoation freq-exp

----------+----------- ------------------------ -------

28722 | 19,296 (should have 19,299 obs) -3

50910 | 23,971 (should have 23,972 obs) -1

119669 | 59,245 (should have 59,244 obs) +1

209171 | 7,991 (should have 7,989 obs) +2

323652 | 2,267 (should have 2,269 obs) -2

459852 | 14,024 (should have 14,023 obs) +1

666481 | 16,679 (should have 16,678 obs) +1

836796 | 11,495 (should have 11,494 obs) +1

---------------------- ----------

0

I also checked all the subjid numbers to see if there could be a float rounding

issue. The answer is no, all numbers fit into a float just fine and, even if

they did not, all the commands Karyen used are advertised as being robust to

such problems.

So I, like Karyen, am at a loss. I do, however, have advice on how to figure

out what is going on.

1. Trust nothing and no one

----------------------------

In tracing this down, do not trust Stata, do not trust what you have been

told, do not even trust yourself. Verify everything.

My suspicion is that Karyen's assumptions are wrong and that the datasets with

which Karyen started have the number of observations Stata reported in the

final -table-. I would not even have bothered to respond to Karyen's posting

had she not written down the number she expected next to each group. Where, I

asked myself, did she get those numbers if not from already examining her

assumptions?

Even so, I ask that Karyen once again examine her assumptions. -use- the

datasets user_usage.X1.28722.dta, user_usage.X1.50910.dta, ..., -describe-

each, and verify that the number of observations is as you expect.

If you find that they are, ask yourself whether you have more than one

copy of the datasets, and whether you perhaps used another copy when you

ran your program.

2. Mechanize

-------------

Karyen claims to have run the code

---------------------------------------------------------------------------

use user_usage.X1.28722.dta, clear

describe

foreach subj of numlist 50910 54476 87734 119669 123614 /*

*/ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /*

*/ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /*

*/ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /*

*/ 666481 677056 717037 /*

*/ 751384 763586 788300 828191 836796 876142 /*

*/ 917942 929316 943493 955867 968002 {

append using user_usage.X1.`subj'.dta

capture noisily save user_usage.X1.merge.dta, replace

}

compress

sort subjid startedate start_hr start_min start_sec /*

*/ endedate end_hr end_min end_sec

save user_usage.X1.merge.dta, replace

describe

table subjid

---------------------------------------------------------------------------

Verify this is true. Put the code in a do-file (if it is not already), turn

on a log, and and run the do-file. Close the log and keep it. Do you still

have the problem?

If so, rename this do-file problem1.do. Anytime you want to recreate the

problem, all you have do is type "do problem1".

2. Simplify

------------

We are now going to make problem2.do, problem3.do, ..., at each step

simplifying the problem a little until (1) it becomes obvious to us what the

problem is or (2) the problem vanishes. If (2), we will have to explore what

is happening beteen problem{K}.do and problem{K+1}.do, perhaps by breaking it

into more steps.

The first thing I am going to suggest is silly, but it will set the standard

by which I want Karyen to operate. In each step, we make a small change.

Why, Karyen, the -capture noisily- in front of -save user_usage.X1.merge.dta,

replace-? Well, I don't care about the answer, and I cannot imagine that

-capture noisily- is causing any problem, but -capture- has the ability to

hide problems, so remove the -capture noisily-. Call that new do-file

problem2.do. Run it. Still have the problem?

In problem3.do, let's consider the idea -table- is wrong. Add

tabulate subid, missing

before or after the -table-. Run it. Output should match. Does it?

In problem4.do, let's focus on just the ids that have the problem. Karyen's

code currently begins

use user_usage.X1.28722.dta, clear

describe

foreach subj of numlist 50910 54476 87734 119669 123614 /*

*/ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /*

*/ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /*

*/ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /*

*/ 666481 677056 717037 /*

*/ 751384 763586 788300 828191 836796 876142 /*

*/ 917942 929316 943493 955867 968002 {

change the -foreach- to include just 28722, 50910, 119669, 209171, 323652,

459852, 666481, and 836796. Run the do-file. Still have the problem?

In problem5.do, let's try just the pair 50910 and 119669:

use user_usage.X1.50910.dta, clear

describe

foreach subj of numlist 119669 {

Run it. Still have the problem?

Keep going like this. At some point, let's introduce into the code the

proof that the input datasets are as you claim:

use user_usage.X1...., clear

describe

tabulate subjid, missing /* <- new */

foreach subj of numlist ... {

preserve

display "dataset for `subj'":

use user_usage.X1.`subj'.dta, clear

tabulate subjid, missing

restore

append using user_usage.X1.`subj'.dta

display "result:"

tabulate subjid, missing

save user_usage.X1.merge.dta, replace

}

...

Follow these steps and eventually, Karyen, you will find the problem.

-- Bill

wgould@stata.com

*

* For searches and help try:

* http://www.stata.com/support/faqs/res/findit.html

* http://www.stata.com/support/statalist/faq

* http://www.ats.ucla.edu/stat/stata/

* * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**Re: st: wrong number of observations after append***From:*wgould@stata.com (William Gould)

- Prev by Date:
**st: Instrumental variable est** - Next by Date:
**st: Graph Window (Linux)** - Previous by thread:
**Re: st: wrong number of observations after append** - Next by thread:
**st: adjksm bugfix available at ssc-ideas** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |