[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
wgould@stata.com (William Gould) |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: wrong number of observations after append |

Date |
Tue, 03 Sep 2002 11:11:47 -0500 |

Karyen Chu <k_chu@uclink.berkeley.edu> reports, > I have 50 individual-specific datasets ([...] one subject ID per dataset) > [..] > > I then appended all 50 datasets into 1 very large dataset and discovered > that some of the subjects now have the wrong number of observations! Some > have too many observations and some have too few observations although the > total number of observations for all 50 subjects is correct. Karyen provided -table- output along with annotated notes about what the number ought to be. One is tempted just to dimiss this except that in Karyen's posting is was obvious that she had already done a lot of work and she provided enough information so that I, too, must admit that I am at a loss. I took Karyen's table, deleted all the rows where numbers were as Karyen excepted, and then added my own annotation: ---------------------- My annotation subjid | Freq. Karyen's annoation freq-exp ----------+----------- ------------------------ ------- 28722 | 19,296 (should have 19,299 obs) -3 50910 | 23,971 (should have 23,972 obs) -1 119669 | 59,245 (should have 59,244 obs) +1 209171 | 7,991 (should have 7,989 obs) +2 323652 | 2,267 (should have 2,269 obs) -2 459852 | 14,024 (should have 14,023 obs) +1 666481 | 16,679 (should have 16,678 obs) +1 836796 | 11,495 (should have 11,494 obs) +1 ---------------------- ---------- 0 I also checked all the subjid numbers to see if there could be a float rounding issue. The answer is no, all numbers fit into a float just fine and, even if they did not, all the commands Karyen used are advertised as being robust to such problems. So I, like Karyen, am at a loss. I do, however, have advice on how to figure out what is going on. 1. Trust nothing and no one ---------------------------- In tracing this down, do not trust Stata, do not trust what you have been told, do not even trust yourself. Verify everything. My suspicion is that Karyen's assumptions are wrong and that the datasets with which Karyen started have the number of observations Stata reported in the final -table-. I would not even have bothered to respond to Karyen's posting had she not written down the number she expected next to each group. Where, I asked myself, did she get those numbers if not from already examining her assumptions? Even so, I ask that Karyen once again examine her assumptions. -use- the datasets user_usage.X1.28722.dta, user_usage.X1.50910.dta, ..., -describe- each, and verify that the number of observations is as you expect. If you find that they are, ask yourself whether you have more than one copy of the datasets, and whether you perhaps used another copy when you ran your program. 2. Mechanize ------------- Karyen claims to have run the code --------------------------------------------------------------------------- use user_usage.X1.28722.dta, clear describe foreach subj of numlist 50910 54476 87734 119669 123614 /* */ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /* */ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /* */ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /* */ 666481 677056 717037 /* */ 751384 763586 788300 828191 836796 876142 /* */ 917942 929316 943493 955867 968002 { append using user_usage.X1.`subj'.dta capture noisily save user_usage.X1.merge.dta, replace } compress sort subjid startedate start_hr start_min start_sec /* */ endedate end_hr end_min end_sec save user_usage.X1.merge.dta, replace describe table subjid --------------------------------------------------------------------------- Verify this is true. Put the code in a do-file (if it is not already), turn on a log, and and run the do-file. Close the log and keep it. Do you still have the problem? If so, rename this do-file problem1.do. Anytime you want to recreate the problem, all you have do is type "do problem1". 2. Simplify ------------ We are now going to make problem2.do, problem3.do, ..., at each step simplifying the problem a little until (1) it becomes obvious to us what the problem is or (2) the problem vanishes. If (2), we will have to explore what is happening beteen problem{K}.do and problem{K+1}.do, perhaps by breaking it into more steps. The first thing I am going to suggest is silly, but it will set the standard by which I want Karyen to operate. In each step, we make a small change. Why, Karyen, the -capture noisily- in front of -save user_usage.X1.merge.dta, replace-? Well, I don't care about the answer, and I cannot imagine that -capture noisily- is causing any problem, but -capture- has the ability to hide problems, so remove the -capture noisily-. Call that new do-file problem2.do. Run it. Still have the problem? In problem3.do, let's consider the idea -table- is wrong. Add tabulate subid, missing before or after the -table-. Run it. Output should match. Does it? In problem4.do, let's focus on just the ids that have the problem. Karyen's code currently begins use user_usage.X1.28722.dta, clear describe foreach subj of numlist 50910 54476 87734 119669 123614 /* */ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /* */ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /* */ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /* */ 666481 677056 717037 /* */ 751384 763586 788300 828191 836796 876142 /* */ 917942 929316 943493 955867 968002 { change the -foreach- to include just 28722, 50910, 119669, 209171, 323652, 459852, 666481, and 836796. Run the do-file. Still have the problem? In problem5.do, let's try just the pair 50910 and 119669: use user_usage.X1.50910.dta, clear describe foreach subj of numlist 119669 { Run it. Still have the problem? Keep going like this. At some point, let's introduce into the code the proof that the input datasets are as you claim: use user_usage.X1...., clear describe tabulate subjid, missing /* <- new */ foreach subj of numlist ... { preserve display "dataset for `subj'": use user_usage.X1.`subj'.dta, clear tabulate subjid, missing restore append using user_usage.X1.`subj'.dta display "result:" tabulate subjid, missing save user_usage.X1.merge.dta, replace } ... Follow these steps and eventually, Karyen, you will find the problem. -- Bill wgould@stata.com * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**st: adjksm bugfix available at ssc-ideas***From:*Isaias salgado <isalgado@servidor.unam.mx>

**Re: st: wrong number of observations after append***From:*Karyen Chu <k_chu@uclink.berkeley.edu>

- Prev by Date:
**[no subject]** - Next by Date:
**st: outdat** - Previous by thread:
**RE: st: Summing up values specific for each individual.** - Next by thread:
**Re: st: wrong number of observations after append** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |