Stata The Stata listserver

Re: st: frustrated by missing variables--collapase and merge


From   Julia Gamas <jgamas@mit.edu>
To   zhou.yu@usc.edu
Subject   Re: st: frustrated by missing variables--collapase and merge
Date   Tue, 29 Mar 2005 11:21:37 -0500

Thanks, this is interesting.  I also work with large datasets and start every
Stata session by increasing the memory to 500m.  I have extra memory for that
purpose and wish I had more.  Also, I believe that the Stata that I have can
only handle x number of variables; isn't there another version of Stata that
can handle more (or is that just for matrices)?  Another command, "compress",
can help if you have very large datasets.  And finally, what I find with large
datasets is that I can "chop them up" using Stat/Transfer, collapse each piece
or do operations on each one, and then merge the smaller datasets back
together using, in my case, the geographic area that they belong to.  Perhaps
you've already tried it.
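
As a rough sketch of that chop-up-and-recombine idea done entirely inside
Stata (rather than with Stat/Transfer), something like the following might
work.  The file name bigdata.dta, the variables region, year, var1 and var2,
and the region codes 1 to 5 are all made up for illustration, and the pieces
are stacked back together with append instead of merge.

```stata
* Sketch only: run from a do-file, since tempfile names are local macros.
* bigdata.dta, region, year, var1, var2 are hypothetical names.
tempfile combined
forvalues r = 1/5 {                          // assumes regions are coded 1..5
    use if region == `r' using bigdata, clear   // load one region at a time
    compress                                 // shrink storage types where possible
    collapse (sum) var1 (mean) var2, by(region year)
    if `r' > 1 {
        append using `combined'              // add the regions collapsed so far
    }
    save `combined', replace
}
use `combined', clear
sort region year
```

Only one region's raw data is in memory at a time, so the memory needed is
driven by the largest piece rather than by the whole file.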
Thanks for the tip on bysort.
Julia

Quoting Zhou Yu <zhou.yu@usc.edu>:

> Julia, thanks a lot for your note.  The problem seems to be mitigated 
> when I set the memory size to less than 800m. But the missing variable 
> problem still exists even though I have tried various methods. Maybe I 
> should upgrade my computer. I have not had such problems before.
> 
> Collapsing is time consuming. Here is a suggestion which I 
> found quite useful:
> 
> 
> "I've never seen variables disappear like that in Stata, but I do have a 
> suggestion. If you are using such a large dataset and need virtual 
> memory, first I'd suggest buying more memory, it is cheap. Second, I 
> wouldn't use collapse, but would instead write the equivalent commands 
> directly. This approach can often save time by avoiding things that 
> collapse needs to do because it is a general tool while you only need a 
> specific result. For example, if your dataset has just x1 - x5 and you 
> want the means of x1-x4 by category of x5, I would use:
> 
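> * running group mean; the last row of each x5 group holds the full mean
> foreach var of varlist x1 x2 x3 x4 {
>     bysort x5: replace `var' = sum(`var')/sum(`var' != .)
> }
> * keep only that last row of each group
> bysort x5: keep if _n == _N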
> 
> This approach will minimize the use of memory and should be quicker than 
> using collapse, trivially for small datasets but perhaps noticeably in a 
> large dataset.
> 
> Michael Blasnik
> michael.blasnik@verizon.net"
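> 
> For comparison, the collapse call that this loop is meant to replace
> would presumably be the following (same hypothetical variables x1-x5):
> 
> ```stata
> * means of x1-x4 within each category of x5, one row per category
> collapse (mean) x1 x2 x3 x4, by(x5)
> ```
> 
> Running both on a small test file is a cheap way to check that the loop
> gives the same means before trusting it on the full dataset.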
> 
> Thanks,
> 
> Zhou
> 
> 
> Julia Gamas wrote:
> 
> >Hi,
> >it depends on what you want to obtain from the collapse and merge.  By
> >merging you shouldn't be losing any variables.  In fact, your dataset
> >should get bigger.  If you had two variables of the same name, then only
> >one of them is kept.  Check that you are merging using ALL the variables
> >relevant to the merge.  For example, if you want to merge by state and
> >city, you would write:
> >"merge state city using yourdatabase".  I've fumbled a few times and
> >gotten nonsense when instead I wrote:
> >"merge using yourdatabase", because Stata didn't know that I wanted it to
> >merge by state and city.  There are also several types of merges, so you
> >may want to make sure that you're using the instructions for the type you
> >want (you may want to merge each line with the corresponding line of the
> >other dataset, or merge each line by matching another variable such as
> >city or state or year, for example).
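> >
> >A sketch of the keyed merge, assuming a Stata release that has the
> >"merge 1:1" syntax (Stata 11 or later) and hypothetical files master.dta
> >and yourdatabase.dta that each have one row per state and city.  In the
> >older "merge state city using yourdatabase" form, both files would first
> >need to be sorted by state and city.
> >
> >```stata
> >* master.dta and yourdatabase.dta are hypothetical file names
> >use master, clear
> >* match rows on both key variables; stops with an error if keys repeat
> >merge 1:1 state city using yourdatabase
> >* _merge: 1 = only in master, 2 = only in yourdatabase, 3 = matched
> >tabulate _merge
> >* check that every expected variable is present after the merge
> >describe
> >```
> >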
> >About collapse, you may lose any variables that aren't included in your
> >expression.  For example, let's say you have the following variables:
> >year var1 var2 var3 and you want to collapse your dataset by year; then
> >you'd write something like:
> >"collapse (sum) var1 (mean) var2 (median) var3, by(year)"
> >But if you forget one of the vars and do:
> >"collapse (sum) var1 (mean) var2, by(year)"
> >you'll lose var3.
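> >
> >One way to guard against this, sketched with the same example variables
> >and a made-up output file yearly.dta, is to check right after the
> >collapse that everything expected is still there:
> >
> >```stata
> >* keep a copy of the full data so the collapse can be undone
> >preserve
> >collapse (sum) var1 (mean) var2 (median) var3, by(year)
> >* stops with an error if any listed variable is missing
> >confirm variable year var1 var2 var3
> >* yearly.dta is a hypothetical file name
> >save yearly, replace
> >* return to the uncollapsed data
> >restore
> >```
> >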
> >Finally, there will be variables which, once you've collapsed, won't make
> >sense anymore in the new dimension because the new "observations" have
> >changed.  For example: if I have one line per person in a dataset, and
> >each person can be classified into a group using values 1 to 5, if I try
> >to collapse the group variable, it won't keep the values for everybody
> >because the new dataset will have been collapsed and each individual
> >observation lost in that sense, unless I've asked it to collapse
> >individuals into their group categories, in which case the end result
> >will be a dataset with five observations:
> >"collapse (sum) population, by(group)"
> >will give me something like:
> >group    population
> >1              439
> >2           12,000
> >3            ....,   etc.
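> >
> >A toy version of this, with invented rows chosen so that the sums match
> >the 439 and 12,000 above (only two of the five groups are shown, and the
> >id variable is hypothetical):
> >
> >```stata
> >clear
> >input id group population
> >1 1 200
> >2 1 239
> >3 2 7000
> >4 2 5000
> >end
> >* one row per group; id is not named in the collapse, so it is dropped
> >collapse (sum) population, by(group)
> >* group 1 sums to 439 and group 2 to 12000
> >list
> >```
> >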
> >These are the most common mistakes I make that get me in trouble with the
> >commands and by which I lose variables.  But if you send a bit more
> >detail, I may be able to help you a bit more.
> >Good luck!
> >Julia A. Gamas
> >  
> >
> >>------------------------------
> >>
> >>Date: Thu, 17 Mar 2005 19:24:37 -0800
> >>From: Zhou YU <zyu@usc.edu>
> >>Subject: st: frustrated by missing variables--collapase and merge
> >>
> >>Hi all,
> >>
> >>I have been trying to collapse and merge a number of variables. What 
> >>frustrates me is that there are always one or two variables missing after 
> >>collapsing or merging. Last night I had to repeat the same procedures 
> >>several times, and it took me the whole evening to create a dataset. 
> >>Interestingly, each time, different variables were missing.
> >>
> >>Has anyone encountered the same problem? Any solutions?
> >>
> >>Thanks a bunch!
> >>
> >>Zhou





