Statalist



st: Re: Unix stata big dataset


From   "Michael Blasnik" <michael.blasnik@verizon.net>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Re: Unix stata big dataset
Date   Thu, 29 Nov 2007 17:03:25 -0500

...

I can't really comment on the CPU and memory usage report, but I would guess that you could save a large fraction of the time for this operation. It would help if you told us more about the joinby you want to do:

1) How many observations are in each of the two files?

2) What type of merge do you need: one-to-one, one-to-many, many-to-one, or many-to-many? Only the last type needs -joinby-.

3) What proportion of the observations in each file do you expect to match? Does the large table contain lots of observations you don't need?

4) Are there any variables you don't need in either file that could be dropped first?
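
Questions 1, 2 and 4 can usually be answered with a few commands before attempting the join. This is only a sketch: the file names small.dta and large.dta, the key variable id, and the variables x1 and x2 are hypothetical stand-ins for your own.

```stata
* Hypothetical names: small.dta, large.dta, key variable id, variables x1 x2.

* 1) Number of observations and variables, without loading the data
describe using large.dta, short
describe using small.dta, short

* 2) Is id unique in the smaller file? If so, -merge- may be enough.
use small.dta, clear
duplicates report id

* 4) Load only the variables you need from the large file
use id x1 x2 using large.dta, clear
```

If -duplicates report- shows no surplus copies of id in at least one of the two files, the join is not many-to-many.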

I think the biggest question is -- Are you sure that you need -joinby- rather than -merge-? Even if you need -joinby-, you may be able to do this much more quickly in three steps: first, save the unique identifiers from the smaller file; then -merge- the large file with that list, using the nokeep option, to grab only the useful observations; finally, go back to the smaller file and do the -joinby- against this subset file.
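
A minimal sketch of that subset-then-joinby idea follows. It assumes a single key variable, and the names id, small.dta and large.dta are hypothetical. It uses the older -merge- syntax, which is the one that has the nokeep option.

```stata
* Hypothetical names: small.dta, large.dta, key variable id.
* Older (pre-Stata 11) -merge- syntax; both files must be sorted by id.

* Step 1: unique identifiers from the smaller file
use id using small.dta, clear
bysort id: keep if _n == 1
save small_ids.dta, replace

* Step 2: keep only large-file observations whose id is in small_ids
use large.dta, clear
sort id
merge id using small_ids.dta, nokeep
keep if _merge == 3
drop _merge
save large_subset.dta, replace

* Step 3: joinby the smaller file against the reduced large file
use small.dta, clear
joinby id using large_subset.dta
```

In Stata 11 or later, the merge in step 2 could instead be written as -merge m:1 id using small_ids.dta, keep(match) nogenerate-, which does not require sorting first and makes the following keep and drop lines unnecessary.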

Also, do you have enough physical memory, and an operating system that can allocate 2GB+ to Stata, for loading the large dataset? If Stata is running in virtual memory, things can be very slow.
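
As a rough illustration of the memory side, assuming a version of Stata before 12 (where memory is set by hand) and a hypothetical file name large.dta:

```stata
* Pre-Stata 12 only; 2g is an illustrative value and should fit in
* physical RAM, otherwise the operating system will start swapping.
set memory 2g
query memory

use large.dta, clear
* Demote variables to smaller storage types where no information is lost
compress
```

If -set memory- fails, the operating system could not allocate that much to Stata, which is a sign that the job will need a smaller dataset or a different machine.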

If you describe more about the data, there may be other approaches that reduce the memory requirements and speed the process.

Michael Blasnik


----- Original Message ----- From: <ncdcta00@uniroma2.it>
To: <statalist@hsphsun2.harvard.edu>
Sent: Thursday, November 29, 2007 4:36 PM
Subject: st: Unix stata big dataset



Dear Statalist,
I have a problem using joinby on two datasets in Unix. One dataset is about 1.8 GB and the other about 30 MB. I want to join these two datasets, but the process is very slow in Unix: after 4 days I still didn't have a final dataset (two months ago I joined two datasets of more or less the same size in only 1 day). I use a do-file that contains my joinby command.
I looked at the processor on the local machine with the top command, and my process is in the sleep state. I am running in batch mode.
11258 franz 1 20 0 0K 0K cpu/0 35.8H 24.41% stata
15566 ncd 1 20 0 0K 0K cpu/1 11:52 23.41% xstata
16084 ncd 1 60 0 0K 0K sleep 27:00 0.22% stata
So, if my process is in the sleep state, does that mean it is not running? There is another user connected; can he influence my process?
Thanks in advance for your help.


