
Re: st: Re: Unix stata big dataset

Subject   Re: st: Re: Unix stata big dataset
Date   Thu, 29 Nov 2007 23:36:18 +0100

I can allocate 15 GB of memory, and there are 18 million observations in total.

I have duplicate observations, but I can't drop them because they are the work spells for each person, and I need them.
The two data sets share the same id, so I need to match on it, but id is not unique in either data set.

So, one data set is:
id  x1  x3    x4 ...
1   0   1991  1998
1   1   1991  1998
1   2   1999  1999
and the second is:
id  y1  y2  y3
1   34  2   35
1   34  2   67
1   34  1   68

The idea is to keep all the people whose id appears in both data sets, to obtain a data set with these variables:
id x1 x2 x3 y1 y2 y3
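As a small illustration of what -joinby- produces on data shaped like the rows above, here is a sketch for a do-file. The variable names x1 x2 x3 and the temporary file name are assumptions made for the example, not taken from the real files.

```stata
* Toy version of the first data set (variable names are assumed)
clear
input id x1 x2 x3
1 0 1991 1998
1 1 1991 1998
1 2 1999 1999
end
tempfile spells
save `spells'

* Toy version of the second data set
clear
input id y1 y2 y3
1 34 2 35
1 34 2 67
1 34 1 68
end

* Form all pairwise combinations of observations within each id
joinby id using `spells'
count
list, sepby(id)
```

Because id 1 has three rows in each file, every x row is paired with every y row, giving 3 x 3 = 9 observations. That multiplication is why a many-to-many join on 18 million observations can become very large and slow.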

Sorry, I don't understand the last part of your email, about how to do the merge.
Thanks a lot for your help.

Quoting Michael Blasnik:


I can't really comment on the cpu and memory usage report but I would
guess that you could save a large fraction of the time for this
operation if you told us more about the joinby you want to do:

1) How many observations are in each of the two files?

2) What type of merge do you need: one-to-one, one-to-many,
many-to-one, or many-to-many?  Only the last type needs -joinby-.

3) What proportion of the observations in each file do you expect to
match? Does the large table contain lots of observations you don't need?

4) Are there any variables you don't need in either file that could be
dropped first?
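Several of these questions can be answered without loading the full files. The following is a rough sketch for a do-file; the file names large.dta and small.dta and the key variable id are hypothetical stand-ins for the real ones.

```stata
* Hypothetical file names: large.dta (about 1.8 GB), small.dta (about 30 MB)

* 1) Observation and variable counts, read from the file headers only
describe using large, short
describe using small, short

* 2) Is id unique in the small file? Load only the key variable.
use id using small, clear
duplicates report id
capture isid id
display _rc        // 0 means id uniquely identifies the observations

* Same check for the large file
use id using large, clear
duplicates report id
```

If id turns out to be unique in one of the files, a one-to-many or many-to-one -merge- is enough and -joinby- is not needed.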

I think the biggest question is -- Are you sure that you need -joinby-
rather than -merge-?  Even if you need joinby, you may be able to do
this much more quickly by first subsetting unique identifiers of the
smaller file, then -merge- with the nokeep option to grab the useful
observations in the large file and then go back to the smaller file to
do a joinby on this subset file.
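One possible reading of this subset-then-join idea, sketched for a do-file. The file names small.dta, large.dta and large_subset.dta are hypothetical, and the -merge- line uses the Stata 11 and later syntax, where keep(match) stands in for the older nokeep option followed by keeping only matched observations.

```stata
* Step 1: unique ids from the smaller file (hypothetical name: small.dta)
use id using small, clear
duplicates drop                 // only id is loaded, so this leaves one row per id

* Step 2: pull only the matching observations out of the large file
merge 1:m id using large, keep(match) nogenerate
save large_subset, replace

* Step 3: many-to-many join of the small file against the reduced file
use small, clear
joinby id using large_subset
```

The final -joinby- then only has to pair observations for ids that exist in both files, which may be far fewer than the 18 million observations in the original.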

Also, do you have enough physical memory and an operating system that
can allocate 2GB+ to Stata for loading the large dataset?  If you are
using virtual memory things can be very slow.
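One way to lower the memory footprint before joining, shown as a sketch with hypothetical file names, is to let Stata store each variable in the smallest type that holds its values without loss:

```stata
* Hypothetical file names; run once for each file before the join
use large, clear
compress                 // e.g. demotes double to a smaller type where lossless
save large_small, replace
```

How much this saves depends entirely on how the variables are currently stored, so it may or may not make a noticeable difference here.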

If you describe more about the data, there may be other approaches that
reduce the memory requirements and speed the process.

Michael Blasnik

----- Original Message -----
From: <>
To: <>
Sent: Thursday, November 29, 2007 4:36 PM
Subject: st: Unix stata big dataset

Dear Statalist,
I have a problem using joinby on two datasets under Unix. One dataset is about 1.8 GB and the other about 30 MB. I want to join the two, but the process is very slow under Unix: after 4 days I still didn't have a final dataset (two months ago I joined two datasets of more or less the same size in only 1 day). I use a do-file that contains my joinby command.
I looked at the processes on the local machine with the command top, and my process is in the sleep state. I am running Stata in batch mode.
11258 franz 1 20 0 0K 0K cpu/0 35.8H 24.41% stata
15566 ncd 1 20 0 0K 0K cpu/1 11:52 23.41% xstata
16084 ncd 1 60 0 0K 0K sleep 27:00 0.22% stata
So, if my process is in the sleep state, does that mean it is not running? There is another user connected; can he affect my process?
Thanks in advance for your help.

Catia Nicodemo
