



Re: st: AW: combining datasets


From   Anders Alexandersson <andersalex@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: AW: combining datasets
Date   Thu, 19 Aug 2010 14:07:21 -0400

Martin, by "large" datasets I meant datasets with many variables rather
than many observations.
For -append-, a common problem is variable names that differ across datasets.
For -merge-, the problem is non-key variables that share a name across datasets.
With many variables in each dataset, say thousands, the amount of
preparatory work needed before combining can be large. In the example
dataset given by Maarten, the -append- and -merge- solutions differed by
only one line.
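
To give a sense of that preparatory work, here is a minimal sketch for the -merge- case. It assumes two hypothetical files in which id identifies the same units and the non-key variables x and y clash by name; the data and the suffixes _a and _b are made up for illustration. Without the renaming, -merge- would by default keep the master's values of x and y for matched observations.

```stata
// hypothetical toy data: same ids, same non-key names in both files
tempfile a b

clear
input id x y
1 3 10
2 4 20
end
save `a'

clear
input id x y
1 5 30
2 6 40
end
// suffix every non-key variable so the names differ from those in `a'
foreach v of varlist _all {
    if "`v'" != "id" {
        rename `v' `v'_b
    }
}
save `b'

use `a', clear
foreach v of varlist _all {
    if "`v'" != "id" {
        rename `v' `v'_a
    }
}

// the non-key names no longer clash, so both sets of values survive
merge 1:1 id using `b', nogenerate
list
```

For -append- the loop would go the other way: rename variables so that the same measurement carries the same name in every file before stacking.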

Anders

On Thu, Aug 19, 2010 at 12:30 PM, Martin Weiss <martin.weiss1@gmx.de> wrote:
>
>
> "The choice between append and merge is more important for large datasets
> because you need the right variable naming scheme."
>
>
>
> I do not really understand the meaning of this sentence. Why would the
> situation change given the size of the dataset at hand?
>
> -append- and -merge- are not slight variations of each other, IMHO. The
> manual entry for -merge- does make clear the many variations _within_
> -merge- itself, but the choice between -append- and -merge- is more
> fundamental still...
>
> Also note [D], p. 397:
>
> "merge is for adding new variables from a second dataset to existing
> observations. You use merge, for instance, when combining hospital patient
> and discharge datasets. If you wish to add new observations to existing
> variables, then see [D] append. You use append, for instance, when adding
> current discharges to past discharges."
>
>
> HTH
> Martin
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Anders
> Alexandersson
> Sent: Thursday, 19 August 2010 17:56
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: AW: combining datasets
>
> Martine,
>
> Also see [U] 22 Combining datasets. Maarten provided an excellent
> append solution with this being the main line:
> . append using `a'
>
> Here is the equivalent merge solution:
> . merge 1:1 source id using `a', nogen
>
> The choice between append and merge is more important for large
> datasets because you need the right variable naming scheme.
> Michael Mitchell gave a good tip in his data management book described
> at http://www.stata.com/bookstore/dmus.html :
> If you will append datasets, you want the variable names to be the same,
> but if you will merge datasets, you want the variable names to be different.
>
> Anders Alexandersson
> andersalex@gmail.com
>
> On Thu, Aug 19, 2010 at 4:34 AM, Maarten buis <maartenbuis@yahoo.co.uk>
> wrote:
>> --- On Wed, 18/8/10, martine etienne wrote:
>>> firstly, person 1 in dataset A is NOT same person as person
>>> 1 in dataset B, measurements are also taken at different times
>>> secondly, I would like the final dataset to look like Final 1
>>
>> Here is an example of how to do that:
>>
>> *------------ begin example ------------
>> // create the two datasets
>> tempfile a b
>>
>> drop _all
>> input id x
>> 1  3
>> 2  4
>> end
>> save `a'
>>
>> drop _all
>> input id x
>> 1  5
>> 2  6
>> end
>> save `b'
>>
>> // create a new variable in each dataset
>> // that identifies the source of those
>> // observations
>> use `a'
>> gen source = "a"
>>
>> save `a', replace
>>
>> use `b'
>> gen source = "b"
>> save `b', replace
>>
>> // use -append- to stack the datasets
>> append using `a'
>>
>> // create an extra id variable, which contains
>> // a unique integer for each source-id combination
>> // and attaches the values of the source and id
>> // variables to the value label
>> egen long new_id = group(source id), label
>>
>> // for display purposes I put the three id variables
>> // to the left of the dataset
>> order id source new_id
>>
>> // display the result
>> list
>> *--------------- end example ----------------
>> (For more on examples I sent to the Statalist see:
>> http://www.maartenbuis.nl/example_faq )
>>
>> Hope this helps,
>> Maarten
>>
>> --------------------------
>> Maarten L. Buis
>> Institut fuer Soziologie
>> Universitaet Tuebingen
>> Wilhelmstrasse 36
>> 72074 Tuebingen
>> Germany
>>
>> http://www.maartenbuis.nl
>> --------------------------
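
For reference, the two one-line solutions discussed in this thread can be compared side by side. This is only a sketch that reuses the toy data from the quoted example; the source variable is created before the first save, which is a small rearrangement for brevity rather than part of the original code. Because no (source, id) pair occurs in both files, every observation is unmatched in the -merge- and is simply carried over, so both versions should contain the same four observations, possibly in a different order.

```stata
// same toy data as in the quoted example
tempfile a b

clear
input id x
1 3
2 4
end
gen source = "a"
save `a'

clear
input id x
1 5
2 6
end
gen source = "b"
save `b'

// -append- version: stack `a' under `b'
use `b', clear
append using `a'
list

// -merge- version: keys are source and id, and no pair matches
use `b', clear
merge 1:1 source id using `a', nogenerate
list
```

Note that this equivalence depends on the keys never matching. If the same (source, id) pair appeared in both files, -merge- would combine those observations into one, whereas -append- would keep both.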


