
Re: st: Data management options


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Data management options
Date   Thu, 06 Feb 2003 08:59:26 -0600

Glenn Hoetker <ghoetker@uiuc.edu> writes, 

> I have an issue with large datasets, and am hoping for some advice on how to
> best handle it.  To simplify the issue somewhat, [...]
> 
> [...] doing this in Stata is proving more challenging.  The improved merge
> command in version 8 helps a bit, but I'm still having to rename variables
> repeatedly, save interim datasets, and sort large datasets in different
> ways.  [...]

Let me set up Glenn's problem and show how I would go about solving it.
I am not sure this will be helpful, because it may be just what Glenn
has already done.


Description of problem
----------------------

We have two datasets, containing

        PATIENTS.dta
            variables:   patno          x1               x2 ...


        CITATIONS.dta
            variables:   patno_citing   patno_cited 

To do:  Create new dataset containing

        COMBINED.dta:
            variables    patno x1 x2 ... patno_citing x1_citing x2_citing ...


Modification of problem
-----------------------

Rather than creating COMBINED.dta, we will create 

        UNCITED.dta
            variables    patno x1 x2 ...
        
        CITED.dta
            variables    patno x1 x2 ... patno_citing x1_citing x2_citing ...

These two datasets, -append-ed together, will be equal to COMBINED.dta.
Keeping them separate will save a little memory, if that matters, because
the uncited patents need not carry the empty _citing variables.
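
If the single combined dataset is wanted after all, the append could be
sketched as follows.  This assumes CITED.dta and UNCITED.dta already exist
in the current directory, as produced by the solution in this message; the
observations from UNCITED.dta will have missing values in patno_citing,
x1_citing, x2_citing, ...

        . use CITED
        . append using UNCITED
        . save COMBINED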


Solution
--------
        // Step 1:  make UNCITED.dta and 
        //          make TMP_CITED.dta = [PATIENTS.dta] w/ var patno_citing

        . use CITATIONS
        . sort patno_cited
        . rename patno_cited patno
        . save TMP1  

        . use PATIENTS
        . sort patno 
        . merge patno using TMP1, nokeep
        . save TMPRES

        . keep if _merge==1
        . drop _merge 
        . save UNCITED

        . use TMPRES 
        . drop if _merge==1
        . drop _merge
        . save TMP_CITED

        . erase TMPRES.dta
        . erase TMP1.dta


        // Step 2:
        // take TMP_CITED.dta = [PATIENTS.dta] w/ var patno_citing
        // and merge to add x1_citing, x2_citing, ...

        . use PATIENTS
        . rename x1 x1_citing
        . rename x2 x2_citing
        . ...
        . rename patno patno_citing
        . sort patno_citing
        . save TMP2

        . use TMP_CITED
        . sort patno_citing
        . merge patno_citing using TMP2, nokeep

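        // every citing patent must itself be in PATIENTS.dta;
        // if one is not, the assert stops with an error
        . assert _merge==3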
        . drop _merge

        . save CITED
        . erase TMP2.dta
        . erase TMP_CITED.dta
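
When PATIENTS.dta has many x variables, the renames in Step 2 need not be
typed one by one.  A minimal sketch, written as it might appear in a
do-file, and assuming every variable in PATIENTS.dta (patno included)
should get the _citing suffix and that no resulting name exceeds Stata's
limit on the length of variable names:

        use PATIENTS, clear
        // varlist _all is expanded once, before any rename takes place
        foreach v of varlist _all {
                rename `v' `v'_citing
        }
        // patno is now patno_citing, x1 is x1_citing, and so on
        sort patno_citing
        save TMP2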

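For readers on later releases, here is a hedged sketch of Step 1 in the
newer merge syntax.  It assumes Stata 11 or later, where -merge- takes
1:m and keep() and no longer requires the data to be sorted; it is not
the syntax of version 8.  The tempfile name cites is made up for
illustration.

        use CITATIONS, clear
        rename patno_cited patno
        tempfile cites
        save `cites'

        use PATIENTS, clear
        // 1:m also verifies that patno is unique in PATIENTS.dta;
        // keep(master match) plays the role of nokeep
        merge 1:m patno using `cites', keep(master match)

        preserve
        keep if _merge==1
        drop _merge
        save UNCITED
        restore

        drop if _merge==1
        drop _merge
        save TMP_CITED

Because the citations go into a temporary file, there is no TMP1.dta or
TMPRES.dta to erase afterward.
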
-- Bill
wgould@stata.com