Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Data Manipulation Question


From   Phil Schumm <[email protected]>
To   <[email protected]>
Subject   Re: st: Data Manipulation Question
Date   Thu, 19 Sep 2013 21:15:23 -0500

On Sep 19, 2013, at 10:24 AM, Alex Warofka <[email protected]> wrote:
> I'm trying to perform what seems like a relatively simple data manipulation task on a very large dataset (~20GB with 20 million observations), but having some difficulties wrapping my head around the best way to do so using Stata.


The calculation you describe below requires 3 variables: individual ID, employer ID and quarter.  If we (generously) allow 24-character strings for each ID and assume you have quarter stored as a double, then 20 million observations should be slightly over 1GB.  So, while the dataset is large, you shouldn't have to do anything special to accommodate the size from a storage perspective.


> I have four variables—individual ID (not unique because one individual can work for multiple employers in the same period), employer EIN, and quarter—and am attempting to flag events where >=80% of the employees working at a given EIN in quarter 1 move to the same different EIN in quarter 2 AND >=80% of the employees working at such an EIN in quarter 2 came from the same different EIN in quarter 1. In essence, the goal is to flag spurious transition events where an employer appears to change but in fact only their EIN has changed. This is the same procedure used in building the successor-predecessor file for the QWI and described in Census technical paper TP-2006-01.
> 
> My initial thought was to use levelsof and loop over EINs, pulling a local macro containing the IDs of employees for each EIN, then looping through these employees to see where they are working in Q2 and so on. This doesn't work as I run into the 67,784 character macro length limit. Splitting the dataset by quarter, merging, and then using _merge to track individual movements between firms doesn't work as my IDs are not unique.


Note that as of Stata 13 (SE), the maximum length of a macro is 1,081,511.  However, that's not needed for this problem.


> Does anyone have any recommendations for handling this in Stata? At this point, I'm becoming tempted to just write a Ruby script to do this, but would be thrilled to discover it was possible in Stata.


No need to resort to Ruby for this.  Here is a sketch of one possible strategy (it may not be the fastest, but may well be adequate in your case).  Assume that your variables are called ind (individual ID), emp (employer ID) and q (quarter):


    isid ind q, so
    by ind (q): gen emp1 = emp[_n-1] if emp[_n]!=emp[_n-1] & _n>1
    by ind (q): gen emp2 = emp[_n+1] if emp[_n]!=emp[_n+1] & _n<_N
    by ind (q): gen q1 = q[_n-1] if _n>1 & !mi(emp1)
    by ind (q): gen q2 = q[_n+1] if _n<_N & !mi(emp2)
    
    bys emp q: gen esize = _N
    bys emp q emp1 emp2 q1 q2: gen pswitch = _N/esize
    drop if mi(emp1) & mi(emp2)
    
    preserve
        keep q emp emp1 q1 pswitch
        ren (q emp pswitch) =2
        duplicates drop
        tempfile recipients
        save `"`recipients'"'
    restore
    
    keep q emp emp2 q2 pswitch
    ren (q emp pswitch) =1
    duplicates drop
    merge 1:1 emp1 emp2 q1 q2 using `"`recipients'"', keep(match)


This will generate a dataset in which each row represents a specific type of switch, as indicated by q1 and q2 (i.e., the quarters in which the switch began and in which it ended, respectively) and emp1 and emp2 (the original employer and the new employer, respectively).  In addition, the vars pswitch1 and pswitch2 represent the proportion of employees at the first employer who initiated the switch, and the proportion that they represent at the second employer after the switch has occurred.  Thus, according to the specification you described above, you would be interested in those observations for which


    q2 = q1 + 1 & pswitch1>=float(0.8) & pswitch2>=float(0.8)


-- Phil


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index