Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: RE: Data Manipulation Question

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: RE: Data Manipulation Question
Date	Thu, 19 Sep 2013 18:31:14 +0000

Alex,

Very interesting problem...

First, I assume when you say that the individual ID is "not unique" you mean that there are multiple records for the same people at different time points, not that there are ID numbers that are shared my more than one person?  With that assumption, I have some thoughts:

1. Do you have exact dates or just quarter?  If you have dates, or even a sequence within quarter, for each employee you could save the last EIN in Q1 and the first EIN in Q2, thus making the employee IDs unique and allowing a 1:1 merge.

2. One way to deal with the non-uniqueness is to use -joinby-, which produces an observation for all possible combinations of EINS observed between the two quarters.  Unless a lot of people are changing jobs a lot of times, this shouldn't generate a lot of extra data and the result will allow you to look for certain high-frequency EIN pairs, just as you were hoping to do with -merge-. 

3. Alternatively, depending on the magnitude of non-unique IDs, you might even be able to arbitrarily delete duplicates and still have enough data to be able to see the patterns you are looking for.

4. Depending on how much job movement there is (i.e., if there aren't huge layoffs or huge hiring binges in a single quarter), you may even be able to get away with looking at frequencies by EIN and looking to see how much those frequencies change from quarter to quarter.  If an employer goes from X employees in Q1 to X*0.2 in Q2, you could flag that as suspicious, then look for cases where an employer goes from Y employees in Q1 to Y*5 in Q2, and then try to match them up one-by-one by matching the frequency changes or, if necessary, by individual IDs.  In other words, perhaps you don't necessarily need to track individuals in order to see patterns.

Sorry, no great insights, just a few thoughts....

Regards,
Joe Canner
Johns Hopkins University School of Medicine

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Alex Warofka
Sent: Thursday, September 19, 2013 11:25 AM
To: statalist
Subject: st: Data Manipulation Question

Hi,

I'm trying to perform what seems like a relatively simple data
manipulation task on a very large dataset (~20GB with 20 million
observations), but having some difficulties wrapping my head around
the best way to do so using Stata.

I have four variables—individual ID (not unique because one individual
can work for multiple employers in the same period), employer EIN, and
quarter—and am attempting to flag events where >=80% of the employees
working at a given EIN in quarter 1 move to the same different EIN in
quarter 2 AND >=80% of the employees working at such an EIN in quarter
2 came from the same different EIN in quarter 1. In essence, the goal
is to flag spurious transition events where an employer appears to
change but in fact only their EIN has changed. This is the same
procedure used in building the successor-predecessor file for the QWI
and described in Census technical paper TP-2006-01.

My initial thought was to use levelsof and loop over EINs, pulling a
local macro containing the IDs of employees for each EIN, then looping
through these employees to see where they are working in Q2 and so on.
This doesn't work as I run into the 67,784 character macro length
limit. Splitting the dataset by quarter, merging, and then using
_merge to track individual movements between firms doesn't work as my
IDs are not unique.

Does anyone have any recommendations for handling this in Stata? At
this point, I'm becoming tempted to just write a Ruby script to do
this, but would be thrilled to discover it was possible in Stata.

Thanks,

-- 
Alex Warofka
Research Associate | California Center for Population Research, UCLA
[email protected] | nomad.cm | @AlexWarofka

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: RE: Data Manipulation Question
  - From: Robert Picard <[email protected]>

References:
- st: Data Manipulation Question
  - From: Alex Warofka <[email protected]>

Prev by Date: Re: st: difficulty in understanding link function in glm
Next by Date: Re: Re: st: Predicted probabilities for logistic regression after svy
Previous by thread: st: Data Manipulation Question
Next by thread: Re: st: RE: Data Manipulation Question
Index(es):
- Date
- Thread