Statalist The Stata Listserver

RE: st: [merging US industry level data]

From   "White, Justin" <[email protected]>
To   <[email protected]>
Subject   RE: st: [merging US industry level data]
Date   Fri, 29 Sep 2006 09:06:02 -0400

Looking at the NAICS codes, you would think you can aggregate up,
but there are problems with this approach.  I am assuming that
Rohit is using QCEW (Quarterly Census of Employment and Wages) data from
BLS.  I would strongly recommend against aggregating up.  That is why BLS
releases data at the various NAICS levels.  There are rounding issues
and, most importantly, there are disclosure issues.  The disclosure
rules protect small industries where there are fewer than 3 firms in a
particular NAICS in a particular geographic location, or where 1 firm has
(I believe) 70% of the employment in that NAICS.  Cells suppressed at the
4-digit level can therefore make the published 4-digit figures sum to
less than the published 3-digit figure.

This is why it is crucial to download both 3-digit and 4-digit NAICS data
and then append them together.  I have written a few do-files that
do this.  Let me know if you would like to look at them.
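
A minimal sketch of the append-plus-indicator idea, not the do-files mentioned above. The variable names (year, naics, emp, naics_digits, is_4digit) and the toy numbers are made up for illustration; real QCEW extracts will have different names and many more fields.

```stata
* Sketch only: toy 3-digit and 4-digit extracts with hypothetical variable names
clear
input int year int naics long emp
2005 111 1000
2005 112  800
end
gen byte naics_digits = 3
tempfile three
save `three'

clear
input int year int naics long emp
2005 1111 400
2005 1112 350
2005 1121 500
end
gen byte naics_digits = 4

* Stack the two levels; append does not need a common key variable
append using `three'

* Dummy: 1 if the observation is published at the 4-digit level, 0 if 3-digit
gen byte is_4digit = (naics_digits == 4)
sort year naics_digits naics
list, sepby(naics_digits)
```

In the toy data the 4-digit rows under 111 add up to 750 while the 3-digit row reports 1000, which is the kind of gap that suppressed cells can produce.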

Justin White

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Philipp Rehm
Sent: Friday, September 29, 2006 8:58 AM
To: [email protected]
Subject: Re: st: [merging US industry level data]

I am not familiar with NAICS, but a quick glance at it seems to suggest
that it should be possible to aggregate up 4-digit level codes to
3-digit level codes.

For example, code 1111 (oilseed and grain farming) probably reports the 
sum (of whatever) for codes 11111 through 11119 (only the 5-digit 
codes). And the 3-digit code 111 probably reports the sum (of whatever) 
of codes 1111, 1112, 1113, 1114, and 1119.

Depending on what Rohit wants to accomplish, it may or may not make 
sense to merge the two data-sets he has in mind, and it may or may not
make sense to append them (as Justin suggests). Alternatively, it also
may or may not make sense to collapse the data-set with the 4-digit 
level industry variable by a (to be generated) 3-digit level industry 
variable (here one would need to be very careful to avoid

As I said, I neither know the data-sets nor the industry classification 
(nor what Rohit wants to accomplish), so I cannot tell whether a merge 
makes more sense than an append.
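
A minimal sketch of the collapse route mentioned above, on made-up data. It assumes integer 4-digit codes in a variable called industry_4d and a hypothetical measure emp; the suppressed and n_suppressed variables are illustrative additions, included because suppressed cells (see Justin's point about disclosure) show up as missing values.

```stata
* Sketch only: toy 4-digit data with hypothetical variable names
clear
input int year int industry_4d long emp
2005 1111 400
2005 1112 350
2005 1121 500
2006 1111 410
2006 1112 .
end

* Derive the 3-digit code by keeping the first three digits
gen int industry_3d = real(substr(string(industry_4d),1,3))

* Flag suppressed (missing) cells before collapsing
gen byte suppressed = missing(emp)

* collapse (sum) ignores missing values, so carry along a count of them
collapse (sum) emp n_suppressed = suppressed, by(year industry_3d)
list, sepby(year)
```

Any row with n_suppressed greater than zero is a sum over incomplete data and should not be read as the true 3-digit total.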


White, Justin wrote:
> I would not use a merge.  Merge requires that there be a common variable
> between datasets.  I use this same data a lot.  Since 3-digit and
> 4-digit NAICS data are not "similar", I would append the 3-digit data
> to the 4-digit data and create a dummy variable indicating whether the
> observation is associated with 3-digit or 4-digit.
> Hope this helps.
> Justin White
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Philipp
> Sent: Friday, September 29, 2006 8:41 AM
> To: [email protected]
> Subject: Re: st: [merging US industry level data]
> You don't give a whole lot of information about your data-set, but
> there are a few things that can be said.
> 1) You need to generate the same industry level variable in both 
> data-sets, i.e. you need to generate a 3-digit level industry code 
> inside the data-set with the 4-digit level data (let's call this
> the 'master' data-set).
> It is not clear how the 4-digit and the 3-digit industry variables 
> relate to each other, but let's assume that you can simply cut off the
> last digit of the 4-digit variable to derive the 3-digit variable, i.e.
> codes 1230 to 1239 at the 4-digit level correspond with code 123 at
> the 3-digit level.
> Assuming this, as well as that your 4-digit level industry variable is
> coded in integers (and called industry_4d), you could get the 3-digit 
> level variable with something like this:
> gen int industry_3d = real(substr(string(industry_4d),1,3))
> In your other data-set, you also need to have a variable that is called
> "industry_3d" (and you need to make sure that it is equivalently coded,
> of course - which I assumed above).
> 2) Depending on what type of merge you want to do, you probably need to
> sort both data-sets by the identifier variables (the variables you want
> to merge on). Assuming you want to merge on, say, "year" and 
> "industry_3d", you would need to sort both data-sets by "year
> industry_3d".
> 3) Then you can merge, along the following lines:
> use master.dta, clear
> merge year industry_3d using using.dta
> (where the data-set with the original 3-digit level industry
> variable is called "using.dta").
> HTH,
> Philipp
> Rohit wrote:
>> hi there,
>> mine is a very preliminary question. i am working with the US industry
>> level data and i want to merge the variables of 4-digit level
>> industries to 3-digit and also create a variable for 3-digit.
>> could anybody help me with that?
>> thanks
>> rohit