Statalist The Stata Listserver


Re: st: [merging US industry level data]

From   Philipp Rehm
Subject   Re: st: [merging US industry level data]
Date   Fri, 29 Sep 2006 14:58:06 +0200

I am not familiar with NAICS, but a quick glance at the classification suggests that it should be possible to aggregate 4-digit codes up to 3-digit codes. For example, code 1111 (oilseed and grain farming) probably reports the sum (of whatever) for codes 11111 through 11119 (only the 5-digit codes). And the 3-digit code 111 probably reports the sum (of whatever) of codes 1111, 1112, 1113, 1114, and 1119.

Depending on what Rohit wants to accomplish, it may or may not make sense to merge the two data-sets he has in mind, and it may or may not make sense to append them (as Justin suggests). Alternatively, it may or may not make sense to collapse the data-set with the 4-digit level industry variable by a (to be generated) 3-digit level industry variable (here one would need to be very careful to avoid double-counting).
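
To make the collapse option concrete, here is a minimal sketch on toy data. The variable names (naics4, year, value) are made up for illustration, and it assumes the file contains only 4-digit rows and that the 3-digit code is simply the 4-digit code without its last digit:

```stata
* Toy data; naics4, year and value are hypothetical names
clear
input int naics4 int year double value
1111 2005 10
1112 2005 20
1119 2005 5
1121 2005 7
end

* Guard against double-counting: stop if any row is not a 4-digit code
* (e.g. a 3-digit subtotal row that slipped into the file)
assert inrange(naics4, 1000, 9999)

* Assumption: dropping the last digit gives the 3-digit code
gen int naics3 = floor(naics4/10)

* Sum the 4-digit values within each year and 3-digit industry
collapse (sum) value, by(year naics3)
list, clean
```

If the data-set already contains 3-digit subtotals alongside the 4-digit rows, the assert fails, and those rows would have to be dropped before collapsing.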

As I said, I neither know the data-sets nor the industry classification (nor what Rohit wants to accomplish), so I cannot tell whether a merge makes more sense than an append.


White, Justin wrote:

I would not use a merge. Merge requires there be a common variable
between datasets. I use this same data a lot. Since 3-digit and
4-digit NAICS data are not "similar", I would append the 3-digit data to
the 4-digit data and create a dummy variable indicating whether the
observation is associated with 3-digit or 4-digit.
Hope this helps.

Justin White
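
A minimal sketch of the append approach described above, again on toy data. The variable names (naics, year, value, digits) are hypothetical, and the indicator relies on the assumption that every 4-digit code is 1000 or larger and every 3-digit code is below 1000:

```stata
* Toy 3-digit data; naics, year and value are hypothetical names
clear
input int naics int year double value
111 2005 35
112 2005 7
end
tempfile three
save `three'

* Toy 4-digit data with the same variable names
clear
input int naics int year double value
1111 2005 10
1112 2005 20
end

* Stack the 3-digit rows under the 4-digit rows
append using `three'

* Indicator for the level of each observation (3 or 4 digits)
gen byte digits = cond(naics >= 1000, 4, 3)
tabulate digits
```

Recent versions of Stata also accept a generate() option on append that records which file each observation came from, which may be more convenient than deriving the indicator from the code itself.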

-----Original Message-----
On Behalf Of Philipp Rehm
Sent: Friday, September 29, 2006 8:41 AM
Subject: Re: st: [merging US industry level data]

You don't give a whole lot of information about your data-set, but there are a few things that can be said.

1) You need to generate the same industry level variable in both data-sets, i.e. you need to generate a 3-digit level industry code inside the data-set with the 4-digit level data (let's call this data-set the 'master' data-set).
It is not clear how the 4-digit and the 3-digit industry variables relate to each other, but let's assume that you can simply cut off the last digit of the 4-digit variable to derive the 3-digit variable (e.g., codes 1230 to 1239 at the 4-digit level correspond to code 123 at the 3-digit level).

Assuming this, as well as that your 4-digit level industry variable is coded in integers (and called industry_4d), you could get the 3-digit level variable with something like this:

gen int industry_3d = real(substr(string(industry_4d),1,3))

In your other data-set, you also need to have a variable that is called "industry_3d" (and you need to make sure that it is equivalently coded, of course - which I assumed above).

2) Depending on what type of merge you want to do, you probably need to sort both data-sets by the identifier variables (the variables you want to merge on). Assuming you want to merge on, say, "year" and "industry_3d", you would need to sort both data-sets by "year" and "industry_3d".

3) Then you can merge, along the following lines:
use master.dta, clear
merge year industry_3d using using.dta

(where the data-set with the original 3-digit level industry level variable is called "using.dta").
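
For readers on Stata 11 or later, where merge takes an explicit match type and no longer requires pre-sorted data, steps 1 to 3 can be sketched as follows. This is only an illustration on toy data: the variables output and tariff are invented, the tempfile stands in for "using.dta", and it assumes each year/industry_3d pair appears once in the 3-digit data-set:

```stata
* Toy "using" data at the 3-digit level; tariff is a hypothetical variable
clear
input int industry_3d int year double tariff
123 2005 1.5
124 2005 2.0
end
tempfile using3d
save `using3d'

* Toy "master" data at the 4-digit level; output is a hypothetical variable
clear
input int industry_4d int year double output
1230 2005 10
1235 2005 20
1241 2005 5
end

* Step 1: derive the 3-digit code by dropping the last digit
* (for 4-digit integer codes this matches the substr() approach)
gen int industry_3d = floor(industry_4d/10)

* Steps 2-3: several 4-digit rows map to one 3-digit row per year,
* hence a many-to-one merge
merge m:1 year industry_3d using `using3d'

* Check which observations found a match
tabulate _merge
```

Inspecting _merge afterwards shows whether any 4-digit industries failed to find a 3-digit counterpart, which would suggest the two codings are not equivalent after all.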


Rohit wrote:

hi there,
mine is a very preliminary question. i am working with the US industry
data and i want to merge the variables of 4-digit level industries to
and also create a variable for 3-digit.
could anybody help me with that?