
How do you define group characteristics in your data in order to create subsets?

Title   Efficiently defining group characteristics to create subsets
Author Christopher F. Baum, Boston College

Say that your cross-sectional dataset contains microdata—a record for each employee, for instance—and you want to associate each employee's workplace with an industry code. That information is not on the record but is available to you. How do you get this associated information (which might also be, e.g., the code for a specific pension plan or the state) on the record without manual editing or a long sequence of statements with if clauses? The latter method is perhaps familiar to users of other statistical packages, but there is a better way.

Let us presume that we have a Stata dataset, employee, containing the individual-specific measurements as well as wpid, the workplace ID. Assume that wpid can be treated as an integer; a string code could be handled just as easily.

Create a text file containing two columns: the workplace ID (wpid) and the industry code (indcod). For instance,

        12367  321
        12467  313
        13211  321
        ...    ...
        23435  371
        32156  341

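Read the file into Stata with infile wpid indcod using followed by the name of the text file, sort by wpid, and save the result as the Stata dataset wpchar. Each wpid should appear only once in wpchar, because the merge below relies on wpid uniquely identifying the workplace records.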
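The step above might be sketched as the following do-file fragment. The filename wpchar.txt is an assumption made for illustration (the text does not name the file), and the isid line is an optional extra check that is not part of the original recipe.

```stata
* Sketch only: wpchar.txt is a hypothetical name for the two-column text file
clear
infile wpid indcod using wpchar.txt

* Optional check: stop with an error if any wpid appears more than once
isid wpid

sort wpid
save wpchar, replace
```

The replace option overwrites an existing wpchar.dta, which is convenient when the list of workplaces is revised and the file is rebuilt.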

Now use the employee file and give the commands

        . sort wpid
        . merge m:1 wpid using wpchar
        . tab _merge

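You should find that all employees now have an indcod variable defined. If indcod has missing values, list the wpids for which it is missing (presuming that you have industry codes for all workplaces). These are workplaces that appear in employee but not in wpchar, and tab _merge counts their records among the observations found only in the master dataset (_merge equal to 1). When you are satisfied that the merge has worked properly, type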

        . drop _merge
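The check described above, which has to be run before _merge is dropped, could look like the sketch below. It builds two tiny toy datasets so that it runs on its own; empid, the workplace ID 99999, and the tempfile name wpchar_toy are made up for illustration and are not part of the original example.

```stata
* Toy workplace file (hypothetical values, saved to a temporary file
* so that a real wpchar.dta is not overwritten)
clear
input wpid indcod
12367 321
12467 313
end
tempfile wpchar_toy
save `wpchar_toy'

* Toy employee file; workplace 99999 deliberately has no industry code
clear
input empid wpid
1 12367
2 12467
3 99999
end

merge m:1 wpid using `wpchar_toy'
tabulate _merge

* _merge == 1 marks employee records whose wpid was not found in the using file
list empid wpid if _merge == 1

* Same check phrased in terms of the merged variable
list empid wpid if missing(indcod)

drop _merge
```

Because the fragment uses a tempfile and local macro, it should be run as a do-file (or in one block) rather than typed line by line.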

This is a good example of the power and flexibility of Stata’s merge command. The merge facility does not perform just one-to-one merges; in this example, it performs a many-to-one merge (the m:1 in the command), attaching the single record for each workplace to every employee at that workplace. A clear advantage of this technique appears when you have more than one characteristic to add to each employee record, for instance, an industry code, the number of employees of the firm, and the total sales of the firm. Any number of such firm-level variables could be added to the records in the wpchar file and merged onto the employee file with the same command.
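As a rough illustration of carrying several firm-level variables at once, the sketch below adds two extra columns to a toy workplace file and merges it with the same command. The variable names nemp and sales, the variable empid, and all values other than the wpid and indcod pairs are hypothetical.

```stata
* Toy workplace file with extra firm-level variables
* (nemp, sales, and their values are made up for illustration)
clear
input wpid indcod nemp sales
12367 321 150 2.5
12467 313  40 0.8
end
tempfile wpchar_toy
save `wpchar_toy'

* Toy employee file: two employees at one workplace, one at another
clear
input empid wpid
1 12367
2 12367
3 12467
end

* The merge command is unchanged; every variable in the using file comes along
merge m:1 wpid using `wpchar_toy'
list
drop _merge
```

If only some of the firm-level variables are wanted, merge's keepusing() option can restrict which ones are brought over.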

Unlike an approach that depends on a long list of conditional statements, such as replace indcod=321 if inlist(wpid,12367,13211,...), this approach provides a Stata dataset containing your workplace ID numbers, so that you may easily see whether you have a particular code in your list. It is especially useful if you later revise the list for a new set of workplaces.