Statalist The Stata Listserver

Re: st: RE: RE: Encryption of data

From   William Gould, StataCorp LP
Subject   Re: st: RE: RE: Encryption of data
Date   Wed, 13 Jun 2007 10:54:14 -0500

Hendri Adriaens has a dataset and writes, 

> I want to encrypt only a single variable, to anonimize data.

Here is what I recommend.

Let's call the data actual.dta and assume it has variable uid, which is 
the official user identification number that we want to encrypt.
uid can be a string or numeric, I don't care.  uid might contain

        136980408          recorded as a double or long, or 
        "136-98-408"       recorded as a string, or even 
        "James Smith"      recorded as a string.

In what follows, we will allow repeated values of uid in the
dataset.  What we are going to do is come up with new id numbers, use those,
and lock up the mapping between uid and newid.

Here's step 1:

        . use actual, clear 
        . keep uid
        . sort uid
        . by uid: keep if _n==1

        . set seed _______            <- fill this in with a random number
        . gen double random = uniform()
        . sort random 
        . gen long newid = _n

        . sort uid
        . save mapping, replace
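
Before relying on mapping.dta, it may be worth confirming that it is a
one-to-one lookup table.  The following is only a sketch, written as
do-file lines rather than as an interactive session, and it assumes
mapping.dta was created exactly as in step 1:

```stata
* Sanity checks on the lookup table built in step 1.
use mapping, clear

* Each uid should appear exactly once ...
isid uid

* ... and so should each newid.
isid newid

* newid was generated as _n, so it should run from 1 to the number of
* distinct uid values.
assert inrange(newid, 1, _N)
```

Each of these commands stops with an error if its condition fails, so a
clean run suggests the table is usable for the merge that follows.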

New dataset mapping.dta contains two variables:  uid and the corresponding 
newid.  Next, we fix actual.dta for public consumption:

        . use actual 
        . sort uid 
        . merge uid using mapping
        . assert _merge==3
        . drop _merge uid
        . save actual, replace
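
The merge command above uses the syntax that was current when this was
written.  In Stata 11 or later, the same step could be sketched with the
newer merge syntax as follows.  This is an illustration of an equivalent
approach, not a replacement for the commands above, and it assumes the
same file and variable names:

```stata
* Sketch for Stata 11 or later; merge m:1 does not need sorted data.
use actual, clear

* m:1 because uid may repeat in actual.dta but is unique in mapping.dta.
* assert(match) stops with an error unless every observation matches,
* and nogenerate suppresses the _merge variable.
merge m:1 uid using mapping, assert(match) nogenerate

* Remove the sensitive identifier and overwrite the dataset.
drop uid
save actual, replace
```

Because save, replace overwrites actual.dta, the original identifiers
survive only in mapping.dta after this step.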

Finally, we put mapping.dta in a safe place.  I would write multiple copies 
of mapping.dta on multiple CDs and put the CDs in multiple safes.  Dataset 
mapping.dta contains all the secret information.

Dataset actual.dta no longer contains uid; it contains newid.
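
Someone authorized to hold mapping.dta can reverse the process.  A
minimal sketch, assuming Stata 11 or later and the file names used
above:

```stata
* Re-attach the original uid to the anonymized data.
use actual, clear

* newid may repeat in actual.dta but is unique in mapping.dta.
* Every newid in mapping.dta came from actual.dta, so all
* observations are expected to match.
merge m:1 newid using mapping, assert(match) nogenerate
```

The result is held in memory only; saving it would create a file that
again contains the sensitive identifier.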

-- Bill