Statalist The Stata Listserver



RE: st: RE: RE: Encryption of data


From   "Hendri Adriaens" <[email protected]>
To   <[email protected]>
Subject   RE: st: RE: RE: Encryption of data
Date   Wed, 13 Jun 2007 18:58:58 +0200

Ok, thank you Nick,
-Hendri. 

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Nick Cox
> Sent: Wednesday 13 June 2007 18:32
> To: [email protected]
> Subject: RE: st: RE: RE: Encryption of data
> 
> I am as fond of -duplicates- as my twin, but it 
> is just a convenience command. 
> 
> bysort random : assert _N == 1 
> 
> is a much more direct way of testing that random
> numbers are unique. 
> 
> Nick 
> [email protected] 
> 
> Hendri Adriaens
>  
> > Hi William,
> > 
> > Thanks, that should work, although, as Nick Cox mentioned, there is a
> > tiny probability that you generate the same number twice. So one might
> > need to check for duplicates afterwards and redo the process with a
> > different seed if there are any.
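
The check-and-redo idea could be sketched roughly as follows, combining Nick's -assert- test with a loop over seeds. This is only an illustration for a do-file, not something posted in the thread: the starting seed 20070613 and the local macro names are placeholders, and in practice the seed would presumably be chosen privately and kept with mapping.dta.

```stata
* Sketch only: assumes the data in memory already hold one row per uid,
* as after "by uid: keep if _n==1" in step 1.
local seed 20070613              // placeholder starting seed, not from the thread
local unique 0
while !`unique' {
    set seed `seed'
    capture drop random
    gen double random = uniform()
    * uniqueness test: _rc is 0 only if every value of random occurs once
    capture bysort random: assert _N == 1
    if _rc == 0 {
        local unique 1
    }
    else {
        local ++seed             // try the next seed and draw again
    }
}
display "unique draws obtained with seed `seed'"
```

After the loop ends, the data are sorted by random, so -gen long newid = _n- can follow directly.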
>  
> William Gould, StataCorp LP
> 
> > > Hendri Adriaens <[email protected]> has a dataset and writes, 
> > > 
> > > > I want to encrypt only a single variable, to anonymize data.
> > > 
> > > Here is what I recommend.
> > > 
> > > Let's call the data actual.dta and assume it has variable uid, which is
> > > the official user identification number that we want to encrypt.
> > > uid can be a string or numeric, I don't care.  uid might contain
> > > 
> > >         136980408          recorded as a double or long, or 
> > >         "136-98-408"       recorded as a string, or even 
> > >         "James Smith"      recorded as a string.
> > > 
> > > In what follows, we will allow repeated values of uid in the
> > > dataset.  What we are going to do is come up with new id numbers, use
> > > those, and lock up the mapping from uid to newid.
> > > 
> > > Here's step 1:
> > > 
> > >         . use actual, clear 
> > >         . keep uid
> > >         . sort uid
> > >         . by uid: keep if _n==1
> > > 
> > >         . set seed _______            <- fill this in with a random number
> > >         . gen double random = uniform()
> > >         . sort random 
> > >         . gen long newid = _n
> > > 
> > >         . sort uid
> > >         . save mapping, replace
> > > 
> > > New dataset mapping.dta contains two variables:  uid and the
> > > corresponding newid.  Next, we fix actual.dta for public consumption:
> > > 
> > >         . use actual 
> > >         . sort uid 
> > >         . merge uid using mapping
> > >         . assert _merge==3
> > >         . drop _merge uid
> > >         . save actual, replace
> > > 
> > > Finally, we put mapping.dta in a safe place.  I would write multiple
> > > copies of mapping.dta on multiple CDs and put the CDs in multiple
> > > safes.  Dataset mapping.dta contains all the secret information.
> > > 
> > > Dataset actual.dta no longer contains uid; it contains newid.
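
For readers on Stata 11 or later, where -merge- takes an explicit match type, the second step might be written as below. This is a sketch under that assumption, not part of the original recipe; it uses the same filenames and variable names as the steps above.

```stata
* Sketch for Stata 11 or later; mirrors the second step above
use actual, clear
* many rows per uid in actual.dta, exactly one row per uid in mapping.dta
merge m:1 uid using mapping, assert(match) nogenerate
drop uid
save actual, replace
```

Here assert(match) plays the role of -assert _merge==3-, and nogenerate means there is no _merge variable to drop. The newer syntax also does not require either dataset to be sorted by uid beforehand.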
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 
> 




