Statalist The Stata Listserver



Re: st: RE: RE: Encryption of data


From   [email protected] (William Gould, StataCorp LP)
To   [email protected]
Subject   Re: st: RE: RE: Encryption of data
Date   Wed, 13 Jun 2007 12:25:16 -0500

Since I responded to the question from Hendri Adriaens <[email protected]>,
who wrote of his dataset,

> I want to encrypt only a single variable, to anonimize data.

there has been a flurry of other responses, most focusing on cryptography.  I
worry that someone will think that the "cryptographic" solution is better, so
I want to address that.  In addition, Nick Cox <[email protected]> wrote,
"There is a minute but non-zero chance of ties on numbers drawn using
-uniform()-", which is true, and he went on to worry that this would somehow
undermine what I suggested.


1.  Cryptographic solutions
---------------------------

My solution, also independently suggested by Maarten Buis
<[email protected]>, IS IN FACT a cryptographic solution; it goes under the
name "one-time pad".  In our solution, the pad is applied to ids as a whole,
rather than to the digits and letters that make them up, but that is
irrelevant.  The one-time pad is the strongest cryptographic solution known to
man.  In fact, it can be proven that no stronger solution exists because 
one-time pads CANNOT BE BROKEN!  The only attack available is to steal
the mapping dataset.

The method Maarten and I suggested is not a pure one-time pad, however.
Both of us used Stata's random-number generator, and assumed a seed 
provided by the user.  A real one-time pad would get the random numbers 
from a real random process, not a pseudo-random one.  The pseudo-random 
process is open to attack.
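
To make the procedure concrete, here is a minimal sketch in Python rather
than Stata.  It assumes the method amounts to: draw one random number per
distinct id, sort on it, and use the resulting position as the new id.  The
function name build_mapping and the sample ids are hypothetical.  Passing a
seed gives the pseudo-random variant; passing no seed draws from the operating
system's entropy source through random.SystemRandom, which is closer to, but
not guaranteed to be, a real random process.

```python
import random

def build_mapping(ids, seed=None):
    """Return {original id: new id}, with new ids 1..k in random order."""
    # A seeded generator is the pseudo-random variant; SystemRandom asks the OS.
    rng = random.Random(seed) if seed is not None else random.SystemRandom()
    distinct = sorted(set(ids))
    # One draw per distinct id: the id is mapped as a whole, not per character.
    keyed = [(rng.random(), orig) for orig in distinct]
    # Equal draws (very unlikely) fall back to id order in this sketch.
    keyed.sort()
    return {orig: rank for rank, (_, orig) in enumerate(keyed, start=1)}

original_ids = ["A17", "B02", "A17", "C99", "B02"]  # hypothetical data
mapping = build_mapping(original_ids, seed=20070613)
anonymized = [mapping[i] for i in original_ids]
```

The mapping dictionary plays the role of the mapping dataset: it is the one
thing that has to be kept secret.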

The fact that Maarten and I chose to map entire ids rather than the digits 
and letters in them reduces the chances of success of this kind of attack.
When using pseudo-random numbers, the rule is the fewer, the better.

The biggest weakness in our solution is in the selection of the seed 
by a human.  Humans do not choose randomly among all the integers 
available; they choose among the subset that look more random to
them, and they choose short ones.
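
A rough illustration of why a short seed is the weak point, as a Python
sketch under stated assumptions: the scheme is taken to be "draw one number
per id, sort, use the rank as the new id"; the attacker is assumed to know
that procedure, the original ids, and some (original, new) pairs.  The names
mapping_from_seed and recover_seed, the ids, and the seed 1234 are all
hypothetical.

```python
import random

def mapping_from_seed(ids, seed):
    """Rebuild the id mapping that a given seed would produce."""
    rng = random.Random(seed)
    keyed = [(rng.random(), orig) for orig in sorted(set(ids))]
    keyed.sort()
    return {orig: rank for rank, (_, orig) in enumerate(keyed, start=1)}

def recover_seed(ids, known_pairs, max_seed):
    """Try every seed from 0 to max_seed; return the first consistent one."""
    for seed in range(max_seed + 1):
        candidate = mapping_from_seed(ids, seed)
        if all(candidate[orig] == new for orig, new in known_pairs.items()):
            return seed
    return None

ids = [f"P{n:03d}" for n in range(20)]       # hypothetical ids
secret = mapping_from_seed(ids, 1234)        # a short, human-looking seed
found = recover_seed(ids, secret, max_seed=10_000)
```

Searching ten thousand candidate seeds is trivial; a seed drawn from the full
range of integers the generator accepts makes the same search far more costly.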



2.  Effects of ties from uniform()
----------------------------------

Nick Cox is absolutely right that -uniform()- can produce equal values,
although it is unlikely to do so.  Note that I stored the -uniform()- 
result as a double.  Anyway, Nick Cox is wrong in assuming that those
equal values cause any cryptographic problem.  It is not a problem because
Stata's -sort- algorithm breaks ties randomly unless you specify the -stable-
option, and randomness is exactly what we require.

Now in fact, -sort- breaks ties pseudo-randomly, so (1) applies.
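
The point about ties can be sketched in Python, assuming the new id is the
position after sorting rather than the random number itself.  Python's sort is
stable, so this sketch shuffles first to imitate a sort that breaks ties
randomly; the function name rank_with_random_ties and the deliberately tied
draws are hypothetical.

```python
import random

def rank_with_random_ties(ids, draws, rng):
    """Rank ids by their draws, breaking equal draws in random order."""
    pairs = list(zip(draws, ids))
    rng.shuffle(pairs)               # random relative order for everything
    pairs.sort(key=lambda p: p[0])   # stable sort keeps that order within ties
    return {orig: rank for rank, (_, orig) in enumerate(pairs, start=1)}

rng = random.Random(2007)
ids = ["A17", "B02", "C99", "D41"]   # hypothetical ids
draws = [0.5, 0.5, 0.1, 0.9]         # A17 and B02 are tied on purpose
mapping = rank_with_random_ties(ids, draws, rng)
```

Even with the tie, every id receives a distinct new id; the tie only decides
which of the two tied ids gets 2 and which gets 3, and that choice is itself
random.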

It is true that, if you ran the code Maarten and I suggested twice in a row,
you might get a different mapping, but that doesn't matter.  In fact,
reproducibility is not only not required in most cryptographic situations, it
is not even desirable.

Hendri Adriaens <[email protected]> wrote, 

> [...] as Nick Cox mentioned, there is a tiny probability that you generate
> the same number twice. So, one might need a check afterwards on duplicates
> and redo the process with a different seed if there are.

There is no additional security to be gained by doing that.  Ties do not
matter in this case.


-- Bill
[email protected]