
Re: st: AW: utility to create fake dataset?

From	Jeph Herrin <[email protected]>
To	[email protected]
Subject	Re: st: AW: utility to create fake dataset?
Date	Sun, 08 Nov 2009 15:56:49 -0500


Daljit,

Thank you, yes, data masking is what I want, the key
element being to "replace data with realistic but not
real data."

For instance, a dataset of health insurance claims may contain
obvious identifiers such as name, address, and social security
number (a US national ID, sort of). These can easily be blanked
out, and they aren't used for analysis anyway. More critically,
current US regulations would say that data including a procedure,
sex, age, and admission date cannot be moved off a secure site,
as together they are potentially enough information to identify
an individual. Sometimes I am able to have this data locally,
other times I am not. In the latter case, I sometimes create a
fake dataset which has exactly the same variables, but with random
data replacing the real data. Then I can develop programs locally
and run them remotely on the server, as working remotely is very
inefficient.

If I am only concerned about age, sex, and admission date, it's
fairly simple to replace each with random variables that have the
same mean as the real ones and just go from there. Other times,
a dataset may have tens of variables (e.g., lab data) that must
be masked. So I would like a utility that is invoked like this:

 mask [varlist]

and it does all the work for me. This is fairly straightforward
to write, but I thought that people who'd thought about this might
have found more clever ways to ensure that a real dataset cannot
possibly be re-engineered from a fake one.
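
As a rough illustration of the per-variable replacement described
above, here is a minimal Python sketch (not Stata, and not an existing
utility). The function name mask, the list-of-dicts data layout, and
the choice of normal draws are all assumptions made for the example.
It matches each column's mean and standard deviation only in
expectation, and it keeps the pattern of missing values.

```python
import random
import statistics


def mask(rows, columns, seed=None):
    """Return a copy of rows in which the named numeric columns are
    replaced by normal draws using each column's observed mean and SD.

    Hypothetical helper: rows is a list of dicts, None marks a missing value.
    """
    rng = random.Random(seed)
    masked = [dict(row) for row in rows]  # shallow copies; input is untouched
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        if not observed:
            continue  # nothing to imitate; the column stays all-missing
        mean = statistics.fmean(observed)
        sd = statistics.stdev(observed) if len(observed) > 1 else 0.0
        for row in masked:
            # Keep missing values missing so code that handles them still runs.
            if row[col] is not None:
                row[col] = rng.gauss(mean, sd)
    return masked


claims = [{"age": 30, "sex": 1}, {"age": 50, "sex": 0}, {"age": None, "sex": 1}]
fake = mask(claims, ["age"], seed=1)
```

Note that the fake values still reflect the real mean and SD, and
integer or categorical variables come back as floats, so whether this
is "fake enough" depends on the rules that apply to the real data.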

cheers,
Jeph


Daljit Dhadwal wrote:

It sounds like you're trying to create anonymized data sets.  There
are lots of different names for the techniques for doing this: data
masking, data anonymization, data obfuscation, data de-identification,
data depersonalization, data scrubbing, and data scrambling.

Here’s the Wikipedia article on data masking:
http://en.wikipedia.org/wiki/Data_masking

Here’s a good PowerPoint presentation that discusses some of the
techniques used in data masking:
http://www.cs.uky.edu/events/dmSec08.ppt

Thanks,

Daljit


On Sun, Nov 8, 2009 at 9:32 AM, Martin Weiss <[email protected]> wrote:

<>



*************
h clonevar
*************

comes to mind...


HTH
Martin


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Jeph Herrin
Sent: Sunday, 8 November 2009 18:20
To: [email protected]
Subject: st: utility to create fake dataset?


I sometimes need to create a "fake" dataset that "looks"
like an existing dataset. For example, a dataset may have to
remain on a remote server for health privacy reasons, while
I would like to develop code locally to run on it.
Or, I need to make mock tables to share with colleagues
who need to remain blinded for now to actual study data.

Usually, I just do something that seems "good enough", like
sample 5%, expand 20, replace values with random values, etc.
Or, in an extreme case, set obs to be twice the existing obs
and keep the ones with missing data. But the first is not
very satisfying when I need to reassure higher powers that
I have a "dummy" dataset, and the second is not very helpful
for writing final usable code.
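
The two ad hoc approaches above can be sketched roughly as follows, in
Python rather than Stata. The function names, the list-of-dicts
layout, and the defaults (5% sample, 20 copies) are assumptions for
illustration only. The first function still returns real values until
a separate replacement step is applied. The second returns a dataset
with the same variables and number of rows, but nothing in it.

```python
import random


def sample_and_expand(rows, fraction=0.05, copies=20, seed=None):
    """Keep a random fraction of the rows, then repeat each kept row
    `copies` times (hypothetical helper; values are still the real ones)."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction)) if rows else 0
    kept = rng.sample(rows, k)
    return [dict(row) for row in kept for _ in range(copies)]


def empty_shell(rows):
    """Same number of rows and the same variable names, every value missing."""
    return [{name: None for name in row} for row in rows]


data = [{"id": i, "x": i * 2} for i in range(100)]
resampled = sample_and_expand(data, seed=0)  # 5 distinct rows, 20 copies each
shell = empty_shell(data)                    # 100 rows, all values None
```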

So, I'm thinking I'll write a utility to create a 'dummy'
dataset from an existing dataset, but wondered if there was
something out there already. Perhaps there is even a well
established name for this process? My searches for "dummy"
and "fake" dataset have not been fruitful.

thanks,
Jeph


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


References:
- st: utility to create fake dataset?
  - From: Jeph Herrin <[email protected]>
- Re: st: AW: utility to create fake dataset?
  - From: Daljit Dhadwal <[email protected]>