Stata 11 help for datasignature

help datasignature dialog: datasignature -------------------------------------------------------------------------------

Title

[D] datasignature -- Determine whether data have changed

Syntax

datasignature

datasignature set [, reset ]

datasignature confirm [, strict ]

datasignature report

datasignature set, saving(filename[, replace]) [ reset ]

datasignature confirm using filename [, strict ]

datasignature report using filename

datasignature clear

+---------------------------------------------------------------------+ | Note: datasignature was introduced during the Stata 9 release. | | This is not that command. The new datasignature | | command is easier to use and has new capabilities. | | | | The original is now named _datasignature and is documented | | in [P] _datasignature. Under version control, datasignature | | becomes _datasignature. | | | | Programmers will still be interested in [P] _datasignature. | | datasignature is implemented in terms of _datasignature. | +---------------------------------------------------------------------+

Menu

Data > Other utilities > Manage data signature

Description

These commands calculate, display, save, and verify checksums of the data, which taken together form what is called a signature. An example signature is 162:11(12321):2725060400:4007406597. That signature is a function of the values of the variables and their names, and thus the signature can be used later to determine whether a dataset has changed.

datasignature without arguments calculates and displays the signature of the data in memory.

datasignature set does the same, and it stores the signature as a characteristic in the dataset. You should save the dataset afterward so that the signature becomes a permanent part of the dataset.

datasignature confirm verifies that, were the signature recalculated this instant, it would match the one previously set. datasignature confirm displays an error message and returns a nonzero return code if the signatures do not match.

datasignature report displays a full report comparing the previously set signature to the current one.

In the above, the signature is stored in the dataset and accessed from it. The signature can also be stored in a separate, small file.

datasignature set, saving(filename) calculates and displays the signature and, in addition to storing it as a characteristic in the dataset, also saves the signature in filename.

datasignature confirm using filename verifies that the current signature matches the one stored in filename.

datasignature report using filename displays a full report comparing the current signature with the one stored in filename.

In all the above, if filename is specified without an extension, .dtasig is assumed.

datasignature clear clears the signature, if any, stored in the characteristics of the dataset in memory.

Options

reset is used with datasignature set. It specifies that even though you have previously set a signature, you want to erase the old signature and replace it with the current one.

strict is for use with datasignature confirm. It specifies that, in addition to requiring that the signatures match, you also wish to require that the variables be in the same order and that no new variables have been added to the dataset. (If any variables were dropped, the signatures would not match.)

saving(filename[, replace]) is used with datasignature set. It specifies that, in addition to storing the signature in the dataset, you want a copy of the signature saved in a separate file. If filename is specified without a suffix, .dtasig is assumed. The replace suboption allows filename to be replaced if it already exists.

Remarks

Remarks are presented under the following headings:

Using datasignature interactively Example 1: Verification at a distance Example 2: Protecting yourself from yourself Example 3: Working with assistants Example 4: Working with shared data Using datasignature in do-files Interpreting data signatures The logic of data signatures

Using datasignature interactively

datasignature is useful in the following cases:

1. You and a coworker, separated by distance, have both received what is claimed to be the same dataset. You wish to verify that it is.

2. You work interactively and realize that you could mistakenly modify your data. You wish to guard against that.

3. You want to give your dataset to an assistant to improve the labels and the like. You wish to verify that the data returned to you are the same data.

4. You work with an important dataset served on a network drive. You wish to verify that others have not changed it.

Example 1: Verification at a distance

You load the data and type

. datasignature 74:12(71728):3831085005:1395876116

Your coworker does the same with his or her copy. You compare the two signatures.

Example 2: Protecting yourself from yourself

You load the data and type

. datasignature set 74:12(71728):3831085005:1395876116 (data signature set)

. save, replace

From then on, you periodically type

. datasignature confirm (data unchanged since 19feb2007 14:24)

One day, however, you check and see the message:

. datasignature confirm (data unchanged since 19feb2007 14:24, except 2 variables have been added)

You can find out more by typing

. datasignature report (data signature set on Monday 19feb2007 14:24)

Data signature summary

1. Previous data signature 74:12(71728):3831085005:1395876116 2. Same data signature today (same as 1) 3. Full data signature today 74:14(113906):1142538197:2410350265

Comparison of current data with previously set data signature

Variables No. Notes ------------------------------------------------------------ Original # of variables 12 (values unchanged) Added variables 2 (note 1) Dropped variables 0 ------------------------------------------------------------ Resulting # of variables 14

(1) Added variables are agesquared logincome.

You could now either drop the added variables or decide to incorporate them:

. datasignature set data signature already set -- specify option -reset- r(198)

. datasignature set, reset 74:14(113906):1142538197:2410350265 (data signature reset)

Concerning the detailed report, three data signatures are reported: 1) the stored signature, 2) the signature that would be calculated today on the basis of the same variables in their original order, and (3) the signature that would be calculated today on the basis of all the variables and in their current order.

datasignature confirm knew that new variables had been added because 1) was equal to 2). If some variables had been dropped, however, datasignature confirm would not be able to determine whether the remaining variables had changed.

Example 3: Working with assistants

You give your dataset to an assistant to have variable labels and the like added. You wish to verify that the returned data are the same data.

Saving the signature with the dataset is inadequate here. Your assistant, having your dataset, could change both your data and the signature and might even do that in a desire to be helpful. The solution is to save the signature in a separate file that you do not give to your assistant:

. datasignature set, saving(mycopy) 74:12(71728):3831085005:1395876116 (data signature set) (file mycopy.dtasig saved)

You keep file mycopy.dtasig. When your assistant returns the dataset to you, you use it and compare the current signature to what you have stored in mycopy.dtasig:

. datasignature confirm using mycopy (data unchanged since 19feb2007 15:05)

By the way, the signature is a function of the following:

o The number of observations and number of variables in the data

o The values of the variables

o The names of the variables

o The order in which the variables occur in the dataset

o The storage types of the individual variables

The signature is not a function of variable labels, value labels, notes, and the like.

Example 4: Working with shared data

You work on a dataset served on a network drive, which means that others could change the data. You wish to know whether this occurs.

The solution here is the same as working with an assistant: you save the signature in a separate, private file on your computer,

. datasignature set, saving(private) 74:12(71728):3831085005:1395876116 (data signature set) (file private.dtasig saved)

and then you periodically check the signature by typing

. datasignature confirm using private (data unchanged since 15mar2007 11:22)

Using datasignature in do-files

datasignature confirm aborts with error if the signatures do not match:

. datasignature confirm data have changed since 19feb2007 15:05 r(9);

This means that, if you use datasignature confirm in a do-file, execution of the do-file will be stopped if the data have changed.

You may want to specify the strict option. strict adds two more requirements: that the variables be in the same order and that no new variables have been added. Without strict, these are not considered errors:

. datasignature confirm (data unchanged since 19feb2007 15:22)

. datasignature confirm, strict (data unchanged since 19feb2007 15:05, but order of variables has changed) r(9);

and

. datasignature confirm (data unchanged since 19feb2007 15:22, except 1 variable has been added) . datasignature confirm, strict (data unchanged since 19feb2007 15:22, except 1 variable has been added) r(9);

If you keep logs of your analyses, issuing datasignature or datasignature confirm immediately after loading each dataset is a good idea. This way, you have a permanent record that you can use for comparison.

Interpreting data signatures

An example signature is 74:12(71728):3831085005:1395876116. The components are

1. 74, the number of observations;

2. 12, the number of variables;

3. 71728, a checksum function of the variable names and the order in which they occur; and

4. 3831085005 and 1395876116, checksum functions of the values of the variables, calculated two different ways.

Two signatures are equal only if all their components are equal.

Two different datasets will probably not have the same signature, and it is even more unlikely that datasets containing similar values will have equal signatures. There are two data checksums, but do not read too much into that. If either data checksum changes, even just a little, the data have changed. Whether the change in the checksum is large or small -- or in one, the other, or both -- signifies nothing.

The logic of data signatures

The components of a data signature are known as checksums. The checksums are many-to-one mappings of the data onto the integers. Let's consider the checksums of auto.dta carefully.

The data portion of auto.dta contains 38,184 bytes. There are 256^38184 such datasets or, equivalently, 2^305472. The first checksum has 2^48 possible values, and it can be proven that those values are equally distributed over the 2^305472 datasets. Thus there are 2^305472/2^48 - 1 = 2^305424 - 1 datasets that have the same first checksum value as auto.dta. The same can be said for the second checksum. It would be difficult to prove, but we believe that the two checksums are conditionally independent, being based on different bit shifts and bit shuffles of the same data. Of the 2^305424 - 1 datasets that have the same first checksum as auto.dta, the second checksum should be equally distributed over them. Thus there are about 2^305376 - 1 datasets with the same first and second checksums as auto.dta.

Now let's consider those 2^305376 - 1 other datasets. Most of them look nothing like auto.dta. The checksum formulas guarantee that a change of one variable in 1 observation will lead to a change in the calculated result if the value changed is stored in 4 or fewer bytes, and they nearly guarantee it in other cases. When it is not guaranteed, the change cannot be subtle -- "Chevrolet" will have to change to binary junk, or a double-precision 1 to -6.476678983751e+301, and so on. The change will be easily detected if you summarize your data and just glance at the minimums and maximums. If the data look at all like auto.dta, which is unlikely, they will look like a corrupted version.

More interesting are offsetting changes across observations. For instance, can you change one variable in 1 observation and make an offsetting change in another observation so that, taken together, they will go undetected? You can fool one of the checksums, but fooling both of them simultaneously will prove difficult. The basic rule is that the more changes you make, the easier it is to create a dataset with the same checksums as auto.dta, but by the time you've done that, the data will look nothing like auto.dta.

Saved results

datasignature without arguments and datasignature set save the following in r():

Macros r(datasignature) the signature

datasignature confirm saves the following in r():

Scalars r(added) number of variables added

Macros r(datasignature) the signature

datasignature confirm aborts execution if the signatures do not match and so then returns nothing except a return code of 9.

datasignature report saves the following in r():

Scalars r(datetime) %tc date-time when set r(changed) . if r(k_dropped)!=0, otherwise 0 if data have not changed, 1 if data have changed r(reordered) 1 if variables reordered, 0 if not reordered, . if r(k_added)!=0 | r(k_dropped)!=0 r(k_original) number of original variables r(k_added) number of added variables r(k_dropped) number of dropped variables Macros r(origdatasignature) original signature r(curdatasignature) current signature on same variables, if it can be calculated r(fulldatasignature) current full-data signature r(varsadded) variable names added r(varsdropped) variable names dropped

datasignature clear saves nothing in r() but does clear it.

datasignature set stores the signature in the following characteristics:

Characteristic _dta[datasignature_si] signature _dta[datasignature_dt] %tc date-time when set in %21x format _dta[datasignature_vl1] part 1, original variables _dta[datasignature_vl2] part 2, original variables, if necessary etc.

To access the original variables stored in _dta[datasignature_vl1], etc., from an ado-file, code

mata: ado_fromlchar("vars", _dta", "datasignature_vl")

Thereafter, the original variable list would be found in `vars'.

Methods and formulas

datasignature is implemented using _datasignature; see [P] _datasignature.

Also see

Manual: [D] datasignature

Help: [P] _datasignature, [P] signestimationsample


© Copyright 1996–2009 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index