
From: "Cruces,GA (pgr)" <G.A.Cruces@lse.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: "Serial" generator - was: Conversion to Hexadecimal
Date: Sun, 20 Jul 2003 21:10:43 +0100

Following today's exchange with N. Cox, I provide below code based 99.9% on his hex and base functions for egen (but of course if it fails, it's due to my 0.1%). Please note that it has not been fully tested.

I called this "serial", since I use it to create a "serial number" string, but in fact what it does is convert a decimal integer into a base 36 number (digits 0 to 9 and then a to z). I already explained my reasons for preferring this type of identifier. The relative cost in terms of efficiency is not huge, at least for the type of data I use: with 2.3 million observations, I need a long to store a numerical id (4 bytes), and only 25% more (a str5) to store my base 36 id (with five bytes you can get up to 36^5 = 60,466,176 unique identifiers, if I'm not wrong). For datasets with many more observations the cost in terms of memory may be much higher.

An advantage is that by slightly changing the code below, you can produce almost "untraceable" ids: for instance, if id numbers are relevant but the data producer wants to mask their real values (say, if they have any meaning in the source dataset), modifying the order or the content of the string "abcdefghijklmnopqrstuvwxyz" below will produce such ids (though the same is true, of course, for N. Cox's hex function). Only the person holding the original key should be able to trace the original number (provided the data are "de-sorted" by id, of course).

This is working for me. I hope there are no mistakes in the code below and that it might be useful to someone else.

best,
g.

*! 1.0.0 NJC 20 July 2003
* Heavily based on _ghex by N. Cox
* MODIFIED BY G CRUCES 20 July 2003
program define _gserial
    version 6.0
    gettoken type 0 : 0
    gettoken g 0 : 0
    gettoken eqs 0 : 0
    syntax varname(numeric) [if] [in]
    marksample touse

    * ignores type passed from -egen-
    local type "str1"
    local base = 36

    capture assert `varlist' == int(`varlist') if `touse'
    if _rc {
        di in r "`varlist' invalid: not integer"
        exit 459
    }

    * a sign prefix is added only if some value is negative
    capture assert `varlist' >= 0 if `touse'
    local sign = _rc != 0

    quietly {
        tempvar work digit
        gen `type' `g' = ""
        gen long `work' = `varlist' if `touse'
        gen int `digit' = .

        * find the highest power of the base needed for the largest value
        su `work', meanonly
        local max = max(`r(max)',-`r(min)')
        local power = 0
        while `max' >= (`base'^(`power' + 1)) {
            local power = `power' + 1
        }

        if `sign' {
            replace `g' = `g' + cond(`work' < 0, "-","+") if `touse'
            replace `work' = abs(`work')
        }

        * build the string from the most significant digit down
        while `power' >= 0 {
            replace `digit' = int(`work' / `base'^`power')
            replace `work' = mod(`work', `base'^`power')
            replace `g' = `g' + /*
                */ string(`digit') if `touse' & `digit' <= 9
            * CHANGE or REORDER THE "abcd...vwxyz" below
            * to produce "untraceable" IDs
            replace `g' = `g' + /*
                */ substr("abcdefghijklmnopqrstuvwxyz", `digit' - 9, 1) /*
                */ if `touse' & `digit' >= 10
            local power = `power' - 1
        }

        * drop one leading zero, if present
        replace `g' = substr(`g',2,.) if substr(`g',1,1) == "0"
    }
end
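For anyone who wants to try it, here is a minimal sketch of how the function above might be called. It assumes the program has been saved as _gserial.ado somewhere on the adopath, so that egen can find it as serial(). The variable names id and id36 and the toy data are made up for illustration only; nothing here has been tested beyond what is said above.

```stata
* Hypothetical toy data: five integer ids (names are illustrative only)
clear
set obs 5
generate long id = _n * 1000

* Assumes the program above is saved as _gserial.ado on the adopath
egen id36 = serial(id)

* Capacity check: five base 36 digits give 36^5 distinct values
display 36^5

* Compare the numeric id with its base 36 string
list id id36
```

The display line is just the arithmetic behind the str5 storage argument: it should show 60466176, the number of identifiers that fit in five base 36 digits.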
