Statalist The Stata Listserver


Re: st: Re: bysort problem

From   "Sergiy Radyakin" <>
To   <>
Subject   Re: st: Re: bysort problem
Date   Mon, 26 Feb 2007 19:49:54 +0100


I agree with all of the critique below, except for the claim that looping is not required.
First, here is a counterexample to the code submitted by Austin Nichols:
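clear
input var2 label
1 1111
2 2222
3 3333
4 4444
end

* three repetitions of one value, plus a fourth value one step above it
replace var2=c(epsfloat) in 1/3
replace var2=c(epsfloat)+c(epsfloat) in 4

egen oldgroup=group(var2)
bys var2: gen obs=_n
* the replacement statement under test
replace var2=var2+c(epsfloat)*obs
egen newgroup=group(var2)
li, noo clean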

I dropped variable var1 because it is not essential to the example. Assume it takes the same value in all observations.

I kept the original replace statement, although it distorts the data even where that is not necessary (the first observation in each group is shifted too). A better statement: replace var2=var2+c(epsfloat)*(obs-1)
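As a minimal sketch (the variable names var2_new and newgroup2 are mine, not from the thread), the (obs-1) variant can be run on the same test data. It leaves the first observation of each group untouched, but it does not remove the collision:

```stata
* Sketch only: same test data as in the counterexample, with the (obs-1) offset
clear
input var2 label
1 1111
2 2222
3 3333
4 4444
end
replace var2 = c(epsfloat) in 1/3
replace var2 = c(epsfloat) + c(epsfloat) in 4

* sort on label within var2 so the numbering of repeats is reproducible
bysort var2 (label): generate obs = _n
generate float var2_new = var2 + c(epsfloat)*(obs - 1)

* the first observation of each group keeps its original value
assert var2_new == var2 if obs == 1

* 2222 is still pushed onto the value already held by 4444
egen newgroup2 = group(var2_new)
list, noobs clean
```

So the better statement fixes the unnecessary distortion, not the uniqueness problem.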

Let's approach this problem abstractly. If var2 contains several repeated values, you have to replace the repeats with OTHER values that are not already in the data. The trick is how to choose the OTHER: how can we be sure that it does not occur somewhere else in our data? A special case is when we know something about the data (e.g. all numbers are integers -- then we can generate fractions var2+(obs-1)*C, where C is a constant small enough to accommodate all repetitions). However, if absolutely nothing is known about the data, one must take the next (+delta) value of var2 and CHECK whether this value is in the data. If not, then one can use this value. But if it is (!), one must SEARCH further (LOOP) until a proper value is found.
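For the integer special case, a loop-free sketch might look as follows. The data values, the variable names (n, var2_new) and the choice C = 1/(largest group size + 1) are my assumptions for illustration; any C that keeps the largest offset below 1 would do.

```stata
* Sketch only: assumes var2 holds nothing but integers
clear
input var2
10
10
10
11
12
12
end

bysort var2: generate obs = _n
by var2: generate n = _N

* with C = 1/(max group size + 1) the largest offset stays below 1,
* so no shifted value can reach the next integer
quietly summarize n
generate double var2_new = var2 + (obs - 1)/(r(max) + 1)

* isid exits with an error if var2_new is not unique
isid var2_new
list, noobs clean
```

This works only because the gaps between distinct values are known in advance; without that knowledge the check cannot be skipped.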

Notice that observation no. 4 was constructed so that, after the replacement, it lands on exactly the same value as one of the shifted repetitions. That is why the program fails and produces only three groups instead of four: the observation labelled 2222 collides with the observation labelled 4444:

    var2       label   oldgroup   obs   newgroup
    2.38e-07    1111          1     1          1
    3.58e-07    2222          1     2          2
    4.77e-07    3333          1     3          3
    3.58e-07    4444          2     1          2

It might be that there is a Stata command that will generate a proper OTHER value, but the implementation of this command must necessarily involve a loop to probe the candidates. So it changes nothing in the above logic -- it's a mere redistribution between new and already existing code.
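A rough sketch of such a probing loop, written directly in Stata, is shown here. The helper variable dup, the locals ndup and iter, and the cap of 100 passes are my own choices. Note also that adding c(epsfloat) only reliably changes a float whose magnitude is below roughly 2, so the cap guards against an endless loop on larger values.

```stata
* Sketch only: bump repeats by one step and re-check until no value repeats
clear
input var2 label
1 1111
2 2222
3 3333
4 4444
end
replace var2 = c(epsfloat) in 1/3
replace var2 = c(epsfloat) + c(epsfloat) in 4

local ndup = 1
local iter = 0
while `ndup' > 0 & `iter' < 100 {
    capture drop dup
    * flag every copy after the first; label breaks ties reproducibly
    bysort var2 (label): generate byte dup = _n > 1
    quietly replace var2 = var2 + c(epsfloat) if dup
    quietly count if dup
    local ndup = r(N)
    local ++iter
}

* exits with an error if the loop stopped at the cap with repeats left
isid var2
list var2 label, noobs clean
```

On the test data this needs several passes, because each shift can create a new collision (here with 4444), which is exactly the CHECK and SEARCH step described above.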

This situation is pretty artificial, though it has one practical application: temporary variables and filenames. Say you want to save your data temporarily. What filename should you choose? One can choose "jf34kljhd894" and pray that it is not a copy of the Windows registry :). But one can also start with file0001.dat and increment 0001 until (loop) an empty slot is found. The Windows API has a standard function that returns a temporary filename (see your browser's cache for examples). I guess Stata is just using it, but behind the scenes the same search is going on. The benefit of that function shows in a multi-process environment: without it, two processes can both ask whether file0014.dat exists and both get a negative answer, and they will then both start writing to the same file and collide. To avoid this, the Windows API must remember which filenames it has already reported as available to other processes, and not report them as free again until it has gone through all the filenames not yet reported. (No econometrics here, whatsoever.) There is no such difficulty with temporary variable names in Stata, since only one do-file is executed at any time (one process). A simple check (say, starting from _000001) will soon yield an unused name (still a loop involved).
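The simple check for variable names could be sketched like this, to be run from a do-file. The locals i and name are hypothetical, and the variable _000001 is created only to simulate a name that is already taken. The last lines show the built-in tempvar and tempfile commands, which hand out unused names without any visible loop.

```stata
* Sketch only: probe _000001, _000002, ... until a name is free
clear
set obs 3
* pretend the first candidate is already in use
generate _000001 = 0

local i = 1
local name = "_" + string(`i', "%06.0f")
capture confirm new variable `name'
while _rc != 0 {
    local ++i
    local name = "_" + string(`i', "%06.0f")
    capture confirm new variable `name'
}
generate `name' = .
display "first free name: `name'"

* built-in alternative: Stata chooses the temporary names itself
tempvar tmp
tempfile tmpdata
generate `tmp' = 1
save `tmpdata'
```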

Obviously critique point #3 can be reduced by searching in both directions (positive and negative).
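As an illustration only, and again under the integer assumption, offsets that alternate in sign (0, +C, -C, +2C, -2C, ...) keep the shifted values centred on the original one. The names j, n, step and var2_new, and the choice of step, are mine.

```stata
* Sketch only: alternate positive and negative offsets within each group
clear
input var2
10
10
10
11
end

bysort var2: generate j = _n - 1
by var2: generate n = _N

* step = 1/(max group size + 1) keeps every offset below 0.5 in magnitude
quietly summarize n
scalar step = 1/(r(max) + 1)
generate double var2_new = var2 + step*ceil(j/2)*cond(mod(j, 2) == 1, 1, -1)

isid var2_new
list, noobs clean
```

With an odd number of repeats the group mean is unchanged; with an even number a small net shift of the mean remains, so the critique is reduced, not eliminated.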

Hopefully this will close the topic.

Best regards, Sergiy

----- Original Message -----
From: "Nick Cox" <>
To: <>
Sent: Monday, February 26, 2007 6:58 PM
Subject: RE: st: Re: bysort problem

This thread is extraordinarily frustrating.
I still am not clear on what is desired and
on what is seen to be a problem.

Nikolaos stated at one point that he wanted
to eliminate duplicates. If this means -drop-
them from the data, then -duplicates drop-
is available in Stata, although writing your own code
would be instructive.
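A minimal sketch of the -duplicates- route, using made-up values for the E and SE variables mentioned later in this message:

```stata
* Sketch only: hypothetical data with one exact duplicate pair
clear
input E SE
1.5 0.2
1.5 0.2
2.0 0.3
end

* how many surplus copies are there?
duplicates report E SE

* keep one observation per distinct (E, SE) pair;
* force is required when a varlist is given
duplicates drop E SE, force
list, noobs clean
```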

But it seems to mean "make them different", and
adding different small constants and then adding noise
have both been seized upon as solutions. Are
they equally attractive or appropriate?

At the risk of complicating an already convoluted
thread, I add further comments:

0. If `E' and `SE' are some kind of identifier, then some
coding as unique integers is likely to be optimal (and comments
below are irrelevant).

1. Changing the data needs to be justified.

2. Adding different constants and adding random
noise are not reproducible without further
constraints. The first depends on sort order
and the second on seed and time.

3. Adding even small amounts that are all positive
changes any location parameter for any variable.
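Points 0 and 2 can be sketched as follows. The data values, the variable names id, origorder and E_jit, the seed and the noise scale 1e-6 are all illustrative assumptions, not anything proposed in the thread.

```stata
* Sketch only: hypothetical data
clear
input E SE
1.5 0.2
1.5 0.2
2.0 0.3
end

* point 0: code each distinct (E, SE) pair as an integer identifier
egen id = group(E SE)

* point 2: pin down the sort order and the seed so that a
* noise-based approach gives the same result on every run
generate long origorder = _n
sort E SE origorder
set seed 12345
generate double E_jit = E + 1e-6*runiform()
list, noobs clean
```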

I can't encourage any of the solutions offered
without knowing that there is an answer to 1 and
that 2 and 3 don't (won't) matter. But if 2 and 3
don't matter, why do all this in the first place?

Whatever the precise problem, I am confident,
with Austin Nichols, that _no_ looping should be required.

