Hi,
I agree with all the critique below, except for the claim that looping is not
required.
First, here is a counterexample to the code submitted by Austin Nichols:
***----------------------------------------------------
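* Four observations: the first three share one value, the fourth is one epsfloat above
input var2 label
1 1111
2 2222
3 3333
4 4444
end
replace var2=c(epsfloat) in 1/3
replace var2=c(epsfloat)+c(epsfloat) in 4
list
* Groups before the adjustment (two distinct values)
egen oldgroup=group(var2)
* The submitted fix: shift each repeat by its rank within the group
bys var2: gen obs=_n
replace var2=var2+c(epsfloat)*obs
* Groups after the adjustment (should be four, but there are only three)
egen newgroup=group(var2)
list, noobs clean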
***----------------------------------------------------
Variable var1 is omitted as inessential. Assume it takes the same value in
all observations.
I kept the original replace statement, although it distorts the data even
when that is not necessary (the first observation in each group is shifted
too). A better statement would be:
replace var2=var2+c(epsfloat)*(obs-1)
Let's approach this problem abstractly. If you have several repeating values
of var2, you have to replace them by OTHER values not in the data. The trick
here is how to choose the OTHER. How can we be sure that OTHER is not
somewhere else in our data? A special case is when we know something about
our data (e.g. all numbers are integers; then we can generate fractions
var2+(obs-1)*C, where C is a constant small enough to accommodate all
repetitions). However, if absolutely nothing is known about the data, one
must take the next (+delta) value of var2 and CHECK whether this value is in
the data. If it is not, then one can use this value. But if it is (!), one
must SEARCH further (LOOP) until a proper value is found.
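The check-and-search idea above can be sketched in Stata as a loop over passes through the data rather than over single observations. This is only an illustration under stated assumptions, not a recommended solution: the names v, dup and ndup are hypothetical, delta is taken to be c(epsfloat), and var2 is assumed to be small enough in magnitude that adding c(epsfloat) to a double actually changes it (otherwise the loop would never end).

```stata
* Sketch only: v, dup and ndup are illustrative names, not from the thread.
* Assumes v + c(epsfloat) differs from v in double precision for all data values.
generate double v = var2
* dup is 0 for the first copy of each value and positive for every repeat
bysort v: generate long dup = _n - 1
count if dup > 0 & !missing(v)
local ndup = r(N)
while `ndup' > 0 {
    * move every repeated copy one step up ...
    replace v = v + c(epsfloat) if dup > 0
    * ... then CHECK again: a moved copy may have landed on a value
    * that is already in the data, in which case the SEARCH continues
    drop dup
    bysort v: generate long dup = _n - 1
    count if dup > 0 & !missing(v)
    local ndup = r(N)
}
```

Run on the counterexample data as it stands before the submitted fix is applied, this should take three passes and end with four distinct values. Missing values are left untouched and are excluded from the stopping rule.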
Notice that observation no. 4 is constructed to have exactly the same value
as the replacement value for one of the repetitions. That is why the program
fails, producing only three groups for four observations: the observation
labelled 2222 collides with the observation labelled 4444:
    var2       label   oldgroup   obs   newgroup
    2.38e-07   1111    1          1     1
    3.58e-07   2222    1          2     2
    4.77e-07   3333    1          3     3
    3.58e-07   4444    2          1     2
It might be that there is a Stata command that will generate a proper OTHER
value, but the implementation of this command must necessarily involve a
loop to probe the candidates. So it changes nothing in the above logic --
it's a mere redistribution between new and already existing code.
This situation is rather artificial, though it has one practical
application: choosing names for temporary variables and files. Say you want
to save your data temporarily. What filename should you choose? One can
choose "jf34kljhd894" and pray that it is not a copy of the Windows
registry :). But one can also start with file0001.dat and increment 0001
until (loop) an empty slot is found. The Windows API has a standard function
that returns a temporary filename (see your browser's cache for examples). I
guess Stata just uses it, but behind the scenes the same search is going on.
The benefit of that function shows in a multi-process environment. With the
naive search, two processes can both ask whether file0014.dat exists and
both get a negative answer. They will then both start writing to the same
file and collide. To avoid this, the Windows API must remember which
filenames it has already reported to other processes as available, and not
report them as free again until it has gone through all filenames not yet
reported. (No econometrics here whatsoever.) There is no such difficulty
with temporary variable names in Stata, since only one do-file is executed
at any time (one process). A simple check (say, starting from _000001) will
soon yield an unused name (still with a loop involved).
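A minimal sketch of that simple check for variable names, assuming the _000001, _000002, ... naming pattern mentioned above. The local macros i and name are hypothetical, and the final generate only claims the slot so that a later search would skip it.

```stata
* Sketch only: the locals i and name are illustrative.
local i = 1
local name = "_" + string(`i', "%06.0f")
* -confirm new variable- returns a nonzero _rc if the name is already taken
capture confirm new variable `name'
while _rc {
    local ++i
    local name = "_" + string(`i', "%06.0f")
    capture confirm new variable `name'
}
* claim the first unused name
generate byte `name' = 0
display "first unused name: `name'"
```

In everyday work Stata's own -tempvar- and -tempfile- commands hand out such names, so a hand-written search like this is rarely needed.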
Obviously the distortion raised in critique point #3 can be reduced by
searching in both directions (positive and negative).
Hopefully this will close the topic.
Best regards, Sergiy
----- Original Message -----
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Sent: Monday, February 26, 2007 6:58 PM
Subject: RE: st: Re: bysort problem
This thread is extraordinarily frustrating.
I still am not clear on what is desired and
on what is seen to be a problem.
Nikolaos stated at one point that he wanted
to eliminate duplicates. If this means -drop-
them from the data, then -duplicates drop-
is available in Stata, although writing your own code
would be instructive.
But it seems to mean "make them different", but
adding different small constants and then adding noise
have both been seized upon as solutions. Are
they equally attractive or appropriate?
At the risk of complicating an already convoluted
thread, I add further comments:
0. If `E' and `SE' are some kind of identifier, then some
coding as unique integers is likely to be optimal (and comments
below are irrelevant).
1. Changing the data needs to be justified.
2. Adding different constants and adding random
noise are not reproducible without further
constraints. The first depends on sort order
and the second on seed and time.
3. Adding even small amounts that are all positive
changes any location parameter for any variable.
I can't encourage any of the solutions offered
without knowing that there is an answer to 1 and
that 2 and 3 don't (won't) matter. But if 2 and 3
don't matter, why do all this in the first place?
Whatever the precise problem, I am confident,
with Austin Nichols, that _no_ looping should be required.
Nick
n.j.cox@durham.ac.uk
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/