
Re: st: Re: bysort problem

From	"Austin Nichols" <[email protected]>
To	[email protected]
Subject	Re: st: Re: bysort problem
Date	Mon, 26 Feb 2007 14:20:37 -0500

Sergiy Radyakin et al.--

I agree with everything Nick Cox said earlier in this thread, and
especially with the statement that the thread is extraordinarily
frustrating.  I don't think the exercise is necessary, i.e. I don't
think any real data management or statistical application requires
replacing identical values of a variable with distinct values. Also,
for the outcome to be reproducible, we need to prespecify the sort
order, regardless of the method used.
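
As a minimal sketch of what prespecifying the sort order could look
like: the toy data below are made up, and the sketch assumes that a
second variable (here label, as in the example quoted further down)
uniquely identifies observations. Without the variable in parentheses,
Stata breaks ties within var2 arbitrarily, so obs could differ from
run to run.

```stata
* Sketch only: toy data, with -label- assumed to identify observations.
clear
input var2 label
1 1111
1 2222
1 3333
2 4444
end
* Stop with an error if label does not uniquely identify observations.
isid label
* Within each group of identical var2, order by label before numbering.
bysort var2 (label): generate obs = _n
list, noobs clean
```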

That said, Sergiy seems to have some confusion about what it means for
two values to be different (in any real sense). His "counterexample"
exploits the limiting case I alluded to earlier: adjacent values of
var2 differ by the minimum amount that distinguishes values, which is
c(epsfloat) in the example. In that case you would have to switch to
incrementing by c(epsdouble); and if adjacent values differ only by
c(epsdouble), you would have to increment by some smaller amount
(presumably in the Mata environment).

Put another way, in the contrived counterexample, all values of var2
are essentially zero, which necessitates (in the case shown) adding
c(epsdouble)*obs rather than c(epsfloat)*obs. If some values of var2
differ by no more than the largest increment, max(c(epsdouble)*obs),
then you are in real trouble, and might consider rounding var2 to
float precision before engaging in this pointless exercise.
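
A rough sketch of that switch, using the same four observations as the
counterexample. It assumes var2 is stored as a double; -input- creates
a float by default, and in float storage an increment as small as
c(epsdouble) would be rounded away for these values. Ordering within
var2 by label is also my assumption, made so that the result is
reproducible.

```stata
* Sketch only: same data as the counterexample, but var2 held as double.
clear
input double var2 label
1 1111
2 2222
3 3333
4 4444
end
replace var2 = c(epsfloat) in 1/3
replace var2 = 2*c(epsfloat) in 4
* Number observations within each group of identical var2, ordered by label.
bysort var2 (label): generate obs = _n
* Increment by multiples of c(epsdouble) instead of c(epsfloat).
replace var2 = var2 + c(epsdouble)*obs
* Stop with an error if any two values of var2 still coincide.
isid var2
```

This works here only because the values are so close to zero. For
values of var2 of magnitude 2 or more, adding c(epsdouble)*obs can
itself be lost to rounding, so the sketch is not a general recipe.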

If you care about differences in var2 of this minuscule magnitude,
then anything you add or subtract from var2 is obviously going to
screw up any future calculations.  If you don't care about the actual
values in var2, and want them merely to be identifiers, then they
should be replaced by integers that step from 1 to _N (which yes, are
guaranteed to be unique).

Easy solution (does nothing to maintain distribution of var2):
 replace var2=_n
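
A slightly fuller sketch of the same idea, with toy data. The copy
var2_orig and the sort on var2 and label are my additions (both
hypothetical, not part of the one-liner above): the copy keeps the
original values available, and the sort fixes which observation gets
which integer.

```stata
* Sketch only: toy data; var2_orig is a hypothetical name for the copy.
clear
input var2 label
0.5 1111
0.5 2222
0.5 3333
0.7 4444
end
* Keep the original values in case they are needed later.
generate double var2_orig = var2
* Fix the order first, so the identifiers are the same on every run.
sort var2 label
replace var2 = _n
* Stop with an error if var2 is not unique.
isid var2
```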

On 2/26/07, Sergiy Radyakin <[email protected]> wrote:

I agree with all the criticism below, except that looping is not required.
First, here is a counterexample to the code submitted by Austin Nichols:
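***----------------------------------------------------
* start from an empty dataset so that -input- can run
clear
input var2 label
1 1111
2 2222
3 3333
4 4444
end
* three identical values, and a fourth that is larger by c(epsfloat)
replace var2=c(epsfloat) in 1/3
replace var2=c(epsfloat)+c(epsfloat) in 4
li
egen oldgroup=group(var2)
bys var2: gen obs=_n
* obs 2 of the first group and obs 1 of the second both become 3*c(epsfloat)
replace var2=var2+c(epsfloat)*obs
* newgroup therefore has 3 distinct values rather than 4
egen newgroup=group(var2)
li, noo clean
***----------------------------------------------------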
