Statalist The Stata Listserver

Re: Re: st: question on distribution of values

From   n j cox <>
Subject   Re: Re: st: question on distribution of values
Date   Tue, 10 Oct 2006 17:43:28 +0100

By "unique" here also understand "distinct".
This may be one of those mid-Atlantic linguistic problems
differentiating English and American.

After all, the Unix utility uniq, invented in New Jersey, removes
duplicate lines, and leaves just one copy of each of the distinct lines in a file. It does not identify lines that occur just once in the file.
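
The same distinction can be sketched in Stata itself. The rows below are made up for illustration, and the variable names -first- and -once- are hypothetical: -first- flags one observation per distinct value (the uniq sense), while -once- flags values that occur exactly once.

```stata
* Toy data, not from the thread: zip 00011 occurs twice, the others once
clear
input str5 zip
"00011"
"00011"
"00012"
"00013"
end

* "Distinct" in the uniq sense: flag one observation per value
bysort zip : gen byte first = _n == 1

* "Unique" in the strict sense: flag values that occur only once
by zip : gen byte once = _N == 1

* Three distinct values, of which two occur just once
count if first
count if once
```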

Tim's suggestion is illegal in Stata, as only one -egen- function is allowed on the RHS of an -egen- command.

It would not be correct if it were legal, as -egen, count()- does not
count distinct values. There is a function in -egenmore- from SSC that does, but official Stata suffices here.
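
To see both points, here is a minimal sketch of Tim's idea made legal by splitting it into one -egen- call per function. The rows are hypothetical (a zip deliberately repeated within an order), and -meanqt-, -nobs- and -qt2- are names chosen only for this illustration.

```stata
* Hypothetical rows: zip 00011 appears twice within order 1
clear
input order qt str5 zip
1 5 "00011"
1 5 "00011"
1 5 "00012"
end

* One -egen- function per call, so compute the pieces separately
egen meanqt = mean(qt), by(order)
egen nobs = count(qt), by(order)
gen qt2 = meanqt / nobs

* nobs counts nonmissing observations (3), although only 2 zips are distinct
list order zip nobs qt2
```

So the legal version divides by the number of observations, which matches the number of distinct zips only when no zip is repeated within an order.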

First, tag each distinct co-occurrence of -order- and -zip-

egen tag = tag(order zip)

Now sum within -order-

egen distinct = sum(tag), by(order)

or, in Stata 9 and later, where that -egen- function is named -total()-,

egen distinct = total(tag), by(order)

Now you are home and dry

gen average_pkg_per_zip = qt / distinct
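
Put together on the example data from the original question, the steps might look like the sketch below. It assumes that zip is held as a string (to keep the leading zeros), that the variables are named -order-, -qt- and -zip-, and that -total()- is available (Stata 9 or later).

```stata
* Example data from the question; zip assumed to be a string variable
clear
input order qt str5 zip
1 5 "00011"
1 5 "00012"
1 5 "00013"
1 5 "00014"
2 3 "00021"
2 3 "00023"
3 8 "00031"
3 8 "00035"
3 8 "00036"
end

* Tag one observation per distinct order-zip pair, then total within order
egen tag = tag(order zip)
egen distinct = total(tag), by(order)
gen average_pkg_per_zip = qt / distinct

* Orders 1, 2 and 3 have 4, 2 and 3 distinct zips respectively
list order qt zip distinct average_pkg_per_zip, sepby(order)
```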

It took me several years to realise that the -nvals()-
function in -egenmore- was pretty much redundant
given the -tag()- function of -egen- that I introduced
earlier (although I did not really invent it).
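
For comparison, a sketch of the two routes side by side. It assumes that -egenmore- has been installed from SSC and that -nvals()- takes the form nvals(varname), by(byvarlist), as I understand its help file. The rows are hypothetical, with zip held as a numeric variable purely to keep the sketch simple.

```stata
* Requires the user-written egenmore package: ssc install egenmore
* Hypothetical rows; zip is numeric here only for simplicity
clear
input order qt zip
1 5 11
1 5 11
1 5 12
2 3 21
end

* Official Stata route: tag distinct pairs, then total within order
egen tag = tag(order zip)
egen distinct = total(tag), by(order)

* egenmore route: count distinct values of zip within order
egen distinct2 = nvals(zip), by(order)

* The two counts should agree in every observation
assert distinct == distinct2
```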

Without -egen- this is

* flag only the first observation of each order-zip pair
bysort order zip : gen byte tag = _n == 1
* running sum of the flags within order, then spread the final total
by order : gen distinct = sum(tag)
by order : replace distinct = distinct[_N]
gen average_pkg_per_zip = qt / distinct
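
As a check on the -by- route, here is a small sketch on made-up rows in which one zip is repeated within an order. The point it illustrates is that the flag must be 1 only for the first observation of each order-zip pair; otherwise the sum within -order- counts observations, not distinct zips.

```stata
* Hypothetical rows: zip 00011 appears twice within order 1
clear
input order qt str5 zip
1 5 "00011"
1 5 "00011"
1 5 "00012"
2 3 "00021"
end

* Flag is 1 for the first observation of each order-zip pair, else 0
bysort order zip : gen byte tag = _n == 1

* Running sum within order; the last value is the number of distinct zips
by order : gen distinct = sum(tag)
by order : replace distinct = distinct[_N]
gen average_pkg_per_zip = qt / distinct

* Order 1 has 3 observations but only 2 distinct zips
list order zip tag distinct average_pkg_per_zip, sepby(order)
```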


Timothy Mak wrote:

What about:

bysort order: egen qt2 = mean(Qt) / count(Qt)

Andrea King wrote:

Here's an example of the data I'm working with:

Order#  Qt  Zip
1       5   00011
1       5   00012
1       5   00013
1       5   00014
2       3   00021
2       3   00023
3       8   00031
3       8   00035
3       8   00036

Here are my problems:

1. The quantity of packages (Qt) listed does not correspond directly to
the zip code. For example, Order #1 requested 5 packages, to be
distributed among four zip codes, that is, 1.25 packages per unique zip,
not 5 packages per zip code.

2. I have yet to find the correct syntax that would allow me to create a
variable that would show the distribution of Qt among the zip codes. I've played with egen, but can't get it to work.

So my question is:

How can I take one value of Qt (or, if needed, an average of Qt) within
each unique Order# and divide it by the number of unique zip codes in
that order? Also, if it helps, the order number is listed each
time the zip code changes, so a count of Order# would probably work, too,
but I'd prefer to do it by Zip.