# Re: st: Help on data manipulation

 From Phil Schumm To statalist@hsphsun2.harvard.edu Subject Re: st: Help on data manipulation Date Mon, 27 Sep 2004 19:39:08 -0500

```At 5:29 PM -0400 9/27/04, MITRA PINAKI  (MAR1PXM) wrote:
```
```    I have a dataset on multiple customer's revenue information like the
following:

Id    customer    revenue
Z11    xx1         y1
Z21    xx1         y2
Z31    xx1         y3
Z12    xx2         k1
Z22    xx2         k2
Z32    xx2         k3  and so on.

I need to create a new data based on the following criteria. If the sum of
revenue by customer is <=999, I need to write
Customer    revenue
xx1         sum of y
xx2         sum of k

But if the sum of revenue by customer is >999 and <=1998, I need to write
Id     Customer    revenue
001    xx1         999
002    xx1         sum of y-999
001    xx2         999
002    xx2         sum of k-999

and if the sum of revenue by customer is >1998 and <=2997, I need to write
Id     Customer    revenue
001    xx1         999
002    xx1         999
003    xx1         sum of y-1998
001    xx2         999
002    xx2         999
003    xx2         sum of k-1998

Can anyone please provide me any help on this?
```

Here's one alternative:

collapse (sum) revenue, by(customer)
expand ceil(revenue/999)
bys customer: replace revenue = ((_n==_N)*mod(revenue,999)) + ((_n!=_N)*999) if 999<revenue

Briefly, -collapse- computes the total revenue for each customer, resulting in a dataset with one record per customer. -expand- then creates the additional records (e.g., if a customer's total is 2000, this will create two additional records for a total of 3). Finally, the last line replaces revenue either with the value 999 (all records but the last one for each customer) or with the remainder once the multiples of 999 are accounted for (last record for each customer). If you wish, you could then generate your ids for each record (consecutive integers within customer in leading-zero format) using the following:

gsort + customer - revenue
by customer: gen id = string(_n,"%03.0f")

Note that with this solution, records with missing values for revenue in the original dataset will essentially be ignored. However, if a customer has nothing but missing values in the original dataset, the resulting dataset will contain a single record with a revenue of zero (this is because -collapse- treats missing values as zeros). If this is not what you want, you'll need to modify the code accordingly.

-- Phil
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/