
From: Cindy Gao <cindy.gao@ymail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: ranking with weights
Date: Tue, 2 Dec 2008 21:33:42 +0000 (GMT)

Thank you very much, this helps a lot. However, I wonder if there is a small "error" or if I am just misunderstanding. Should your last line of code (replace rank = rank - 0.5 * totalfreq) perhaps apply only to observations that are tied (i.e. that have the same expenditure as other observations)? Otherwise, for example, the first observation in your example, which is not tied, is ranked as 9000 instead of its weight of 18000.

I therefore tried a small modification to your code (by expenditure: replace rank = rank - 0.5 * totalfreq if _N != 1). When I do this, the rank of the last observation (which is not tied) equals the sum of all the weights, whereas with your original the rank of the last observation is less than the sum of all the weights (less by half the weight of the last observation).

Now I am not sure whether to use my modification, or whether I am just confused and should stick with Nick's original suggestion.

Many thanks,
Cindy

----- Original Message ----
From: Nick Cox <n.j.cox@durham.ac.uk>
To: statalist@hsphsun2.harvard.edu
Sent: Tuesday, 2 December, 2008 19:45:41
Subject: RE: st: ranking with weights

The following example code with a toy dataset may help:

. list expenditure frequency

     +---------------------+
     | expend~e   freque~y |
     |---------------------|
  1. |     1000       8000 |
  2. |     1000      10000 |
  3. |     2000       6000 |
  4. |     2000       9000 |
  5. |     3000       8000 |
     |---------------------|
  6. |     3000       4000 |
  7. |     4000       7000 |
  8. |     4000       6000 |
  9. |     5000       6000 |
 10. |     6000       5000 |
     |---------------------|
 11. |     7000       4000 |
 12. |     8000       3000 |
 13. |     9000       2000 |
 14. |    10000       1000 |
     +---------------------+

. bysort expend : gen totalfreq = sum(frequency)

. by expend : replace totalfreq = totalfreq[_N]
(4 real changes made)

. by expend : gen first = _n == 1

. gen rank = sum(totalfreq * first)

. replace rank = rank - 0.5 * totalfreq
(14 real changes made)

. list

     +------------------------------------------------+
     | expend~e   freque~y   totalf~q   first    rank |
     |------------------------------------------------|
  1. |     1000       8000      18000       1    9000 |
  2. |     1000      10000      18000       0    9000 |
  3. |     2000       6000      15000       1   25500 |
  4. |     2000       9000      15000       0   25500 |
  5. |     3000       8000      12000       1   39000 |
     |------------------------------------------------|
  6. |     3000       4000      12000       0   39000 |
  7. |     4000       7000      13000       1   51500 |
  8. |     4000       6000      13000       0   51500 |
  9. |     5000       6000       6000       1   61000 |
 10. |     6000       5000       5000       1   66500 |
     |------------------------------------------------|
 11. |     7000       4000       4000       1   71000 |
 12. |     8000       3000       3000       1   74500 |
 13. |     9000       2000       2000       1   77000 |
 14. |    10000       1000       1000       1   78500 |
     +------------------------------------------------+

There is a little inaccuracy there: the average of ranks 1...18000 is strictly 9000.5, not 9000, so you may want to make the appropriate corrections.

Nick
n.j.cox@durham.ac.uk

Cindy Gao

The observations (analytic units) are households. Expenditure is the monthly expenditure of the household. This is household survey data. The weights are frequency weights, to weight the sample to the whole country. The weights are likely to vary across, for example, regions, to compensate for oversampling or undersampling.

Basically I need to rank all households according to their expenditure, from lowest to highest. But I must take account of the weightings. If, for example, there are 2 households with the same expenditure, they must be ranked the same, and this rank must take account of the weightings. If there were no ties (households with the same expenditure), I could achieve this by generating a variable "rank", like -g rank=sum(weight)-. The problem comes because of ties. If I could -expand- my dataset using the weights, then I could simply say -egen rank = rank(expenditure)-; the problem is that the dataset is too large for this.

Steven Samuels

Cindy,

What are the analytic units (people? regions?)? What are the "weights"? What is "expenditure"? How is it measured? What do you mean that some regions are "less sampled" than others? It's not clear, for example, if this is a sample, and if so, of what. So, please describe the study design in detail. Last question: what is the purpose of the ranking?

On Dec 2, 2008, at 12:54 PM, Cindy Gao wrote:

> I am trying to find a way to rank weighted data (since the egen function -rank- does not work with weights). A simple way would be to order the data by the variable I am interested in (monthly expenditure) and then create a new variable like -g rank1=sum(weight)-. But there is a problem. Some of my observations are "tied", as they have the same level of expenditure. Using the simple method I mention means that some observations are ranked above others even though they have the same level of expenditure. This is a problem because the weights are large, so you find that 2 observations are ranked with a big gap in between even though they have the same level of expenditure. It is an even bigger problem because the weights might be correlated with some other variables I am interested in (like region, since some regions are less sampled than others). I also tried multiplying the expenditure ranking by the weight, but this gives wrong results (for example, they do not add up to the weighted total). Can anyone help? In other words, I would like all observations with the same expenditure to have the same rank, which I assume would be some average of all the weighted observations having that same expenditure. I include a sample dataset below:
>
> expenditure  weighting  rank  rank1  weighted_rank
>          10        341     1    341            341
>          12       1065   2.5   1406            ???
>          12         98   2.5   1504
>          15        254     4   1758
> .......

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
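On the question of whether the half-weight subtraction should be restricted to tied observations, one possible reading (an editorial sketch, not something stated in the thread) is that with frequency weights each observation stands for `frequency` identical households, so even an observation with a unique expenditure is itself a block of tied households. On that reading, the rank of a block is the mean of the ranks it would occupy after -expand-, that is, the cumulative weight minus (totalfreq - 1)/2, which also picks up the 0.5 correction Nick mentions. The variable names `cumfreq`, `midrank` and `wrank` are made up for illustration.

```stata
* Toy data from the thread: expenditure and frequency weight
clear
input expenditure frequency
 1000  8000
 1000 10000
 2000  6000
 2000  9000
 3000  8000
 3000  4000
 4000  7000
 4000  6000
 5000  6000
 6000  5000
 7000  4000
 8000  3000
 9000  2000
10000  1000
end

* Total weight within each group of equal expenditure
bysort expenditure: egen double totalfreq = total(frequency)

* Cumulative weight up to and including each group
by expenditure: gen byte first = _n == 1
gen double cumfreq = sum(totalfreq * first)

* Assumed definition: mean of the ranks cumfreq-totalfreq+1 ... cumfreq,
* applied to every group, tied or not
gen double midrank = cumfreq - (totalfreq - 1) / 2

* Sanity check: weighted midranks should sum to N(N+1)/2, with N = 79000 here
gen double wrank = frequency * midrank
quietly summarize wrank
assert r(sum) == 79000 * 79001 / 2

list expenditure frequency totalfreq cumfreq midrank, sepby(expenditure)
```

In the toy data, rows 1 and 2 share expenditure 1000, so 18000 is their combined weight and both get 9000.5; the last row gets 78500.5 rather than 79000. Applying the subtraction only `if _N != 1` would give single-observation groups the top rank of their block while tied groups get the middle, and the weighted ranks would then no longer sum to N(N+1)/2.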

**Follow-Ups**:
- **RE: st: ranking with weights** *From:* "Nick Cox" <n.j.cox@durham.ac.uk>

**References**:
- **st: ranking with weights** *From:* Cindy Gao <cindy.gao@ymail.com>
- **Re: st: ranking with weights** *From:* Steven Samuels <sjhsamuels@earthlink.net>
- **Re: st: ranking with weights** *From:* Cindy Gao <cindy.gao@ymail.com>
- **RE: st: ranking with weights** *From:* "Nick Cox" <n.j.cox@durham.ac.uk>


© Copyright 1996–2014 StataCorp LP