Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


RE: st: data management question


From   Joe Canner <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: data management question
Date   Thu, 19 Sep 2013 14:15:22 +0000

Caroline,

In addition to looking at the link Richard suggested, which involves looping over the observations and calculating the statistic for all observations except the current one, you might be able to get away with something simpler and quicker. Deleting one observation from a data set doesn't change the median much: it just moves to the next higher or lower observed value, depending on which half of the data set you delete from. To illustrate with the auto data:

. webuse auto   // _N=74
. sort price
. gen medex=price[38] if _n<38   
. replace medex=price[37] if _n>=38
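
As a check on the shortcut above, here is a minimal sketch of the brute-force alternative: a loop that recomputes the median with each observation left out in turn, written as a do-file. The variable name medloop is only an illustration, and this is not claimed to be the code behind the link mentioned earlier. It runs -summarize- once per observation, so I would only assume it is practical for small data sets.

```stata
* Rebuild the shortcut from the example above
webuse auto, clear                      // _N = 74
sort price
gen medex = price[38] if _n < 38
replace medex = price[37] if _n >= 38

* Brute force: median of price over all observations except observation i
gen double medloop = .                  // hypothetical name, for illustration
forvalues i = 1/`=_N' {
    quietly summarize price if _n != `i', detail
    quietly replace medloop = r(p50) in `i'
}

* Stops with an error if the two approaches ever disagree
assert medex == medloop
```

If the final -assert- passes silently, the shortcut and the loop agree for every observation.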

This shortcut can be generalized for data sets of any size, but note that the rule is slightly different for data sets with an odd number of observations: leaving one out then leaves an even number, so the median becomes the average of the two middle values. For example, say you delete one observation from the auto data (now there are 73):

. capture drop medex   // in case medex exists from the first example
. gen medex=(price[37]+price[38])/2 if _n<37
. replace medex=(price[36]+price[37])/2 if _n>37
. replace medex=(price[36]+price[38])/2 if _n==37

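To do this with -by- groups in your data set, use -bys ID (Md_T)- (in your case the ID is phy_id) so that the calculations are done within each ID and the observations are sorted in the proper order for calculating medians. Under -by-, _n, _N and subscripts refer to positions within each group, so the fixed positions 37 and 38 above would be replaced by expressions in _N.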
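
Putting the even and odd rules together within -by- groups might look like the do-file sketch below, which uses the sample data from the original question. The helper variables k and odd and the result name medex are my own illustrative names, and the sketch assumes Md_T has no missing values (missing values sort last and would be counted in _N).

```stata
clear
input pat_ID phy_id Md_T
 1 102 3.23
 2 118 1.85
 3 118 3.48
 4 118 4.12
 5 132 1.39
 6 132 1.61
 7 132 1.69
 8 132 1.74
 9 132 3.00
10 132 3.03
11 132 4.28
12 132 6.90
end

* Sort by Md_T within physician; under -by-, _n, _N and subscripts
* such as Md_T[k] all refer to positions within the group.
bysort phy_id (Md_T): gen k = ceil(_N/2)    // middle position (hypothetical helper)
by phy_id: gen byte odd = mod(_N, 2)        // 1 if the group size is odd
gen double medex = .                        // assumes no missing Md_T

* Even group size: leaving one out gives an odd count, so one middle value
by phy_id: replace medex = Md_T[k+1] if !odd & _n <= k
by phy_id: replace medex = Md_T[k]   if !odd & _n >  k

* Odd group size: leaving one out gives an even count, so average two values
by phy_id: replace medex = (Md_T[k]   + Md_T[k+1])/2 if odd & _n <  k
by phy_id: replace medex = (Md_T[k-1] + Md_T[k+1])/2 if odd & _n == k
by phy_id: replace medex = (Md_T[k-1] + Md_T[k])/2   if odd & _n >  k

list pat_ID phy_id Md_T medex, sepby(phy_id)
```

For phy_id 118 this gives 3.80, 2.985 and 2.665 for pat_IDs 2, 3 and 4. For phy_id 132, pat_IDs 5 through 8 get 3.00 and pat_IDs 9 through 12 get 1.74, because removing any of the four smallest values leaves 3.00 in the middle of the remaining seven. The single-observation group (phy_id 102) stays missing because Md_T[0] and Md_T[2] do not exist there.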

OK, so maybe it's not simpler, but it's an interesting exercise, nonetheless... :)

Regards,
Joe

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Caroline Wilson
Sent: Thursday, September 19, 2013 1:28 AM
To: [email protected]
Subject: RE: st: data management question

Sorry to ask another question about this. I'm now struggling to create a variable called "median" which, for a given pat_ID, would be calculated by taking the MEDIAN of every other value of Md_T in the same phy_ID EXCEPT for the current value of Md_T. Below I show a sample of my data and what "median" should look like.
So for example: for pat_ID = 2, the phy_id=118. So I take the median value of Md_T for the other 2 pat_IDs belonging to phy_id=118 (median of 3.48 & 4.12, which is 3.8). For pat_ID=3, the phy_id=118. So I take the median value of Md_T for the other 2 pat_IDs belonging to phy_id=118 (1.85&4.12), which is 2.99.
I tried using similar logic as in Daniel's code for the mean, however the formula for the median is more complex than the mean formula (e.g. it depends on whether the total number of values is odd or even). Does anyone have ideas about how to calculate this? For example, maybe there is a way to use the median function just on every other value of the same phy_id but the current?
Any help would be much appreciated. Many thanks!!!

pat_ID    phy_id    Md_T  median
1          102       3.23     .
2          118       1.85   3.80
3          118       3.48   2.99
4          118       4.12   2.67
5          132       1.39   3.00
6          132       1.61   3.00
7          132       1.69   3.00
8          132       1.74   3.00
9          132       3.00   1.74
10         132       3.03   1.74
11         132       4.28   1.74
12         132       6.90   1.74

> From: [email protected]
> To: [email protected]
> Subject: RE: st: data management question
> Date: Wed, 18 Sep 2013 22:54:32 +0000
> 
> Apologies for the confusion - the variable "value" should have been called "Md_T".
> Anyway, your solution worked perfectly - very many thanks!!!!
> 
> Caroline
> 
> ----------------------------------------
>> Date: Wed, 18 Sep 2013 18:33:19 -0400
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: st: data management question
>>
>>
>>
>> On Wed, 18 Sep 2013, Caroline Wilson wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> I'm wondering if someone can help with a data management question.
>>>
>>>
>>>
>>> I'm trying to create a variable called "mean", which, for a given
>>> pat_ID, would be calculated by taking the mean of every other value of
>>> "Md_T" in the same phy_ID EXCEPT for the current row.
>>>
>>
>> I am a little unclear on what you are asking - Md_T isn't in the sample
>> data you show, but you want the mean of it? So I won't use your variable
>> names. Nevertheless, I think that the -egen- -total- function and a
>> generate statement will get you what you want:
>>
>> bysort ID: egen sum=total(var)
>> generate sumex= sum-var
>> by ID: generate meanex = sumex/(_N-1)
>>
>> The total by ID gives the sum of var for each level of ID. Then we
>> subtract the current value of var and divide by the number of other
>> observations in the ID group, _N-1. (_N is the number of observations in the by group).
>>
>> Daniel Feenberg 
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/ 		 	   		  
