Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Routine from do-file that every time it's run gives a different result
From: Sergiy Radyakin <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: Routine from do-file that every time it's run gives a different result
Date: Thu, 7 Nov 2013 12:48:20 -0500
Clarice,
The following discussion covers how Excel computes quartiles:
http://stats.stackexchange.com/questions/28123/quartiles-in-excel
In general, don't expect different statistical packages to break your
observations into groups (quartiles, quintiles, deciles) identically.
This applies not only to Excel, but also to SPSS, SAS, etc.:
http://www-01.ibm.com/support/docview.wss?uid=swg21480663
http://www.erieri.com/blog/post/technically-speaking-does-excel-always-know-what-is-best-for-your-compensation-data
There are many other discussions of this; a web search will turn them up.
Multiple methods exist, and the defaults are not always identical
across packages.
In some cases it might be better to be explicit, and sort and break
the dataset into groups yourself, rather than rely on the canned
percentile functions. It is also worth reading your code line by line
and checking that it implements exactly what you want it to do.
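
A minimal sketch of breaking the data into groups by hand, assuming Stata's bundled auto dataset, with -price- standing in for the variable to be grouped and -make- as the tie-breaker (both are stand-ins, not Clarice's variables). Unlike -xtile-, this rule can put tied values into different groups; the point is only that the rule is written down and reproducible.

```stata
* Sketch only: auto.dta, price and make are stand-ins for the real data.
sysuse auto, clear
* make is a unique identifier, so sorting on price and then make leaves no ties.
isid make
sort price make
* Rank the observations 1.._N, then cut the ranks into five near-equal groups.
gen long rank = _n
gen byte q_manual = ceil(5 * rank / _N)
* Compare with what -xtile- does on the same variable.
xtile q_xtile = price, nquantiles(5)
tabulate q_manual q_xtile
```

Any cell off the diagonal of the table is an observation that the two rules classify differently.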
Hope this helps, Sergiy
On Thu, Nov 7, 2013 at 12:03 PM, Nick Cox <[email protected]> wrote:
> -xtile- is undoubtedly problematic -- as it reduces the information in
> your data and isn't guaranteed to produce equal-sized groups even
> when the number of observations is an exact multiple of the number of
> groups. But one of its rules is that observations with the same value
> always go into the same group. And -xtile- is written -sortpreserve-
> so it doesn't change the sort order of your data.
> Nick
> [email protected]
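
The point about ties and group sizes can be checked on made-up data. A small sketch, assuming 20 artificial observations of which nine share the same value (the variable names x and q are hypothetical):

```stata
* Sketch only: artificial data with nine tied values at the low end.
clear
set obs 20
gen x = cond(_n <= 9, 1, _n)
xtile q = x, nquantiles(5)
* Group sizes need not be equal, even though 20 is a multiple of 5.
tabulate q
* Observations with the same value always land in the same group.
bysort x (q): assert q[1] == q[_N]
```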
>
>
> On 7 November 2013 16:47, Clarice Martins <[email protected]> wrote:
>> Thanks to all for the valuable input...
>>
>> Sarah, thanks for the practical tips on how to troubleshoot, I am definitely very new at any kind of programming and needed this kind of advice.
>>
>> Nick, I agree that it is important to verify where the error is. The -stable- option might help me, but I will definitely keep digging to figure out where my hidden assumptions are.
>>
>> Just another question on this issue:
>> Another portion of the code uses -xtile- to break the portfolio of returns into quintiles. (At first, I didn't think it was important.)
>>
>> But... could there also be a problem with how -xtile- breaks the dataset into groups? I mean, even when I did this manually in Excel, it was always difficult to decide how many observations go in each quintile group. (E.g., if the dataset has 21 observations, we will have 4 groups of 4 and one group of 5; which group takes the extra observation?)
>>
>> Thanks again!!!
>> Clarice
>>
>>
>> On Nov 6, 2013, at 8:11 PM, Sarah Edgington wrote:
>>
>>> Clarice,
>>> Nick's right that you need to do more digging. However, I would argue that
>>> the solution of using the stable option to -sort- is worse than "[solving]
>>> the problem with the price of not understanding it." Using -sort, stable-
>>> is actually just pretending that there is not a problem at all. Yes, that
>>> strategy will get you consistent results, but the chances that they'll be
>>> the right results are pretty slim. Being able to reproduce the wrong answer
>>> is generally just as bad as not being able to reproduce the answer at all.
>>>
>>> To be a bit more explicit, what sort order you end up with clearly matters
>>> for your results. You need to figure out why the variables you're sorting
>>> on are not producing a unique order, and figure out how to fix that.
>>> Using -sort, stable- may very well appear to fix your problem but presumably
>>> you care whether the average of P5 is 6.154 or 3.286. If you don't do more
>>> investigation you'll never know which of those is the number you're really
>>> looking for (or whether it's something else completely).
>>>
>>> One thing that I find useful when troubleshooting this kind of problem is to
>>> use -sum- after every section where I create new variables with values where
>>> sort order matters. Then I'll run the dofile multiple times, saving a
>>> logfile with a different name each time. Usually you can pretty quickly
>>> spot where things went wrong by comparing the log files from two different
>>> runs, as long as you put in descriptive statistics of your created
>>> variables along the way.
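
A sketch of that logging routine, assuming the auto dataset as a stand-in and hypothetical names (run1.log, sortcheck, first_price). The log name would be changed by hand on each run, and the resulting files compared afterwards.

```stata
* Sketch only: run1.log is a hypothetical name; use run2.log, run3.log, ... on later runs.
log using run1.log, text replace name(sortcheck)
sysuse auto, clear
* foreign is not unique, so the order within each group is arbitrary.
sort foreign
* This new variable depends on which observation happens to come first.
by foreign: gen first_price = price[1]
* Summarize right after creating it, so differences between runs show up in the log.
summarize first_price
log close sortcheck
```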
>>>
>>> Another useful command when trying to identify whether you're uniquely
>>> sorting observations is -isid-. Any combination of variables that doesn't
>>> function as a unique ID will leave you with ties on the sort, leading to the
>>> kind of unpredictable results you see here.
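
A short sketch of such a check, again assuming the auto dataset as a stand-in for the real data:

```stata
* Sketch only: auto.dta stands in for the real data.
sysuse auto, clear
* foreign alone does not identify observations, so -isid- reports an error.
capture noisily isid foreign
* -duplicates report- shows how many observations share each key value.
duplicates report foreign
* make is a unique ID: -isid- is silent and the sort below has no ties.
isid make
sort foreign make
```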
>>>
>>> -Sarah
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Nick Cox
>>> Sent: Wednesday, November 06, 2013 1:51 PM
>>> To: [email protected]
>>> Subject: Re: st: Routine from do-file that every time it's run gives a
>>> different result
>>>
>>> But that solves the problem with the price of not understanding it.
>>> Somewhere Clarice has hidden assumptions about the -sort- order being enough
>>> to get the right order without extra information that are not correct.
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 6 November 2013 21:46, Sergiy Radyakin <[email protected]> wrote:
>>>> Clarice, add the option stable to the sort commands. Without this
>>>> option, the -sort- command will break the ties randomly. See here:
>>>> http://www.stata.com/help.cgi?sort
>>>>
>>>> Best, Sergiy
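
What -stable- does can be checked on a small example. A sketch, assuming the auto dataset and a hypothetical helper variable orig; it shows only that tied observations keep their incoming order, not that this order is the one an analysis needs.

```stata
* Sketch only: auto.dta and the variable orig are stand-ins.
sysuse auto, clear
* Remember the incoming order.
gen long orig = _n
* With -stable-, tied observations keep their previous relative order.
sort foreign, stable
* Within each group the original order is therefore still increasing.
by foreign: assert orig > orig[_n-1] if _n > 1
```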
>>>>
>>>> On Wed, Nov 6, 2013 at 4:30 PM, Clarice Martins
>>>> <[email protected]> wrote:
>>>>> Dear Statalist group,
>>>>>
>>>>> I have a routine that apparently was running ok, and then I noticed that
>>>>> every time I execute the code I get different results for one of the
>>>>> variables.
>>>>> (The routine is long, so I don't know how to best provide you guys
>>>>> with enough info.)
>>>>>
>>>>> 1) I believe the problem has to do with variable -P5-, since this is the
>>>>> variable whose average changes every time I run the code.
>>>>>
>>>>> 2) Sample of the results I am getting: as you can see, variable P1
>>>>> is always approximately the same (it should be the same) and variable
>>>>> Strategy is ALWAYS the same, but var -P5- changes by a lot. (I've
>>>>> shown two outputs, but I've run it several, several times.)
>>>>>
>>>>>
>>>>> . esttab .
>>>>>
>>>>> ----------------------------
>>>>>                       (1)
>>>>>                      Mean
>>>>> ----------------------------
>>>>> P1                  0.300***
>>>>>                    (3.41)
>>>>>
>>>>> P5                  6.154
>>>>>                    (1.53)
>>>>>
>>>>> strategy            7.190
>>>>>                    (1.78)
>>>>> ----------------------------
>>>>> N                     150
>>>>> ----------------------------
>>>>>
>>>>>
>>>>> ----------------------------
>>>>>                       (1)
>>>>>                      Mean
>>>>> ----------------------------
>>>>> P1                  0.223*
>>>>>                    (2.24)
>>>>>
>>>>> P5                  3.286
>>>>>                    (1.15)
>>>>>
>>>>> strategy            7.190
>>>>>                    (1.78)
>>>>> ----------------------------
>>>>> N                     150
>>>>> ----------------------------
>>>>>
>>>>> 3) Piece of the code that deals with creating and changing variable
>>>>> P5: (my apologies if this is confusing or too long)
>>>>>
>>>>> ***create variable P1/P5 and sum all 1st/5th quintiles per <yrmonth>
>>>>> gen P1_sell = .
>>>>> quietly levelsof yrmonth, local(levs)
>>>>> quietly foreach lev of local levs {
>>>>>     egen work=total(return) if rtype=="buy_sell_period" & yrmonth == "`lev'" & quintile==1
>>>>>     replace P1_sell=work if rtype=="buy_sell_period" & yrmonth == "`lev'" & quintile==1
>>>>>     drop work
>>>>> }
>>>>>
>>>>> gen P5_buy = .
>>>>> quietly levelsof yrmonth, local(levs)
>>>>> quietly foreach lev of local levs {
>>>>>     egen work=total(return) if rtype=="buy_sell_period" & yrmonth == "`lev'" & quintile==5
>>>>>     replace P5_buy=work if rtype=="buy_sell_period" & yrmonth == "`lev'" & quintile==5
>>>>>     drop work
>>>>> }
>>>>>
>>>>> sort quintile yrmonth rtype
>>>>>
>>>>> **undo the buy/sell operation
>>>>> *in order to do the procedure, first copy quintile #s to same <co_id> but for 6 <yrmonth> LATER
>>>>>
>>>>> bysort co_id period: egen tocopy2 = total(quintile / (rtype == "buy_sell_period"))
>>>>> bysort co_id rtype (negperiod) : replace quintile = tocopy2[_n+6] if missing(quintile) & rtype == "hold_period"
>>>>> sort quintile yrmonth rtype
>>>>>
>>>>> *add sums of 1st/5th quintiles for <hold_period> to variables P1/P5
>>>>>
>>>>> quietly levelsof yrmonth, local(levs)
>>>>> quietly foreach lev of local levs {
>>>>>     egen work=total(return) if rtype=="hold_period" & yrmonth == "`lev'" & quintile==5
>>>>>     replace P1_sell=work if rtype=="hold_period" & yrmonth == "`lev'" & quintile==5
>>>>>     drop work
>>>>> }
>>>>>
>>>>> quietly levelsof yrmonth, local(levs)
>>>>> quietly foreach lev of local levs {
>>>>>     egen work=total(return) if rtype=="hold_period" & yrmonth == "`lev'" & quintile==1
>>>>>     replace P5_buy=work if rtype=="hold_period" & yrmonth == "`lev'" & quintile==1
>>>>>     drop work
>>>>> }
>>>>> sort quintile yrmonth rtype
>>>>>
>>>>>
>>>>> ***------procedures for Strategy analysis
>>>>> **preparing time-series
>>>>> *P1 is the variable to use for the time-series / keep -P1_sell- intact just for the sake of it
>>>>>
>>>>> gen P1 = P1_sell
>>>>> gen copyP1=P1
>>>>> replace P1 = . if P1 == copyP1[_n-1]
>>>>> drop copyP1
>>>>>
>>>>> *P5 is the variable to use for the time-series / keep -P5_buy- intact just for the sake of it
>>>>>
>>>>> gen P5 = P5_buy
>>>>> gen copyP5=P5
>>>>> replace P5 = . if P5 == copyP5[_n-1]
>>>>> drop copyP5
>>>>>
>>>>> *keeping only time-series variables & unique records
>>>>> keep P1 P5 period
>>>>>
>>>>> sort period P1 P5
>>>>> quietly by period P1 P5: gen dup = cond(_N==1,0,_n)
>>>>> drop if dup>0
>>>>> drop dup
>>>>>
>>>>> sort period P1 P5
>>>>> gen P5copy = P5
>>>>> replace P5 = P5copy[_n+1] if P5 >= .
>>>>> replace P5 = P5copy[_n+3] if P5 >= .
>>>>> drop P5copy
>>>>>
>>>>> sort period
>>>>> quietly by period: gen dup = cond(_N==1,0,_n)
>>>>> drop if dup>2
>>>>> drop dup
>>>>>
>>>>> gen temp = P1 + P5
>>>>> drop if temp >= .
>>>>> drop temp
>>>>>
>>>>> by period: egen strategy=total(P1 + P5)
>>>>>
>>>>> sort strategy
>>>>> quietly by strategy: gen dup = cond(_N==1,0,_n)
>>>>> drop if dup>1
>>>>> drop dup
>>>>>
>>>>> sort period
>>>>>
>>>>> ** changing into a time-series // not sure if it is necessary yet...
>>>>> tsset period
>>>>> mean P1 P5 strategy
>>>>> ******end of code
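
The per-month sums in the code above can also be computed without loops and without comparing neighbouring rows, which removes one dependence on sort order. A sketch on toy data, assuming the same variable names as above; the data values and the helper variable first are made up for illustration and this is not a drop-in replacement for the whole routine.

```stata
* Sketch only: toy data reusing the variable names from the code above.
clear
input str6 yrmonth str16 rtype byte quintile double return
"201301" "buy_sell_period" 1 0.10
"201301" "buy_sell_period" 1 0.20
"201301" "buy_sell_period" 5 0.30
"201302" "buy_sell_period" 1 0.05
"201302" "buy_sell_period" 5 0.15
"201302" "buy_sell_period" 5 0.25
end
* One -egen- call with by() replaces each -levelsof- / -foreach- loop.
egen P1_sell = total(return) if rtype == "buy_sell_period" & quintile == 1, by(yrmonth)
egen P5_buy  = total(return) if rtype == "buy_sell_period" & quintile == 5, by(yrmonth)
* Flag one observation per month and quintile instead of testing against [_n-1].
egen byte first = tag(yrmonth quintile)
list yrmonth quintile P1_sell P5_buy if first
```

Because the sums are constant within each month and quintile, it does not matter which observation -tag()- happens to flag.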
>>>>>
>>>>> Thanks for your consideration! Any comments or suggestions will be
>>>>> appreciated.
>>>>> Clarice
>>>>>
>>>>>
>>>>> *
>>>>> * For searches and help try:
>>>>> * http://www.stata.com/help.cgi?search
>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>> * http://www.ats.ucla.edu/stat/stata/