Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


RE: st: Observations that keep a feature...


From   "Miguel Angel Duran Munoz" <[email protected]>
To   [email protected]
Subject   RE: st: Observations that keep a feature...
Date   Thu, 23 May 2013 19:43:32 +0200 (CEST)

Nick and Sarah, thanks to your help I've been able to solve all but one of
my problems. To select agents that are above the threshold after period 2,
I've finally used:

egen firstperiod = min(period), by(agent)
drop if firstperiod > 2
bysort agent (period): gen first2 = _n < 3
egen min_rest = min(score / !first2), by(agent)
keep if min_rest >= 0.9

(I think the max condition that Nick suggested is unnecessary.)
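
For readers unfamiliar with the division trick in the -egen- line above: it relies on Stata returning missing when dividing by zero, and on -egen-'s min() ignoring missing values. The following is only a minimal sketch on made-up data; the two toy agents and the helper variable rest_score are illustrative assumptions, not part of the original dataset.

```stata
* Toy data (hypothetical): two agents observed in three periods each
clear
input agent period score
1 1 0.8
1 2 1
1 3 1
2 1 0.8
2 2 0.8
2 3 0.8
end

* Flag each agent's first two observations (not necessarily periods 1 and 2)
bysort agent (period): gen first2 = _n < 3

* !first2 is 1 outside the first two observations and 0 inside them,
* so score / !first2 equals score outside and is missing inside
gen rest_score = score / !first2

* min() skips the missings, so this is the minimum over the remaining observations
egen min_rest = min(rest_score), by(agent)

list, sepby(agent)
```

Note that first2 marks observations rather than calendar periods, so for an agent who enters in period 2 it covers periods 2 and 3.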

Nevertheless, I am not sure how to select agents that cross the
threshold in the later periods (say, at or after t3) and stay above it.
Based on your suggestions, I thought of this:

bysort agent (period): gen last=score[_N]
bysort agent (period): gen first2 = _n < 3
egen min_rest = min(score / !first2), by(agent)
keep if last>=0.9 & min_rest<=0.9

However, this excludes agents that satisfy the criterion (crossing the
threshold at or after t3) but first appear in the sample at an
intermediate period.

Could someone please help me solve this? Thanks in advance.

Miguel.
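
One possible way to handle late entrants, offered only as a sketch and not as the approach anyone in this thread proposed: compare the first period in which an agent is at or above the threshold with the last period in which the agent is below it. The variable names first_above and last_below are hypothetical, period is assumed to be numeric, score is assumed to have no missing values, and the rows below are a subset of the example data quoted further down.

```stata
* Subset of the quoted example data: agents 1, 4, 5 and 7
clear
input agent period score
1 1 0.8
1 2 1
1 3 1
1 4 1
1 5 1
1 6 1
4 1 0.8
4 2 0.8
4 3 0.8
4 4 0.8
4 5 1
4 6 1
5 6 1
7 1 0.8
7 2 1
7 3 1
7 4 0.8
7 5 0.8
7 6 1
end

* First period at or above the threshold (missing if never above)
egen first_above = min(period / (score >= 0.9)), by(agent)

* Last period below the threshold (missing if never below)
egen last_below = max(period / (score < 0.9)), by(agent)

* Keep agents that first reach the threshold at or after period 3
* and are never below it afterwards; missings are handled explicitly
* because Stata treats missing as larger than any number
keep if first_above >= 3 & !missing(first_above) ///
    & (missing(last_below) | last_below < first_above)
```

On these rows I would expect agents 4 and 5 to be kept: agent 1 first reaches the threshold in period 2, and agent 7 falls below it again after first reaching it. Because the rule does not count observations from the start of each agent's record, an agent that enters late (such as agent 5) is not excluded merely for entering late.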

> Sarah, thank you for your help. I am very sorry for not having stated my
> questions clearly enough. Given what you say about the way the data are
> stored, I have realized that there might be other problems as well.
> I will try to be as clear as possible.
>
> My data are in panel data form. I write the example down again in the way
> my data are stored. Relative to the example in my previous messages, I add
> two agents (6 and 7). Please note also that the data for the fifth agent
> are missing in some periods, and there is no line corresponding to those
> periods (this is what I had not taken into account so far):
>
> time  agent   score
> t1     1      0.8
> t2     1      1
> t3     1      1
> t4     1      1
> t5     1      1
> t6     1      1
>
> t1     2      0.8
> t2     2      0.8
> t3     2      1
> t4     2      1
> t5     2      1
> t6     2      1
>
> t1     3      0.8
> t2     3      0.8
> t3     3      0.8
> t4     3      1
> t5     3      1
> t6     3      1
>
> t1     4      0.8
> t2     4      0.8
> t3     4      0.8
> t4     4      0.8
> t5     4      1
> t6     4      1
>
> t6     5      1
>
> t1     6      0.8
> t2     6      0.8
> t3     6      0.8
> t4     6      0.8
> t5     6      1
> t6     6      1
>
> t1     7      0.8
> t2     7      1
> t3     7      1
> t4     7      0.8
> t5     7      0.8
> t6     7      1
>
> Having said that, I want to split the sample in different ways. First, I
> want to focus on agents that are above a threshold (eg, 0.9) from the
> first period and always stay above the threshold (ie, agent 1). Second, I
> want to take agents that cross the threshold at or before a particular
> period (eg, t3) and stay above it from then on (ie, agents 1-4). Third,
> agents that cross the threshold at or after a particular period (eg, t5)
> and stay above it from then on (ie, agents 5 and 6). Please note that
> agent 7 is not included in any of these subsamples.
>
> Thank you very much for your help. And once again, I am sorry for not
> having been clear enough.
>
> Miguel.
>
>
>
>
>> Miguel,
>> This discussion would be clearer if your examples actually made it clear
>> exactly what your data looks like.
>>
>> Your example below looks like you have data in wide form.  The solution
>> that Nick suggested is for data in long form.  It's easy enough to move
>> between the two, but it's hard to make concrete suggestions about how to
>> proceed when we don't know what the actual data looks like.
>>
>> I'll start by assuming, as Nick does, that your data is actually in long
>> form and you have three variables: agent, period, score.  I'll further
>> assume that for agent 5 you simply have no records for periods 1-5 (that
>> is, you do not have records for those periods with missing values for
>> score).  If that's true, you can simply calculate the first period that
>> appears in the data and use that as part of your inclusion criteria.
>> Something like the following will keep only those agents who first
>> appear in the data no later than period 4:
>> egen firstperiod=min(period), by(agent)
>> drop if firstperiod>4
>>
>> Or maybe you only want to include agents who start in period 1?  It's
>> unclear from your question.  In that case you'd -drop if firstperiod>1-
>>
>> For your second example, trying to look at the last time periods, I
>> think you need to clarify what your actual criteria are.  You say "I
>> would like to select those agents that overpass the threshold of 0.9 in
>> any of the last two periods and are over the threshold until the end of
>> the sample period (ie, agents 4 and 5)."  To my eye, those criteria
>> include all agents except agent 6.  You're unlikely to get the results
>> you hope for unless you are precise in the criteria you're using.
>>
>> Hope that helps.
>>
>> -Sarah
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Miguel Angel Duran Munoz
>> Sent: Wednesday, May 22, 2013 11:00 AM
>> To: [email protected]
>> Subject: Re: st: Observations that keep a feature... an additional problem
>>
>> I use the same example as in a previous message, but I add a fifth
>> agent that joins in period six:
>>
>>
>> Agent 1: 1    1    1    1    1    1...
>> Agent 2: 0.8  1    1    1    1    1...
>> Agent 3: 0.8  0.8  0.8  1    1    1...
>> Agent 4: 0.8  0.8  0.8  0.8  1    1...
>> Agent 5:  .    .    .    .   .    1...
>>
>> I want to keep just the first three agents.
>>
>>
>> If you don't mind, Nick, I would also like to ask you the following. I
>> take the same example, but I focus on the last periods.
>>
>> Agent 1: ...1    1    1    1    1    1
>> Agent 2: ...0.8  1    1    1    1    1
>> Agent 3: ...0.8  0.8  0.8  1    1    1
>> Agent 4: ...0.8  0.8  0.8  0.8  1    1
>> Agent 5: ... .    .    .    .   .    1
>> Agent 6: ...0.8  0.8  0.8  0.8  1    0.8
>>
>> I would like to select those agents that overpass the threshold of 0.9
>> in any of the last two periods and are over the threshold until the end
>> of the sample period (ie, agents 4 and 5).
>> I have tried to modify the commands that you suggested before, but I
>> have not been able to get the right selection. Would you mind helping
>> me with this? Thank you very much.
>>
>>> I can't follow this.  I see only "the rules select too many agents".
>>>
>>> You tell me your precise rules and I will try to think of code to
>>> implement them.
>>>
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 22 May 2013 18:16, Miguel Angel Duran Munoz <[email protected]> wrote:
>>>> Nick, after reducing the sample using your suggestion, I have checked
>>>> the number of agents that there are per period. And the number is
>>>> increasing in time. I guess this is due to the fact that agents
>>>> joining the sample as time goes by and satisfying the requirement of
>>>> being above the threshold are not excluded. Is there any trick to
>>>> avoid including them? Thanks again.
>>>>
>>>>> Assuming variable names
>>>>>
>>>>> agent  period  score
>>>>>
>>>>> it seems that you want something like
>>>>>
>>>>> bysort agent (period) : gen first3 = _n < 4
>>>>>
>>>>> egen max_first3 = max(score / first3), by(agent)
>>>>>
>>>>> egen min_rest = min(score / !first3), by(agent)
>>>>>
>>>>> keep if max_first3 > 0.9 & min_rest > 0.9
>>>>>
>>>>> For the division trick in the -egen- call see e.g.
>>>>>
>>>>> http://www.stata.com/statalist/archive/2013-03/msg00917.html
>>>>>
>>>>> (reference included therein).
>>>>>
>>>>> Nick
>>>>> [email protected]
>>>>>
>>>>>
>>>>> On 22 May 2013 15:03, Miguel Angel Duran Munoz <[email protected]> wrote:
>>>>>> Nick, thanks for your help. I hope you can help me with another
>>>>>> question. For a similar analysis to that of my first message, assume
>>>>>> I want to keep those agents that have crossed the threshold before a
>>>>>> certain period and then have stayed above it for the rest of the
>>>>>> sample period.
>>>>>>
>>>>>> To illustrate the idea, consider the following (data refer to
>>>>>> consecutive periods and the threshold is, eg, 0.9):
>>>>>>
>>>>>> Agent 1: 1    1    1    1    1...
>>>>>> Agent 2: 0.8  1    1    1    1...
>>>>>> Agent 3: 0.8  0.8  0.8  1    1...
>>>>>> Agent 4: 0.8  0.8  0.8  0.8  1...
>>>>>>
>>>>>> I want to keep the first three agents because they have overpassed
>>>>>> the threshold before period 4 and then have been over the threshold
>>>>>> in the rest of the sample period, but I do not want to keep agent 4.
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Miguel.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Correct on -keep-. Sorry about that.
>>>>>>>
>>>>>>> The -sort- order
>>>>>>>
>>>>>>> bysort entity (const_a) :
>>>>>>>
>>>>>>> ensures that -const_a[1]- is the lowest for each agent, not the
>>>>>>> first. If the lowest value for each agent is above the threshold,
>>>>>>> then all the observations for that agent are above.
>>>>>>> Nick
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>> On 21 May 2013 23:16, Miguel Angel Duran Munoz <[email protected]> wrote:
>>>>>>>> Thanks, Nick. I guess you mean -keep- instead of -drop-.
>>>>>>>> Nevertheless, the command that you suggest would not guarantee
>>>>>>>> that I keep the agents that have been above the threshold for the
>>>>>>>> whole sample period (ie, I would be including agents that were
>>>>>>>> above the threshold in the first period and then might have been
>>>>>>>> above or below it).
>>>>>>>>
>>>>>>>>> Sounds like
>>>>>>>>>
>>>>>>>>> bysort entity (const_a) : drop if const_a[1] > 0.097116
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>> [email protected]
>>>>>>>>>
>>>>>>>>> On 21 May 2013 23:01, Miguel Angel Duran Munoz <[email protected]> wrote:
>>>>>>>>>> Hi, Statalisters. I want to focus on agents in my dataset that
>>>>>>>>>> have a particular feature; specifically, for those agents, and
>>>>>>>>>> for each and every period (out of 64), the value of a variable
>>>>>>>>>> (const_a) is larger than a particular threshold (0.097116). I
>>>>>>>>>> have done what I show below.
>>>>>>>>>> Nevertheless, I have realized that some of my agents are not in
>>>>>>>>>> the sample since the first period, so what I am doing would
>>>>>>>>>> mistakenly eliminate them. Will anyone help to solve this
>>>>>>>>>> problem? Thanks in advance.
>>>>>>>>>>
>>>>>>>>>> bysort entity (date2): gen obs=_n
>>>>>>>>>> drop if const_a<0.097116
>>>>>>>>>> by entity: drop if obs[_N]<64
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/