
# Re: st: Observations that keep a feature...

From: Nick Cox
To: "statalist@hsphsun2.harvard.edu"
Subject: Re: st: Observations that keep a feature...
Date: Thu, 23 May 2013 19:24:57 +0100

```
This is getting very intricate to follow.

As Sarah posted yesterday, more or less, we need examples.

I worry on your behalf that you will have to explain your rules to
somebody reviewing your thesis/dissertation/report/paper and they are
going to ask you why you couldn't use much simpler rules.

Nick
njcoxstata@gmail.com

On 23 May 2013 18:43, Miguel Angel Duran Munoz <maduran@uma.es> wrote:
> Nick and Sarah, thanks to your help I've been able to solve all but one of
> my problems. To select agents that are above the threshold after period 2,
> I've finally used:
>
> egen firstperiod = min(period), by(agent)
> drop if firstperiod > 2
> bysort agent (period): gen first2 = _n < 3
> egen min_rest = min(score / !first2), by(agent)
> keep if min_rest >= 0.9
>
> (the max condition that Nick suggested is, I think, unnecessary)
>
> Nevertheless, I am not sure about how to select agents that overpass the
> threshold in the final periods (say at or after t3) and remain above it.
> In principle, based on your suggestions, I thought of this:
>
> bysort agent (period): gen last=score[_N]
> bysort agent (period): gen first2 = _n < 3
> egen min_rest = min(score / !first2), by(agent)
> keep if last>=0.9 & min_rest<=0.9
>
> Nevertheless, this implies that I am excluding agents that satisfy the
> criterion (overpassing the threshold at or after t3) but appear in the
> sample at an intermediate period.
>
>
> Miguel.
>
>> Sarah, thank you for your help. I am very sorry for not having put my
>> doubts in a sufficiently clear way. And given what you say about the way
>> data is stored I have realized that there might be other problems around.
>> I will try to be as clear as possible.
>>
>> My data are in panel (long) form. I write the example down again in the way
>> my data are stored. Relative to the example in my previous messages, I add
>> two agents (6 and 7). Please note also that data for agent 5 are missing
>> in some periods, and there are no lines corresponding to those
>> periods (this is what I had not taken into account so far):
>>
>> time  agent   score
>> t1     1      0.8
>> t2     1      1
>> t3     1      1
>> t4     1      1
>> t5     1      1
>> t6     1      1
>>
>> t1     2      0.8
>> t2     2      0.8
>> t3     2      1
>> t4     2      1
>> t5     2      1
>> t6     2      1
>>
>> t1     3      0.8
>> t2     3      0.8
>> t3     3      0.8
>> t4     3      1
>> t5     3      1
>> t6     3      1
>>
>> t1     4      0.8
>> t2     4      0.8
>> t3     4      0.8
>> t4     4      0.8
>> t5     4      1
>> t6     4      1
>>
>> t6     5      1
>>
>> t1     6      0.8
>> t2     6      0.8
>> t3     6      0.8
>> t4     6      0.8
>> t5     6      1
>> t6     6      1
>>
>> t1     7      0.8
>> t2     7      1
>> t3     7      1
>> t4     7      0.8
>> t5     7      0.8
>> t6     7      1
>>
>> Having said that, I want to split the sample in different ways. First, I
>> want to focus on agents that overpass a threshold (eg, 0.9) since the
>> first period and are always above the threshold (ie, agent 1). Second, I
>> want to take agents that overpass the threshold at or before a particular
>> period (eg, t3) and since then they are above the threshold (ie, agents
>> 1-4). Third, agents that overpass the threshold at or after a particular
>> period (eg, t5) and since then they are above the threshold (ie, agents 5
>> and 6). Please note that agent 7 is not included in any of the previous
>> subsamples.
>>
>> Thank you very much for your help. And once again, I am sorry for not
>> having been clear enough.
>>
>> Miguel.
>>
>>
>>
>>
>>> Miguel,
>>> This discussion would be clearer if your examples actually made it clear
>>> exactly what your data looks like.
>>>
>>> Your example below looks like you have data in wide form.  The solution
>>> that Nick suggested is for data in long form.  It's easy enough to move
>>> between the two, but it's hard to make concrete suggestions about how to
>>> proceed when we don't know what the actual data looks like.
>>>
>>> I'll start by assuming, as Nick does, that your data is actually in long
>>> form and you have three variables: agent, period, score.  I'll further
>>> assume that for agent 5 you simply have no records for periods 1-5 (that
>>> is, you do not have records for those periods with missing values for
>>> score).  If that's true, you can simply calculate the first period that
>>> appears in the data and use that as part of your inclusion criteria.
>>> Something like the following will keep only those agents who first appear
>>> in the data in or before period 4:
>>> egen firstperiod=min(period), by(agent)
>>> drop if firstperiod>4
>>>
>>> Or maybe you only want to include agents who start in period 1?  It's
>>> unclear from your question.  In that case you'd -drop if firstperiod>1-
>>>
>>> For your second example, trying to look at the last time periods, I think
>>> you need to clarify what your actual criterion is.  You say "I would like
>>> to select those agents that overpass the threshold of 0.9 in any of the
>>> last two periods and are over the threshold until the end of the sample
>>> period (ie, agents 4 and 5)."  To my eye, that criterion includes all
>>> agents except agent 6.  You're unlikely to get the results you hope for
>>> unless you are precise in the criteria you're using.
>>>
>>> Hope that helps.
>>>
>>> -Sarah
>>>
>>>
>>> -----Original Message-----
>>> From: owner-statalist@hsphsun2.harvard.edu
>>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Miguel Angel
>>> Duran Munoz
>>> Sent: Wednesday, May 22, 2013 11:00 AM
>>> To: statalist@hsphsun2.harvard.edu
>>> Subject: Re: st: Observations that keep a feature... an additional
>>> problem
>>>
>>> I use the same example as in a previous message, but I add a fifth agent
>>> that joins in period six:
>>>
>>>
>>> Agent 1: 1    1    1    1    1    1...
>>> Agent 2: 0.8  1    1    1    1    1...
>>> Agent 3: 0.8  0.8  0.8  1    1    1...
>>> Agent 4: 0.8  0.8  0.8  0.8  1    1...
>>> Agent 5:  .    .    .    .   .    1...
>>>
>>> I want to keep just the first three agents.
>>>
>>>
>>> If you don't mind, Nick, I would also like to ask you the following. I
>>> take the same example, but I focus on the last periods.
>>>
>>> Agent 1: ...1    1    1    1    1    1
>>> Agent 2: ...0.8  1    1    1    1    1
>>> Agent 3: ...0.8  0.8  0.8  1    1    1
>>> Agent 4: ...0.8  0.8  0.8  0.8  1    1
>>> Agent 5: ... .    .    .    .   .    1
>>> Agent 6: ...0.8  0.8  0.8  0.8  1    0.8
>>>
>>> I would like to select those agents that overpass the threshold of 0.9 in
>>> any of the last two periods and are over the threshold until the end of the
>>> sample period (ie, agents 4 and 5).
>>> I have tried to modify the commands that you suggested to me before, but
>>> I have not been able to get the right selection. Would you mind helping me
>>> with this? Thank you very much.
>>>
>>>> I can't follow this.  I see only "the rules select too many agents".
>>>>
>>>> You tell me your precise rules and I will try to think of code to
>>>> implement them.
>>>>
>>>> Nick
>>>> njcoxstata@gmail.com
>>>>
>>>>
>>>> On 22 May 2013 18:16, Miguel Angel Duran Munoz <maduran@uma.es> wrote:
>>>>> Nick, after reducing the sample using your suggestion, I have checked
>>>>> the number of agents that there are per period. And the number is
>>>>> increasing in time. I guess this is due to the fact that agents
>>>>> joining the sample as time goes by and satisfying the requirement of
>>>>> being above the threshold are not excluded. Is there any trick to
>>>>> avoid including them? Thanks again.
>>>>>
>>>>>> Assuming variable names
>>>>>>
>>>>>> agent  period  score
>>>>>>
>>>>>> it seems that you want something like
>>>>>>
>>>>>> bysort agent (period) : gen first3 = _n < 4
>>>>>>
>>>>>> egen max_first3 = max(score / first3), by(agent)
>>>>>>
>>>>>> egen min_rest = min(score / !first3), by(agent)
>>>>>>
>>>>>> keep if max_first3 > 0.9 & min_rest > 0.9
>>>>>>
>>>>>> For the division trick in the -egen- call see e.g.
>>>>>>
>>>>>> http://www.stata.com/statalist/archive/2013-03/msg00917.html
>>>>>>
>>>>>> (reference included therein).
>>>>>>
>>>>>> Nick
>>>>>> njcoxstata@gmail.com
>>>>>>
>>>>>>
>>>>>> On 22 May 2013 15:03, Miguel Angel Duran Munoz <maduran@uma.es>
>>>>>> wrote:
>>>>>>> Nick, thanks for your help. I hope you can help me with another doubt.
>>>>>>> For a similar analysis to that of my first message, assume I want to
>>>>>>> keep those agents that have overpassed the threshold before a
>>>>>>> certain period and then have been over it in the rest of the sample
>>>>>>> period.
>>>>>>>
>>>>>>> To illustrate the idea, consider the following (data refer to
>>>>>>> consecutive periods and the threshold is, eg, 0.9):
>>>>>>>
>>>>>>> Agent 1: 1    1    1    1    1...
>>>>>>> Agent 2: 0.8  1    1    1    1...
>>>>>>> Agent 3: 0.8  0.8  0.8  1    1...
>>>>>>> Agent 4: 0.8  0.8  0.8  0.8  1...
>>>>>>>
>>>>>>> I want to keep the first three agents because they have overpassed
>>>>>>> the threshold before period 4 and then have been over the threshold
>>>>>>> in the rest of the sample period, but I do not want to keep agent 4.
>>>>>>>
>>>>>>>
>>>>>>> Miguel.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Correct on -keep-. Sorry about that.
>>>>>>>>
>>>>>>>> The -sort- order
>>>>>>>>
>>>>>>>> bysort entity (const_a) :
>>>>>>>>
>>>>>>>> ensures that -const_a[1]- is the lowest for each agent, not the
>>>>>>>> first.
>>>>>>>> If the lowest value for each agent is above the threshold, then
>>>>>>>> all the observations for that agent  are above.
>>>>>>>> Nick
>>>>>>>> njcoxstata@gmail.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On 21 May 2013 23:16, Miguel Angel Duran Munoz <maduran@uma.es>
>>>>>>>> wrote:
>>>>>>>>> Thanks, Nick. I guess you mean -keep- instead of -drop-.
>>>>>>>>> Nevertheless, the command that you suggest would not guarantee
>>>>>>>>> that I keep the agents that have been above the threshold for the
>>>>>>>>> whole sample period (ie, I would be including agents that were
>>>>>>>>> above the threshold in the first period and then might have been
>>>>>>>>> above or below it).
>>>>>>>>>
>>>>>>>>>> Sounds like
>>>>>>>>>>
>>>>>>>>>> bysort entity (const_a) : drop if const_a[1] > 0.097116
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> njcoxstata@gmail.com
>>>>>>>>>>
>>>>>>>>>> On 21 May 2013 23:01, Miguel Angel Duran Munoz <maduran@uma.es>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi, Statalisters. I want to focus on agents in my dataset that
>>>>>>>>>>> have a particular feature; specifically, for those agents, and
>>>>>>>>>>> for each and every period (out of 64), the value of a variable
>>>>>>>>>>> (const_a) is larger than a particular threshold (0.097116). I
>>>>>>>>>>> have done what I show below.
>>>>>>>>>>> Nevertheless, I have realized that some of my agents are not in
>>>>>>>>>>> the sample since the first period, so what I am doing would
>>>>>>>>>>> mistakenly eliminate them. Will anyone help to solve this
>>>>>>>>>>>
>>>>>>>>>>> bysort entity (date2): gen obs=_n
>>>>>>>>>>> drop if const_a<0.097116
>>>>>>>>>>> by entity: drop if obs[_N]<64
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> *   http://www.ats.ucla.edu/stat/stata/
```
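One way to make the rules in this thread simpler to state is to record, for each agent, the first observed period from which the score stays above the threshold through that agent's last observation. Each subsample is then a single comparison on that variable. The following is a minimal sketch of that idea, not code posted by anyone in the thread. It assumes long-format data with numeric variables `agent`, `period` (1 to 6 rather than the strings t1 to t6) and `score`, a threshold of 0.9, and that "above" means strictly greater than the threshold. The variable names `last_low`, `first_period`, `stays_from`, `always_above`, `by_t3`, `from_t5` and `one` are made up for illustration.

```stata
* Sketch only: the example data as posted, with time recoded as numeric period
clear
input period agent score
1 1 0.8
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
1 2 0.8
2 2 0.8
3 2 1
4 2 1
5 2 1
6 2 1
1 3 0.8
2 3 0.8
3 3 0.8
4 3 1
5 3 1
6 3 1
1 4 0.8
2 4 0.8
3 4 0.8
4 4 0.8
5 4 1
6 4 1
6 5 1
1 6 0.8
2 6 0.8
3 6 0.8
4 6 0.8
5 6 1
6 6 1
1 7 0.8
2 7 1
3 7 1
4 7 0.8
5 7 0.8
6 7 1
end

* Last period in which the agent is at or below the threshold
* (missing if never; a missing score is NOT counted as low here)
egen last_low = max(cond(score <= 0.9, period, .)), by(agent)

* First period in which the agent appears at all
egen first_period = min(period), by(agent)

* First observed period after the last low one, i.e. the period from which
* the agent stays above the threshold; missing if the final observation is low
egen stays_from = min(cond(missing(last_low) | period > last_low, period, .)), by(agent)

* Each selection rule is now one comparison
gen byte always_above = missing(last_low)
gen byte by_t3   = stays_from <= 3
gen byte from_t5 = stays_from >= 5 & !missing(stays_from)

* One line per agent for checking the rules by eye
egen byte one = tag(agent)
list agent first_period stays_from always_above by_t3 from_t5 if one, noobs
```

The `cond(..., period, .)` calls play the same role as the division trick quoted in the thread: observations that should not count are turned into missing values, which `egen` ignores. Agents that join late can be excluded separately with a condition on `first_period`, for example `keep if by_t3 & first_period == 1`. Listing one line per agent makes it easy to see where the result differs from the intended subsamples, which is the kind of concrete example the replies ask for.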