

From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Observations that keep a feature...
Date: Thu, 23 May 2013 19:24:57 +0100

This is getting very intricate to follow. As Sarah posted yesterday,
more or less, we need examples.

I worry on your behalf that you will have to explain your rules to
somebody reviewing your thesis/dissertation/report/paper and they are
going to ask you why you couldn't use much simpler rules.

Nick
njcoxstata@gmail.com

On 23 May 2013 18:43, Miguel Angel Duran Munoz <maduran@uma.es> wrote:
> Nick and Sarah, thanks to your help I've been able to solve all but one of
> my problems. To select agents that are above the threshold after period 2,
> I've finally used:
>
> egen firstperiod = min(period), by(agent)
> drop if firstperiod > 2
> bysort agent (period): gen first2 = _n < 3
> egen min_rest = min(score / !first2), by(agent)
> keep if min_rest >= 0.9
>
> (the max condition that Nick suggested me is, I think, unnecessary)
>
> Nevertheless, I am not sure about how to select agents that overpass the
> threshold in the final periods (say at or after t3) and maintain over it.
> In principle, based on your suggestions, I thought of this:
>
> bysort agent (period): gen last=score[_N]
> bysort entity (date2): gen first2 = _n < 3
> egen min_rest = min(score / !first2), by(agent)
> keep if last>=0.9 & min_rest<=0.9
>
> Nevertheless, this implies that I am excluding agents that satisfy the
> criterion (overpassing the threshold at or after t3) but appear in the
> sample at an intermediate period.
>
> Will someone please help to solve this? Thanks in advance.
>
> Miguel.
>
>> Sarah, thank you for your help. I am very sorry for not having put my
>> doubts in a sufficiently clear way. And given what you say about the way
>> data is stored I have realized that there might be other problems around.
>> I will try to be as clear as possible.
>>
>> My data is in panel data form. I write the example down again in the way
>> my data is stored. As regards the example in my previous messages, I add
>> two agents (6 and 7). Please note also that data referring to agent fifth
>> is missing in some periods, but there is no line corresponding to those
>> periods (this is what I had not taken into account so far):
>>
>> time  agent  score
>> t1    1      0.8
>> t2    1      1
>> t3    1      1
>> t4    1      1
>> t5    1      1
>> t6    1      1
>>
>> t1    2      0.8
>> t2    2      0.8
>> t3    2      1
>> t4    2      1
>> t5    2      1
>> t6    2      1
>>
>> t1    3      0.8
>> t2    3      0.8
>> t3    3      0.8
>> t4    3      1
>> t5    3      1
>> t6    3      1
>>
>> t1    4      0.8
>> t2    4      0.8
>> t3    4      0.8
>> t4    4      0.8
>> t5    4      1
>> t6    4      1
>>
>> t6    5      1
>>
>> t1    6      0.8
>> t2    6      0.8
>> t3    6      0.8
>> t4    6      0.8
>> t5    6      1
>> t6    6      1
>>
>> t1    7      0.8
>> t2    7      1
>> t3    7      1
>> t4    7      0.8
>> t5    7      0.8
>> t6    7      1
>>
>> Having said that, I want to split the sample in different ways. First, I
>> want to focus on agents that overpass a threshold (eg, 0.9) since the
>> first period and are always above the threshold (ie, agent 1). Second, I
>> want to take agents that overpass the threshold at or before a particular
>> period (eg, t3) and since then they are above the threshold (ie, agents
>> 1-4). Third, agents that overpass the threshold at or after a particular
>> period (eg, t5) and since then they are above the threshold (ie, agents 5
>> and 6). Please note that agent 7 is not included in any of the previous
>> subsamples.
>>
>> Thank you very much for your help. And once again, I am sorry for not
>> having been clear enough.
>>
>> Miguel.
>>
>>> Miguel,
>>> This discussion would be clearer if your examples actually made it clear
>>> exactly what your data looks like.
>>>
>>> Your example below looks like you have data in wide form. The solution
>>> that Nick suggested is for data in long form. It's easy enough to move
>>> between the two, but it's hard to make concrete suggestions about how to
>>> proceed when we don't know what the actual data looks like.
>>>
>>> I'll start by assuming, as Nick does, that your data is actually in long
>>> form and you have three variables: agent, period, score. I'll further
>>> assume that for agent 5 you simply have no records for periods 1-5 (that
>>> is, you do not have records for those periods with missing values for
>>> score). If that's true, you can simply calculate the first period that
>>> appears in the data and use that as part of your inclusion criteria.
>>> Something like the following will keep only those agents who first appear
>>> in the data before period 4:
>>> egen firstperiod=min(period), by(agent)
>>> drop if firstperiod>4
>>>
>>> Or maybe you only want to include agents who start in period 1? It's
>>> unclear from your question. In that case you'd -drop if firstperiod>1-
>>>
>>> For your second example, trying to look at the last time periods, I think
>>> you need to clarify what your actual criteria is. You say "I would like
>>> to select those agents that overpass the threshold of 0.9 in any the last
>>> two periods and are over the threshold until the end of the sample period
>>> (ie, agents 4 and 5)." To my eye, that criteria includes all agents
>>> except agent 6. You're unlikely to get the results you hope for unless
>>> you are precise in the criteria you're using.
>>>
>>> Hope that helps.
>>>
>>> -Sarah
>>>
>>>
>>> -----Original Message-----
>>> From: owner-statalist@hsphsun2.harvard.edu
>>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Miguel Angel
>>> Duran Munoz
>>> Sent: Wednesday, May 22, 2013 11:00 AM
>>> To: statalist@hsphsun2.harvard.edu
>>> Subject: Re: st: Observations that keep a feature... an additional problem
>>>
>>> I use the same example than in a previous message, but I add a fifth agent
>>> that joins in period six:
>>>
>>> Agent 1: 1 1 1 1 1 1...
>>> Agent 2: 0.8 1 1 1 1 1...
>>> Agent 3: 0.8 0.8 0.8 1 1 1...
>>> Agent 4: 0.8 0.8 0.8 0.8 1 1...
>>> Agent 5: . . . . . 1...
>>>
>>> I want to keep just the first three agents.
>>>
>>> If you don't mind, Nick, I would also like to ask you the following. I
>>> take the same example, but I focus on the last periods.
>>>
>>> Agent 1: ...1 1 1 1 1 1
>>> Agent 2: ...0.8 1 1 1 1 1
>>> Agent 3: ...0.8 0.8 0.8 1 1 1
>>> Agent 4: ...0.8 0.8 0.8 0.8 1 1
>>> Agent 5: ... . . . . . 1
>>> Agent 6: ...0.8 0.8 0.8 0.8 1 0.8
>>>
>>> I would like to select those agents that overpass the threshold of 0.9 in
>>> any the last two periods and are over the threshold until the end of the
>>> sample period (ie, agents 4 and 5).
>>> I have tried to modify the commands that you have suggested me before, but
>>> I have not been able to get the right selection. Would you mind helping me
>>> with this? Thank you very much.
>>>
>>>> I can't follow this. I see only "the rules select too many agents".
>>>>
>>>> You tell me your precise rules and I will try to think of code to
>>>> implement them.
>>>>
>>>> Nick
>>>> njcoxstata@gmail.com
>>>>
>>>>
>>>> On 22 May 2013 18:16, Miguel Angel Duran Munoz <maduran@uma.es> wrote:
>>>>> Nick, after reducing the sample using your suggestion, I have checked
>>>>> the number of agents that there are per period. And the number is
>>>>> increasing in time. I guess this is due to the fact that agents
>>>>> joining the sample as time goes by and satisfying the requirement of
>>>>> being above the threshold are not excluded. Is there any trick to
>>>>> avoid including them? Thanks again.
>>>>>
>>>>>> Assuming variable names
>>>>>>
>>>>>> agent period score
>>>>>>
>>>>>> it seems that you want something like
>>>>>>
>>>>>> bysort agent (period) : gen first3 = _n < 4
>>>>>>
>>>>>> egen max_first3 = max(score / first3), by(agent)
>>>>>>
>>>>>> egen min_rest = min(score / !first3), by(agent)
>>>>>>
>>>>>> keep if max_first3 > 0.9 & min_rest > 0.9
>>>>>>
>>>>>> For the division trick in the -egen- call see e.g.
>>>>>>
>>>>>> http://www.stata.com/statalist/archive/2013-03/msg00917.html
>>>>>>
>>>>>> (reference included therein).
>>>>>>
>>>>>> Nick
>>>>>> njcoxstata@gmail.com
>>>>>>
>>>>>>
>>>>>> On 22 May 2013 15:03, Miguel Angel Duran Munoz <maduran@uma.es> wrote:
>>>>>>> Nick, thanks for your help. I hope you can help me with another
>>>>>>> doubt. For a similar analysis to that of my first message, assume I
>>>>>>> want to keep those agents that have overpass the threshold before a
>>>>>>> certain period and then have been over it in the rest of the sample
>>>>>>> period.
>>>>>>>
>>>>>>> To illustrate the idea, consider the following (data refer to
>>>>>>> consecutive periods and the threshold is, eg, 0.9):
>>>>>>>
>>>>>>> Agent 1: 1 1 1 1 1...
>>>>>>> Agent 2: 0.8 1 1 1 1...
>>>>>>> Agent 3: 0.8 0.8 0.8 1 1...
>>>>>>> Agent 4: 0.8 0.8 0.8 0.8 1...
>>>>>>>
>>>>>>> I want to keep the first three agents because they have overpassed
>>>>>>> the threshold before period 4 and then have been over the threshold
>>>>>>> in the rest of the sample period, but I do not want to keep agent 4.
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Miguel.
>>>>>>>
>>>>>>>> Correct on -keep-. Sorry about that.
>>>>>>>>
>>>>>>>> The -sort- order
>>>>>>>>
>>>>>>>> bysort entity (const_a) :
>>>>>>>>
>>>>>>>> ensures that -const_a[1]- is the lowest for each agent, not the
>>>>>>>> first.
>>>>>>>> If the lowest value for each agent is above the threshold, then
>>>>>>>> all the observations for that agent are above.
>>>>>>>> Nick
>>>>>>>> njcoxstata@gmail.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On 21 May 2013 23:16, Miguel Angel Duran Munoz <maduran@uma.es> wrote:
>>>>>>>>> Thanks, Nick. I guess you mean -keep- instead of -drop-.
>>>>>>>>> Nevertheless, the command that you suggest would not guarantee
>>>>>>>>> that I keep the agents that have been above the threshold for the
>>>>>>>>> whole sample period (ie, I would be including agents that were
>>>>>>>>> above the threshold in the first period and then might have been
>>>>>>>>> above or below it).
>>>>>>>>>
>>>>>>>>>> Sounds like
>>>>>>>>>>
>>>>>>>>>> bysort entity (const_a) : drop if const_a[1] > 0.09716
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> njcoxstata@gmail.com
>>>>>>>>>>
>>>>>>>>>> On 21 May 2013 23:01, Miguel Angel Duran Munoz <maduran@uma.es> wrote:
>>>>>>>>>>> Hi, Statalisters. I want to focus on agents in my dataset that
>>>>>>>>>>> have a particular feature; specifically, for those agents, and
>>>>>>>>>>> for each and every period (out of 64), the value of a variable
>>>>>>>>>>> (const_a) is larger than a particular threshold (0.097116). I
>>>>>>>>>>> have done what I show below.
>>>>>>>>>>> Nevertheless, I have realized that some of my agents are not in
>>>>>>>>>>> the sample since the first period, so what I am doing would
>>>>>>>>>>> mistakenly eliminate them. Will anyone help to solve this
>>>>>>>>>>> problem? Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> bysort entity (date2): gen obs=_n
>>>>>>>>>>> drop if const_a<0.097116
>>>>>>>>>>> by entity: drop if obs[_N]<64

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
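
The request for simpler rules and a concrete example can be illustrated with a minimal sketch. This is not code posted by anyone in the thread. It assumes the long-form listing quoted above, with a numeric period (1 to 6 standing in for t1 to t6), no records with missing score, and three hypothetical helper variables: first_above, last_below and stays. The assumed rule is that an agent qualifies if, once its score is above 0.9, it never returns to 0.9 or below. The subsamples then differ only in the window allowed for the first crossing, and late entrants need no separate condition on their first period because they have no early observations. The results need not match the agent lists in the quoted message, since the listing and the lists do not fully agree (agents 4 and 6, for instance, have identical rows).

```stata
* Sketch only. period, agent, score follow the thread; first_above,
* last_below, stays and tag are illustrative names, not from the posters.
clear
input period agent score
1 1 0.8
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
1 2 0.8
2 2 0.8
3 2 1
4 2 1
5 2 1
6 2 1
1 3 0.8
2 3 0.8
3 3 0.8
4 3 1
5 3 1
6 3 1
1 4 0.8
2 4 0.8
3 4 0.8
4 4 0.8
5 4 1
6 4 1
6 5 1
1 6 0.8
2 6 0.8
3 6 0.8
4 6 0.8
5 6 1
6 6 1
1 7 0.8
2 7 1
3 7 1
4 7 0.8
5 7 0.8
6 7 1
end

* First period in which the agent is above the threshold (missing if never).
* The !missing() guard matters because missing counts as larger than 0.9.
egen first_above = min(cond(score > 0.9 & !missing(score), period, .)), by(agent)

* Last period in which the agent is at or below the threshold (missing if never)
egen last_below = max(cond(score <= 0.9, period, .)), by(agent)

* 1 if the agent never falls back after first crossing the threshold
gen byte stays = !missing(first_above) & (missing(last_below) | last_below < first_above)

* One row per agent, to check the classification by eye
egen byte tag = tag(agent)
list agent first_above last_below stays if tag, noobs

* Each subsample is then a single condition on first_above, for example:
*   keep if stays & first_above == 1    // above from the first sample period
*   keep if stays & first_above <= 3    // crossed at or before period 3
*   keep if stays & first_above >= 5    // crossed, or entered above, from period 5
```

On this listing agent 7 is the only agent with stays equal to 0, because it rises above 0.9 in period 2 and drops back in period 4. Whether an agent such as 5, which enters the sample already above the threshold, should count as crossing at its entry period is a choice the rules have to state explicitly. This sketch treats it that way.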

**Follow-Ups**:

- **Re: st: Observations that keep a feature...** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>

**References**:

- **st: Observations that keep a feature in the whole sample period** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>
- **Re: st: Observations that keep a feature in the whole sample period** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Observations that keep a feature in the whole sample period** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>
- **Re: st: Observations that keep a feature in the whole sample period** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Observations that keep a feature in the whole sample period** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>
- **Re: st: Observations that keep a feature in the whole sample period** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Observations that keep a feature... an additional problem** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>
- **Re: st: Observations that keep a feature... an additional problem** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: Observations that keep a feature... an additional problem** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>
- **RE: st: Observations that keep a feature... an additional problem** *From:* "Sarah Edgington" <sedging@ucla.edu>
- **RE: st: Observations that keep a feature... an additional problem** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>
- **RE: st: Observations that keep a feature...** *From:* "Miguel Angel Duran Munoz" <maduran@uma.es>
