# st: Identifying first observation in each panel after regression

| Field | Value |
|---|---|
| From | Ivan Png |
| To | statalist@hsphsun2.harvard.edu |
| Subject | st: Identifying first observation in each panel after regression |
| Date | Tue, 5 Jun 2012 05:50:18 -0400 |

```
Many thanks.  Sorry, you are right.  I misstated the problem.  What I meant was this:

When I run the regression, it shows 2773 groups (companies).  But when I run
. gen rdsample = 1 if e(sample)
. by gvkey , sort : gen flag = 1 if _n == 1
/* flag first observation of each company */

. su year if flag == 1 & rdsample == 1
It indicates 1048 unique companies.  I do not understand where the
other 2773 - 1048 = 1725 companies are.

Anyhow, a friend just suggested the following (and it works):

. sort  rdsample gvkey year
. by  rdsample gvkey , sort: gen flag = 1 if rdsample == 1 & _n == 1
. su year if flag == 1
This shows 2773 companies.  I just do not understand why.

On 4 June 2012 22:36, Steve Samuels <sjsamuels@gmail.com> wrote:
> Correction: the "flag2" statement is run after the regression.
> Your claim of discrepancy is false, and you did not test it in the do file, which runs the "by gvkey:" statement only after -xtreg-.
>
> . by gvkey , sort : gen flag1 = 1 if _n ==1   // before the xtreg statement
>
> . by gvkey , sort : gen flag2 = 1 if _n ==1  // after the xtreg statement
>
> . tab flag1 flag2, missing
>            |         flag2
>      flag1 |         1          . |     Total
> -----------+----------------------+----------
>          1 |     6,982          0 |     6,982
>          . |         0     70,797 |    70,797
> -----------+----------------------+----------
>      Total |     6,982     70,797 |    77,779
> On Jun 4, 2012, at 8:13 PM, Ivan Png wrote:
>
> Thanks, Nick.
>
> Here's the code
>
> And here's the data
>
>
> On 4 June 2012 19:00, Nick Cox <njcoxstata@gmail.com> wrote:
>> It should make absolutely no difference whether you do this before or
>> after a regression. I think we need to see evidence of what you think
>> is happening in terms of a dataset you provide in its entirety or
>> [...] your puzzlement with Stata tech-support. They would want a copy of [...]
>> On Mon, Jun 4, 2012 at 11:43 PM, Ivan Png <iplpng@gmail.com> wrote:
>>> What I don't understand: Why the
>>>
>>> . by gvkey , sort : gen flag = 1 if _n ==1
>>>
>>> works when I invoke it before the regression (it then picks up the
>>> first observation of each company), but not when I invoke it after the
>>> regression (it misses many companies).
>>>
>>> I used exactly the same command in both cases.
>>>
>>>
>>> On 4 June 2012 18:31, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> Which bit don't you understand?
>>>>
>>>> On Mon, Jun 4, 2012 at 11:16 PM, Ivan Png <iplpng@gmail.com> wrote:
>>>>> Dear Nick--
>>>>>
>>>>> Many thanks for your hint.  I found the solution.  I execute
>>>>> . by gvkey , sort: gen flag = 1 if  _n == 1
>>>>> before the regression.
>>>>>
>>>>> Then, after the regression, I execute
>>>>> . gen regsample = 1 if e(sample)
>>>>>
>>>>> And, to identify the first observation of each company in the
>>>>> regression sample, I use
>>>>>  regsample == 1 & flag == 1
>>>>>
>>>>> However, I still don't understand the reason it works.
>>>>>
>>>>>
>>>>> On 4 June 2012 14:24, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>> What code do you mean by "the code below"?
>>>>>>
>>>>>> I suspect there's something else up with your dataset that leads to
>>>>>> what you see. Examine the data omitted by
>>>>>>
>>>>>> . edit if !e(sample)
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 4, 2012 at 6:44 PM, Ivan Png <iplpng@gmail.com> wrote:
>>>>>>> Many thanks, Nick.  Incidentally, thanks for the yeoman service to all
>>>>>>> STATAlisters.
>>>>>>>
>>>>>>> The discrepancy I found was by using xtreg to run a fixed-effects
>>>>>>> regression on the sample.  xtreg reported 2773 companies.  Yet, when I
>>>>>>> used the code below on the regression sample, I got only 1048
>>>>>>> companies.  So, the only reason I could think of was that the flag
>>>>>>> identified only companies that were present in year 1.
>>>>>>
>>>>>> On 4 June 2012 13:21, Nick Cox <n.j.cox@durham.ac.uk> wrote:
>>>>>>
>>>>>>>> Your code looks fine to me, so I have difficulty understanding why you think it doesn't work.
>>>>>>>>
>>>>>>>> The -sort- on the second command is unnecessary given the previous command, but I don't see that it will change the sort order.
>>>>>>>>
>>>>>>>> You can check logic in terms of this example:
>>>>>>>>
>>>>>>>> . webuse grunfeld
>>>>>>>>
>>>>>>>> . su year
>>>>>>>>
>>>>>>>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>>>>>>>> -------------+--------------------------------------------------------
>>>>>>>>         year |       200      1944.5    5.780751       1935       1954
>>>>>>>>
>>>>>>>> . drop if year == 1935 & mod(company, 2)
>>>>>>>> (5 observations deleted)
>>>>>>>>
>>>>>>>> . tab year
>>>>>>>>
>>>>>>>>        year |      Freq.     Percent        Cum.
>>>>>>>> ------------+-----------------------------------
>>>>>>>>        1935 |          5        2.56        2.56
>>>>>>>>        1936 |         10        5.13        7.69
>>>>>>>>        1937 |         10        5.13       12.82
>>>>>>>>        1938 |         10        5.13       17.95
>>>>>>>>        1939 |         10        5.13       23.08
>>>>>>>>        1940 |         10        5.13       28.21
>>>>>>>>        1941 |         10        5.13       33.33
>>>>>>>>        1942 |         10        5.13       38.46
>>>>>>>>        1943 |         10        5.13       43.59
>>>>>>>>        1944 |         10        5.13       48.72
>>>>>>>>        1945 |         10        5.13       53.85
>>>>>>>>        1946 |         10        5.13       58.97
>>>>>>>>        1947 |         10        5.13       64.10
>>>>>>>>        1948 |         10        5.13       69.23
>>>>>>>>        1949 |         10        5.13       74.36
>>>>>>>>        1950 |         10        5.13       79.49
>>>>>>>>        1951 |         10        5.13       84.62
>>>>>>>>        1952 |         10        5.13       89.74
>>>>>>>>        1953 |         10        5.13       94.87
>>>>>>>>        1954 |         10        5.13      100.00
>>>>>>>> ------------+-----------------------------------
>>>>>>>>       Total |        195      100.00
>>>>>>>>
>>>>>>>> . bysort company (year) : gen first = _n == 1
>>>>>>>>
>>>>>>>> . l company year  if first
>>>>>>>>
>>>>>>>>      +----------------+
>>>>>>>>      | company   year |
>>>>>>>>      |----------------|
>>>>>>>>   1. |       1   1936 |
>>>>>>>>  20. |       2   1935 |
>>>>>>>>  40. |       3   1936 |
>>>>>>>>  59. |       4   1935 |
>>>>>>>>  79. |       5   1936 |
>>>>>>>>      |----------------|
>>>>>>>>  98. |       6   1935 |
>>>>>>>> 118. |       7   1936 |
>>>>>>>> 137. |       8   1935 |
>>>>>>>> 157. |       9   1936 |
>>>>>>>> 176. |      10   1935 |
>>>>>>>>      +----------------+
>>>>>>>>
>>>>>>>>
>>>>>>>> Ivan Png
>>>>>>>>
>>>>>>>> I am analyzing an unbalanced panel of company data, organized by
>>>>>>>> company (gvkey) and year.  I want to create  a flag to the first
>>>>>>>> observation of each company in the panel.  I tried
>>>>>>>>
>>>>>>>> . sort gvkey year
>>>>>>>> . by gvkey , sort: gen flag = 1 if  _n == 1
>>>>>>>>
>>>>>>>> However, this set flag = 1 only if a company was present in year 1
>>>>>>>> of the panel.  It missed any company that first appeared in later years.
>>>>>>>>
>>>>>>>> I searched statalist and found this:
>>>>>>>> http://www.stata.com/statalist/archive/2005-04/msg00334.html
>>>>>>>>
>>>>>>>> But it doesn't work.  I'd be grateful for any relevant help.
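
* A minimal sketch, not taken from the thread, of one way the two counts
* discussed above could differ. Assumption: for some companies the earliest
* observation is not in e(sample), for example because a variable used in
* the regression is missing there. In that case "flag == 1 & rdsample == 1"
* picks up only companies whose first row overall was also used by -xtreg-.
* The grunfeld data, the artificial missing values, and the variable names
* insample, first_all and first_used are illustrative only.
* Run as a do-file (the // comments are not accepted interactively).

webuse grunfeld, clear
replace invest = . if year == 1935 & mod(company, 2)   // odd companies lose 1935
xtset company year
xtreg invest mvalue kstock, fe
gen byte insample = e(sample)

* First observation of each company in the full dataset
bysort company (year): gen byte first_all = _n == 1
count if first_all & insample     // 5 here: only companies whose first row was used

* First observation of each company within the estimation sample
bysort insample company (year): gen byte first_used = insample & _n == 1
count if first_used               // 10 here: one row per company in the regression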
```