Statalist The Stata Listserver



RE: st: Encode/destring


From: "Nick Cox" <[email protected]>
To: <[email protected]>
Subject: RE: st: Encode/destring
Date: Sat, 25 Feb 2006 18:05:30 -0000

Despite the title, the issue here is one-to-one mapping 
from string identifiers to numeric identifiers. 

As Giorgia points out, -destring, ignore- is quite wrong for 
her problem, as ignoring the non-numeric characters throws away 
important information. 
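
A minimal sketch of the collision, using the two identifiers from Giorgia's example. The variable names firm_id and bad_id are made up for illustration, and the ignore() list assumes the only non-numeric characters are F, R, G and B.

```stata
clear
input str7 firm_id
"FR12345"
"GB12345"
end

* stripping the letters leaves the same number for two different firms
destring firm_id, generate(bad_id) ignore("FRGB")
list

* both observations now carry 12345, so the firms can no longer be told apart
assert bad_id[1] == bad_id[2]
```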

Joseph's solution is a reinvention of -egen, group()-. 
It shows the logic to follow, but for convenience 
you can do it directly: 

egen numeric_panel_id = group(string_panel_id) 
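
As a rough illustration on a toy dataset (the variable year and the four observations are invented here, not taken from Giorgia's data), the following checks that the mapping is one-to-one and then declares the panel. -xtset- is the command in current Stata; releases from around the time of this thread would use -tsset- or -iis- instead.

```stata
clear
input str7 string_panel_id int year
"FR12345" 2004
"FR12345" 2005
"GB12345" 2004
"GB12345" 2005
end

* long is requested explicitly so that many thousands of groups fit
egen long numeric_panel_id = group(string_panel_id)

* each string identifier should map to exactly one number, and vice versa
bysort string_panel_id (numeric_panel_id): assert numeric_panel_id[1] == numeric_panel_id[_N]
bysort numeric_panel_id (string_panel_id): assert string_panel_id[1] == string_panel_id[_N]

* the numeric identifier can now serve as the panel variable
xtset numeric_panel_id year
```

Unlike -encode-, -egen, group()- attaches no value labels by default, so keep string_panel_id in the dataset if you want to look up which firm a number refers to.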

(Incidentally, keeping track of all the non-numeric
characters in a string variable is not that difficult. 
A utility -charlist- on SSC is dedicated to this 
small question.) 
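
A sketch of how that might look, assuming an internet connection for the SSC download and a made-up string variable firm_id. -charlist- is a user-written command, so check its help file for the exact options and returned results in the version you install.

```stata
clear
input str7 firm_id
"FR12345"
"GB12345"
end

* install -charlist- from SSC only if it is not already present
capture which charlist
if _rc ssc install charlist

* display the distinct characters found in the variable
charlist firm_id

* inspect whatever results the command leaves behind
return list
```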

(Giorgia: the Statalist FAQ explains the Statalist 
convention of using -cmdname- to refer to a command
of that name.) 

Nick
[email protected] 

Joseph Coveney

> First, generate a numeric variable that takes the value one at the first
> observation of a (sorted) panel unit, and zero at all succeeding
> observations of that panel unit.  Then -sum()- the numeric variable across
> the dataset.  The technique is illustrated below with dummy data of about
> 150 000 panel units.
> 
> clear
> set more off
> set seed `=date("2006-02-25", "ymd")'
> set obs 150000
> generate str panel_unit = string(uniform(), "%19.18g")
> *
> * Begin here
> *
> bysort panel_unit: generate byte panel_number = _n == 1
> replace panel_number = sum(panel_number)
> exit
> 
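
To see that the two routes agree, here is a small check on five invented observations. It declares the counter as long from the start, which is a change from the byte in the code above, so that the running sum does not depend on -replace- promoting the storage type. The name check_id is hypothetical.

```stata
clear
input str7 panel_unit
"GB12345"
"FR12345"
"GB12345"
"DE00001"
"FR12345"
end

* flag the first observation of each panel unit, then take the running sum
bysort panel_unit: generate long panel_number = _n == 1
replace panel_number = sum(panel_number)

* the same mapping in one line
egen long check_id = group(panel_unit)

* both give 1 for DE00001, 2 for FR12345 and 3 for GB12345
assert panel_number == check_id
```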

Giorgia Maffini 

> I am working with a panel of more than 70,000 firms.
> When running FE and RE I need to specify the panel unit (firms in my
> dataset). The panel unit has to be recorded as a numeric variable, as I
> understand.
> 
> In my data the firm identifier is a STRING variable with both numbers and
> letters. Example: firm with identifier FR12345 is different from firm
> with identifier GB12345.
> 
> I used DESTRING-IGNORE but
> 1) it is difficult to track down all the characters present in the firm
> identifier variable
> 2) Different firms will get the same id number. Example: FR12345 and
> GB12345.
> 
> I used ENCODE but I got the following error message (134): You attempted
> to encode a string variable that takes on more than 65,536 unique values.
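
Before choosing between -encode- and -egen, group()-, it can help to count how many distinct identifiers there are and compare that with the limit quoted in the error message. A minimal sketch, with the variable names firm_id and first_obs and the three observations invented for illustration:

```stata
clear
input str7 firm_id
"FR12345"
"GB12345"
"FR12345"
end

* tag() marks exactly one observation per distinct value of firm_id
egen byte first_obs = tag(firm_id)
count if first_obs
local n_firms = r(N)

display "distinct firm identifiers: `n_firms'"
```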
 
