



Re: st: Dropping Alphanumeric elements from variables


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Dropping Alphanumeric elements from variables
Date   Thu, 7 Feb 2013 18:29:57 +0000

For completeness a -substr()- approach may be of interest.

Jeffrey said villages are the first 3 letters, so that is

gen village_id = substr(id, 1, 3)

and the individual ids are the last 5 characters

gen indiv_id = substr(id, -5, 5)

So you can put them together with

egen newid = group(village_id indiv_id), label

Still, -moss- can crack harder problems than this.
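
For anyone who wants to check the logic outside Stata, here is a minimal Python sketch of the same split-and-group idea, run on the four example identifiers from this thread. The function name split_and_group is hypothetical, and numbering the groups in sorted order of the (village, individual) pairs is an assumption intended to mimic -egen, group()-; it is an illustration, not a replacement for the Stata commands above.

```python
def split_and_group(ids):
    """Split each id into (village, individual) and number the distinct pairs."""
    # First 3 characters are the village code and the last 5 are the
    # individual id, mirroring substr(id, 1, 3) and substr(id, -5, 5).
    pairs = [(s[:3], s[-5:]) for s in ids]
    # Assumption: groups are numbered 1, 2, ... in sorted order of the pairs.
    codes = {pair: n for n, pair in enumerate(sorted(set(pairs)), start=1)}
    return [codes[pair] for pair in pairs]


ids = ["BTG09A00001", "BTG10A00001", "BGM09A00027", "BGM10A00027"]
print(split_and_group(ids))  # [2, 2, 1, 1]
```

The two year-specific identifiers for each household end up in the same group, which is what is needed for a panel identifier.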

On Thu, Feb 7, 2013 at 2:43 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> That is, it sounds as if
>
> egen newid = group(a_match1 n_match2), label
>
> could work well for you. For more explanation, please see the 2007
> paper whose URL is given below.
>
> Nick
>
> On Thu, Feb 7, 2013 at 2:27 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> You could parse your identifiers using -substr()- to split out parts.
>> I've found many times that people underestimate the possibilities of
>> the very simplest string functions. There is a tutorial on functions
>> often neglected in a 2011 paper
>>
>> http://www.stata-journal.com/article.html?article=dm0058
>>
>> Or you could use regular expression tools. -moss- from SSC could work
>> with examples like this:
>>
>> . l
>>
>>      +-------------+
>>      |          id |
>>      |-------------|
>>   1. | BTG09A00001 |
>>   2. | BTG10A00001 |
>>   3. | BGM09A00027 |
>>   4. | BGM10A00027 |
>>      +-------------+
>>
>> . moss id, match("([0-9]+)") regex prefix(n_)
>>
>> . moss id, match("([A-Z]+)") regex prefix(a_)
>>
>> . l id *match*
>>
>>      +---------------------------------------------------------+
>>      |          id   n_match1   n_match2   a_match1   a_match2 |
>>      |---------------------------------------------------------|
>>   1. | BTG09A00001         09      00001        BTG          A |
>>   2. | BTG10A00001         10      00001        BTG          A |
>>   3. | BGM09A00027         09      00027        BGM          A |
>>   4. | BGM10A00027         10      00027        BGM          A |
>>      +---------------------------------------------------------+
>>
>> That's split the identifiers into alphabetic and numeric sequences. I
>> took your examples literally in producing these commands. In your
>> case, you don't care about the result of -a_match2-, but I left it in
>> above to show that -moss- can split out two or more components, not
>> just one as is typical of calls to -substr()-.
>>
>> That said, Stata makes it easier to create identifiers that will work
>> well across Stata's commands. Do-it-yourself identifiers can just make
>> tables and graphs unwieldy.
>>
>> For a 2007 review, see
>>
>> http://www.stata-journal.com/article.html?article=dm0034
>>
>> The .pdf for that is accessible to all at
>>
>> http://www.stata-journal.com/sjpdf.html?articlenum=dm0034
>>
>> Nick
>>
>> On Thu, Feb 7, 2013 at 1:37 PM, Michler, Jeffrey D <jmichler@purdue.edu> wrote:
>>
>>> I have a dataset which includes household ID variables in an alphanumeric format. The letters are abbreviations of the village a household comes from.  In addition to being in an alphanumeric format, the HH ID has a year element so that the HH ID for 2010 is slightly different than it was for 2009.  I am looking to convert the alphanumeric HH id into a unique id for constructing a panel. I need to replace the 3 letter village abbreviations with a 3 digit number plus I need to drop the year id.
>>>
>>> An example may clarify. Right now HH IDs look like BTG09A00001, BTG10A00001, BGM09A00027, BGM10A00027.
>>>
>>> I want to replace the village code (BTG, BGM) with a numerical sequence. I also want to drop the year sequence (09, 10) so that the HH ID is consistent for the HH across years, and I want to drop the A, which plays no role in my dataset. Ideally, this would compress the 4 HH IDs I gave as examples into just 2 IDs that would look like 10100001 and 10200027.
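
The target format in the question (10100001, 10200027) can be sketched in Python as below. Two things here are assumptions inferred only from the two example IDs: village numbers start at 101, and they are assigned in order of first appearance. The function name panel_ids and its first_village_number parameter are hypothetical.

```python
def panel_ids(ids, first_village_number=101):
    """Build year-free IDs: a 3-digit village number plus the last 5 digits."""
    village_numbers = {}
    result = []
    for s in ids:
        # Slicing drops the 2-digit year and the "A" in the middle.
        village, household = s[:3], s[-5:]
        if village not in village_numbers:
            # Assumption: number villages 101, 102, ... in order of appearance.
            village_numbers[village] = first_village_number + len(village_numbers)
        result.append(f"{village_numbers[village]}{household}")
    return result


ids = ["BTG09A00001", "BTG10A00001", "BGM09A00027", "BGM10A00027"]
print(panel_ids(ids))  # ['10100001', '10100001', '10200027', '10200027']
```

The results are kept as strings so that the leading zeros of the household part survive. With more than 899 villages the village number would no longer fit in 3 digits.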
