



RE: st: How to get rid of leading and trailing letters and symbols?


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: How to get rid of leading and trailing letters and symbols?
Date   Wed, 26 Oct 2011 13:38:44 +0100

I agree with Uli in recommending regular expression machinery. Given these data, 

. l

     +-------------------------------------+
     |                             example |
     |-------------------------------------|
  1. |                /profile/?id=9596986 |
  2. | /profile/?id=9591886&reftype=detail |
     +-------------------------------------+

-moss- (SSC) is, as mentioned very recently on this list, a wrapper for Stata's regex functions. It can give you more output than you need, but you just discard what you don't want. This finds numbers based on digits 0-9:  

. moss example, match(([0-9]+)) regex

. l

     +----------------------------------------------------------------+
     |                             example   _count   _match1   _pos1 |
     |----------------------------------------------------------------|
  1. |                /profile/?id=9596986        1   9596986      14 |
  2. | /profile/?id=9591886&reftype=detail        1   9591886      14 |
     +----------------------------------------------------------------+

There are also all sorts of ways of subdividing according to position, with or without regular expressions. One criterion for the number being at the end is that the last character of the string is numeric, which can be tested with

. gen atend = !missing(real(substr(example,-1,1)))

. l

     +-----------------------------------------------------------------------------------+
     |                             example   number~d   _count   _match1   _pos1   atend |
     |-----------------------------------------------------------------------------------|
  1. |                /profile/?id=9596986    9596986        1   9596986      14       1 |
  2. | /profile/?id=9591886&reftype=detail                   1   9591886      14       0 |
     +-----------------------------------------------------------------------------------+
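
As an alternative sketch for the same flag, the pattern can be anchored at the end of the string with $ in regexm(). The names atend2 and id below are hypothetical, and the second command assumes that _match1 as created by -moss- is a string variable:

. * 1 if the last character is a digit, 0 otherwise
. gen byte atend2 = regexm(example, "[0-9]$")

. * numeric copy of the matched id; long keeps integers of this size exact
. gen long id = real(_match1)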


Nick 
n.j.cox@durham.ac.uk 

Ulrich Kohler wrote:

You should get that using regular expressions (see help regexp). I don't
use regular expressions very often in Stata, but in my favourite editor,
Emacs, the regular expression to find a number of arbitrary length
would be 

\([0-9]+\)

which would store the number in \1. The Stata regular expression should
work very similarly. 
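
In Stata the grouping parentheses are not escaped, and the matched group is retrieved with regexs(1) once regexm() has found a match. A minimal sketch, assuming the string variable is named example as elsewhere in this thread (idnum is a hypothetical name):

. * first run of digits; left empty ("") where there is no match
. gen str20 idnum = regexs(1) if regexm(example, "([0-9]+)")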


On Wednesday, 26.10.2011, at 10:37 +0100, Ekaterina Hertog wrote:

> I have got a dataset where the id variable is a part of a web-link. It 
> can contain letters followed by the id number: (e.g. 
> /profile/?id=9596986) or it can contain the id number in the middle 
> (e.g. /profile/?id=9591886&reftype=detail). I need to create a variable 
> which will only contain the number that is part of the id variable. I 
> also need to be able to distinguish between the cases where the number 
> is trailing vs. cases where it is in the middle. I looked at the advice 
> available on removing leading or trailing 0s in Stata 11 
> (http://www.stata.com/support/faqs/data/leadingzeros.html), but in my 
> case I cannot actually specify the letters and symbols that lead or 
> trail so I am stuck. I use Stata 11.
