Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Eric Booth <ebooth@ppri.tamu.edu>
To: "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Extract a letter between numbers
Date: Mon, 22 Nov 2010 20:16:58 +0000

<>

On Nov 22, 2010, at 11:59 AM, Nick Cox wrote:

> This complements mine in so far as I hinted that there might be a regex solution. But why assume that typos in the number field are limited to a-zA-Z? They might as well be almost anything!
> Nick

I hadn't seen Nick's posting when I posted, but Nick rightly points out that other characters could be an issue, so his is the better solution for ensuring that only [0-9] makes it into the street number. I was checking only for alpha characters because that is what the OP described in the initial post (I guess I assumed that the online form the OP uses to capture the data has some basic validation that prevents special characters, i.e., non-[0-9a-zA-Z] characters, from being entered).

I was trying to show how to get rid of the "e" in "12e3" using regular expression matching, which I'm not too experienced with but am trying to learn. So, if someone has a solution using regular expressions to solve the OP's issue, I'd be interested in seeing it.

Below is a modification of my original example that gets closer using regex matching and Nick's -charlist- (from SSC). However, it fails when letters or special characters appear in several places within the street number (e.g., "12a3@4c5"), because begin and end capture only the leading and trailing runs of digits. That is, I'm curious about how to extract the "3" and "4" out of the middle of "12a3@4c5" using regular expressions. There are ways to anchor a regular expression to the beginning (^) or end ($) of a string, but how do I get things from the middle (or is there a better approach entirely)?

*******!
clear
inp str40(address)
"12+3 Main St"
"1144Re=^&5 Oak St 77844"
"1a Broadway Ave., College Station, TX."
"11 Test St."
"12a3@4c5 Test St."
end

//install charlist from SSC//
cap which charlist
if _rc ssc install charlist, replace

//use charlist to grab special chars//
charlist address
local x `r(sepchars)'
numlist "0/9"
loc y `c(alpha)' `c(ALPHA)' `r(numlist)'
loc z:list local(x) - local(y)
loc z:subinstr local z " " "", all
**
**
g address2 = regexs(0) if /*
    */ regexm(address, "^[0-9a-zA-Z/`z']*")
g begin = regexs(0) if /*
    */ regexm(address2, "^[0-9]*")
g end = regexs(0) if /*
    */ regexm(address2, "[0-9]*$") /*
    */ & address2!=begin

//put it together//
g newaddress = begin + end
li newaddress address
*******!

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
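
On the question above of pulling digits such as the "3" and "4" out of the middle of "12a3@4c5": one possible approach, offered only as a sketch, is to delete every non-digit rather than try to match the digits. Since -regexr()- replaces only the first match, the sketch loops until no non-digit remains. It assumes the street number is the first space-delimited word of address, which is an assumption made here and not something stated in the thread; the variable name digits is likewise just illustrative.

```stata
clear
inp str40(address)
"12+3 Main St"
"1144Re=^&5 Oak St 77844"
"1a Broadway Ave., College Station, TX."
"11 Test St."
"12a3@4c5 Test St."
end

// assumption: the street number is the first space-delimited word
gen digits = word(address, 1)

// regexr() replaces only the first match, so strip one non-digit
// per pass and repeat until no observation contains a non-digit
local more 1
while `more' {
    replace digits = regexr(digits, "[^0-9]", "")
    count if regexm(digits, "[^0-9]")
    local more = r(N) > 0
}

// e.g. "12a3@4c5" becomes "12345" and "12+3" becomes "123"
li digits address

// in Stata 14 or later the loop could be replaced by a single call:
// gen digits2 = ustrregexra(word(address, 1), "[^0-9]", "")
```

Deleting what is unwanted sidesteps the anchoring problem: the pattern "[^0-9]" needs neither ^ nor $, so it does not matter where in the string the stray characters occur.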

References:

- st: Extract a letter between numbers
  From: Patrick McNamara <patrick.mcnamara@efficiency20.com>
- Re: st: Extract a letter between numbers
  From: Eric Booth <ebooth@ppri.tamu.edu>
- RE: st: Extract a letter between numbers
  From: Nick Cox <n.j.cox@durham.ac.uk>
