Re: st: problem with regexm leading to "regexp: unterminated ()" error for all observations

From	Jamie Fagg <[email protected]>
To	[email protected]
Subject	Re: st: problem with regexm leading to "regexp: unterminated ()" error for all observations
Date	Mon, 06 Jun 2011 15:11:55 +0100

Dear Phil Schumm, Nick Cox and Steve Samuels,

Many thanks for all your help on this.

Steve and Nick - thanks for the initial advice on what was causing the error.

Phil - I had just finished breaking it down when I saw your message. It is a much more elegant solution than the one I came up with after Nick recommended breaking it down, so thanks.


Best wishes,

Jamie

On 03/06/2011 18:10, Phil Schumm wrote:

On Jun 3, 2011, at 7:35 AM, Jamie Fagg wrote:

I've a problem with the function -regexm-. I get the following message:

regexp: unterminated ()


<snip>

#delimit ;

//regular expression to define whether postcode is syntactically correct

ge postcodevalid = 1 if regexm(postcode,"(GIR0AA)|(((A[BL]|B[ABDHLNRSTX]
?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]
|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]
|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]
|((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9]
[0-9])|EC[1-9][0-9]) [0-9][ABD-HJLNP-UW-Z]{2})")==1;

I'm not sure why Stata chokes on this, though I would suspect it might have something to do with the length. As Nick and Steven have already noted, the repeat qualifier {n} is not supported by Stata's regular expression syntax, so you'll need to replace



    [ABD-HJLNP-UW-Z]{2}


with the equivalent


    [ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]
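
A quick way to check that substitution is to try the expanded form on a couple of literal strings. This is only a minimal sketch: the test strings "0AA" and "0A1" are made up for illustration and do not come from the original data.

```stata
* Writing the character class twice stands in for the unsupported {2} qualifier.
* "0AA" is a digit followed by two letters from the class, so it should match.
display regexm("0AA", "[0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]")

* "0A1" ends in a digit, which is not in the class, so it should not match.
display regexm("0A1", "[0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]")
```

The first line should display 1 and the second 0.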

Now, Nick suggested breaking the expression up, so let's do that. Your expression is equal to



    (p1)|(((p2a1a|p2a1b|p2a1c)p2a1d|p2a2|p2a3|p2a4)p2b)


where the individual parts (as assigned to Stata macros) are


    loc p1    "GIR 0AA"

    loc p2a1d "[1-9]?[0-9]"
    loc p2a2  "((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]"
    loc p2a3  "(SW|W)([2-9]|[1-9][0-9])"
    loc p2a4  "EC[1-9][0-9]"
    loc p2b   " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"


This may then be easily broken up as follows:


    gen byte valid = regexm(postcode,"`p1'")
    replace valid = 1 if regexm(postcode,"`p2a1a'`p2a1d'`p2b'")
    replace valid = 1 if regexm(postcode,"`p2a1b'`p2a1d'`p2b'")
    replace valid = 1 if regexm(postcode,"`p2a1c'`p2a1d'`p2b'")
    replace valid = 1 if regexm(postcode,"(`p2a2'|`p2a3'|`p2a4')`p2b'")
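
The piecewise approach can be tried on a toy dataset. The sketch below is illustrative only: the four postcodes are invented test values, and it uses just the parts whose definitions appear above (p1, p2a2, p2a3, p2a4 and p2b), so it leaves out the p2a1a/p2a1b/p2a1c branches.

```stata
* Toy data: these postcodes are illustrative, not from the original file.
clear
input str10 postcode
"GIR 0AA"
"EC1A 1BB"
"SW19 5AE"
"ZZ99 9ZZ"
end

* Only the pattern parts defined in the message are reused here.
local p1   "GIR 0AA"
local p2a2 "((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]"
local p2a3 "(SW|W)([2-9]|[1-9][0-9])"
local p2a4 "EC[1-9][0-9]"
local p2b  " [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]"

* Start from the special case, then switch on any row matching a later piece.
gen byte valid = regexm(postcode, "`p1'")
replace valid = 1 if regexm(postcode, "(`p2a2'|`p2a3'|`p2a4')`p2b'")

list postcode valid
```

With these inputs, valid should be 1 for the first three rows and 0 for "ZZ99 9ZZ". Because regexm() looks for a match anywhere in the string, a stricter check could wrap each pattern as "^(...)$"; that is an addition here, not part of the suggestion above.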


-- Phil

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


--
MRC Centre of Epidemiology for Child Health
UCL Institute of Child Health
30 Guilford Street
London, WC1N 1EH

Tel - 0207 905 2320


