Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Patrick McNamara <patrick.mcnamara@efficiency20.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Extract a letter between numbers
Date: Mon, 22 Nov 2010 14:54:36 -0500

That's a good point, as I have a few who put hyphens in place of numbers. To be honest, I'm not sure I understand how to implement the answers that either of you presented; I think I understated how new I am to coding in Stata :)

The more pressing issue for me may be identifying where the actual street names start and end, since they can be letters or numbers. I've split the addresses out using the basic split function, and now have up to 13 variables. The method doesn't have to be perfect (meaning I can lose a few of the crazier ones and it won't be a big deal), but the street address is usually spread across three different variables.

To step back, the ultimate goal here is to match up street addresses people put in on a website with the standardized versions in my database, which have the house number, direction (N, NW, etc.), street name, street suffix (st, st., ave, pl., etc.) as well as city, zip and state (state is all Illinois). Any thoughts on this?

On Mon, Nov 22, 2010 at 12:59 PM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> This complements mine in so far as I hinted that there might be a regex solution. But why assume that typos in the number field are limited to a-zA-Z? They might as well be almost anything!
>
> Nick
> n.j.cox@durham.ac.uk
>
> Eric Booth
>
> Probably need to take a look at regular expression matching.
> Take a look at these links:
>
> http://www.stata.com/support/faqs/data/regex.html
> http://www.stata.com/meeting/wcsug07/medeiros_reg_ex.pdf
>
> Here's a start:
> ********!
> clear
> inp str40(address)
> "12e3 Main St"
> "1144Re5 Oak St 77844"
> "1a Broadway Ave., College Station, TX."
> "11 Test St."
> end
>
> gen address2 = regexs(0) if /*
> */ regexm(address, "^[0-9a-zA-Z]*")
> destring address2, replace force ignore("`c(alpha)'`c(ALPHA)'")
> li
> ********!
>
> On Nov 22, 2010, at 11:07 AM, Patrick McNamara wrote:
>
>> I'm new to Stata coding (been using drop-down menus for a few years),
>> and I'm working on an address parser to pull apart and put back
>> together people's real address apart from the mess they enter online
>> :) Right now I'm trying to figure out a way to take out any letters in
>> between two numbers that people have accidentally typed into their
>> house address field (i.e. for 123 Main St, they typed 12e3 Main St).
>> The letters are not in the same position and there are multiples. I've
>> tried strpos() but it won't allow me to use a range [A-Z] or [0-9].
>> Any help would be greatly appreciated!
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/

--
Patrick McNamara
Manager, Program Logistics
Efficiency 2.0
patrick.mcnamara@efficiency20.com
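
The thread raises the point that stray characters in the house-number field need not be letters (hyphens were mentioned too). One possible way to handle that, offered only as a minimal sketch and not as any poster's method, is to take the first whitespace-delimited word of the address and strip every non-digit from it. The variable names housestr and housenum are hypothetical, the third example address is made up for illustration, and the sketch assumes the house number is always the first word of the address.

```stata
clear
input str40 address
"12e3 Main St"
"1144Re5 Oak St 77844"
"1-3 Elm Ave"
"11 Test St."
end

* assumption: the first word of the address is the house-number field
gen str40 housestr = word(address, 1)

* regexr() replaces only the first match per call, so repeat
* until no observation still contains a non-digit character
local more = 1
while `more' {
    quietly count if regexm(housestr, "[^0-9]")
    local more = r(N) > 0
    quietly replace housestr = regexr(housestr, "[^0-9]", "")
}

* real() returns missing if nothing numeric is left
gen long housenum = real(housestr)
list address housenum
```

Stripping is a guess about what the user meant: "1-3" becomes 13 here, and a genuine unit suffix such as "1a" would lose its letter, so it may be worth flagging changed observations for review instead of overwriting them. In Stata 14 or later, the loop could probably be replaced by a single call to ustrregexra(housestr, "[^0-9]", "").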