Person name extraction                                        [STB-13: dm13]
----------------------

        ^extrname^ varname [if expr] [in range] [, ^all^ ^p^refix^(^newvar1^)^
                ^f^irst^(^newvar2^)^ ^m^iddle^(^newvar3^)^ ^l^ast^(^newvar4^)^
                ^s^uffix^(^newvar5^)^ ^af^fil^(^newvar6^)^ ^o^dd^(^newvar7^)^ ]


Description
-----------

^extrname^ attempts to extract American-style person names into new variables
holding the prefix (title), first, middle, and last names, suffix, and
affiliation.  At least one option must be specified and, generally, you will
want to specify all of them; ^all^ provides a convenient way to do this.


Options
-------

^all^ is equivalent to specifying "^prefix(prefix) first(first) middle(middle)^
    ^last(last) suffix(suffix) affil(affil) odd(odd)^".

^prefix(^newvar1^)^ declares that newvar1 is to contain the prefix (Mr., Dr.,
    etc.) of the extracted name.

^first(^newvar2^)^ declares that newvar2 is to contain the first name or
    initial.

^middle(^newvar3^)^ declares that newvar3 is to contain the middle name or
    initial.

^last(^newvar4^)^ declares that newvar4 is to contain the last name.

^suffix(^newvar5^)^ declares that newvar5 is to contain the suffix (Jr., Sr.,
    III, etc.).


Options, concluded
------------------

^affil(^newvar6^)^ declares that newvar6 is to contain the affiliations (M.D.,
    Esq., Ph.D., etc.).

^odd(^newvar7^)^ declares that newvar7 is to contain nonzero values where
    ^extrname^ believes it had problems extracting the name.  See Codes below
    for a list of the values and their meanings.  IT IS STRONGLY ADVISED THAT
    YOU SPECIFY THIS OPTION IF YOU DO NOT SPECIFY ^all^.


Remarks
-------

A name is defined as consisting of parts called the prefix, first, middle,
last, suffix, and affil.  For instance:

                            |pre- |       |    |       | suf-| af-
name                        |fix  | first | mid| last  | fix | fil  |
---------------------------------------------------------------------
Smith                       |     |       |    | Smith |     |      |
Roger Smith                 |     | Roger |    | Smith |     |      |
Roger A. Smith              |     | Roger | A. | Smith |     |      |
Roger A. Smith, Jr.         |     | Roger | A. | Smith | Jr. |      |
Dr. Roger A. Smith, Jr.     | Dr. | Roger | A. | Smith | Jr. |      |
Dr. Roger A. Smith, Jr., MD | Dr. | Roger | A. | Smith | Jr. | M.D. |

^extrname^ attempts to extract names of the form:

        [first [middle]] last
        last [, first [middle]]


Remarks, continued
------------------

For instance, ^extrname^ would understand any of the following as well as the
above:

        Smith, Roger
        Smith, Roger A.
        Dr. Smith, Roger A.
        Smith, Dr. Roger A.

In general, however, ^extrname^ will not understand:

        last [first [middle]]

(That is, last name first with no comma between the last and the first name;
"Smith Roger" would be interpreted as first name Smith and last name Roger.)
We say "in general" because there are some cases where, even with the omitted
comma, ^extrname^ will be able to determine that the last name came first (as
in "Smith Roger A.", where the hanging middle initial makes clear that Smith
is the last name).


Remarks, continued
------------------

The point of ^extrname^ is to divide names into their components even when the
name is "messy".  For instance, ^extrname^ will properly process:

        input                 resulting last,    first     middle
        --------------------------------------------------------
        Smith R               "Smith",           "R"       ""
        Roger St. Craig       "St. Craig",       "Roger"   ""
        John Mc Call          "McCall",          "John"    ""
        AB Smith              "Smith",           "A."      "B."
        NY YON                "Yon",             "Ny"      ""
        B van Hooser          "Van Hooser",      "B."      ""
        A van der sleuss      "Van Der Sleuss",  "A."      ""
        D de Bolt             "De Bolt",         "D."      ""
        T de la Rosa          "De La Rosa",      "T."      ""

The input names do not have to be of mixed case ("SMITH R" would be okay), but
if they are, the casing information will be exploited.  For example, "MR SMITH"
could be Mr. Smith or M. R. Smith and will be interpreted as the latter by
^extrname^, but "Mr Smith" would be correctly interpreted as Mr. Smith.


Prefix
------

The resulting ^prefix()^ can contain:

        Mr.     Ms.     Miss    Mrs.    Dr.     Prof.   Prof. Dr.   drs.
        Sgt.    Lt.     Lt. Cmdr.       Cmdr.   Cap.    Lt. Col.    Col.
        Gen.


Suffix
------

The resulting ^suffix()^ can contain:

        Jr.     Sr.
        II      III     IV


Affiliation
-----------

The resulting ^affil()^ can contain:

        M.D.    Ph.D.   Esq.


Codes
-----

In addition to extracting the name, ^odd()^ returns a flag indicating whether
^extrname^ thought it had problems.  This flag is only suggestive: an unflagged
name is not necessarily correct (e.g., input "SMITH ROBERT" yields last name
Robert), and a flagged "problem" case may not be a problem at all.  ^odd()^
contains:

         -1   There are multiple problems (listed below).

          0   No problems were flagged, which does not mean that there are no
              problems.

         11   First name may be first initial and middle initial.  The first
              name has two letters and the middle initial is blank.  This was
              interpreted as a first name, however, because it contains vowels
              and, among all the 2-letter names found, the consonant-only
              names were less than 60% of the sample.


Codes, continued
----------------

         21   First and middle initials may need to be combined to form real
              first names.  An apparently 2-letter first name that contained
              vowels was treated as a first and middle initial because, among
              all the 2-letter names found, the consonant-only names were
              more than 59% of the sample.

        101   The first name contains embedded blanks.
        102   The middle name contains embedded blanks.
        103   The last name contains embedded blanks.

        111   The first name contains periods in odd places.
        112   The middle name contains periods in odd places.
        113   The last name contains periods.


Examples
--------

        . ^extrname name, all^

creates new variables ^prefix^, ^first^, ^middle^, ^last^, ^suffix^, ^affil^, and ^odd^.

        . ^extrname name, first(fname) middle(mname) last(lname) odd(prob)^

creates new variables ^fname^, ^mname^, ^lname^, and ^prob^.  ^extrname^ goes through
the same logic as it normally would, but at completion, the prefix, suffix,
and affiliation are discarded.


Execution speed
---------------

^extrname^ is slow.
The following timings were recorded on a SPARCstation IPC:

                    n       secs    predicted
                -------------------------------
                   91      24.31      24.20
                  182      28.03      27.92
                  364      35.28      35.37
                  728      50.20      50.25
                1,456      79.95      80.03
                2,912     139.70     139.58

        Predicted based on:  secs = 20.48 + .0409*n

Thus, on the IPC, after the 20.48-second setup time, ^extrname^ processes
approximately 1/.0409 = 24.45 names per second.


Request for improvements
------------------------

During development, ^extrname^ was tested on a (messy) data set of 61,080
different names.  Nevertheless, there are probably many names that ^extrname^
will process incorrectly.  This can be because it is logically impossible,
based on the input, to know what is correct (e.g., "MR ROGERS") or because
^extrname^ is not sufficiently sophisticated.

If you encounter a name that ^extrname^ should be able to extract correctly
but does not (other than a last-name-first name with a missing comma), please
fax it to:

        William Gould
        Computing Resource Center
        310-393-7551


Also see
--------

    STB:  dm13 (STB-13)
On-line:  ^help^ for ^replstr^