{smcl}
{center:{bf:Fun and fluency with functions}}

{center:{bf:Nicholas J. Cox, Durham University}}
{center:{bf:n.j.cox@durham.ac.uk}}

1 What are functions?
Functions, strict sense, take zero or more arguments and return single
results. They are documented in {cmd:[D] functions} and at {help functions}.

Examples are {cmd:runiform()}, {cmd:ln(42)}, {cmd:strpos("Stata", "S")}.

The {cmd:()} syntax is generic. Think, if you like, of an open mouth
expecting to be fed.

There are functions you know you need and functions you don't know you
want. The aim of this talk is to tell you more about the latter than the
former. It includes only some highlights, not the complete tour.

If you need date, trigonometric, hyperbolic, gamma, density functions, and
so forth, it is mostly just a matter of finding out the syntax.

Henceforth, {it:SJ} = Stata Journal, {it:STB} = Stata Technical Bulletin,
{it:SSC} = SSC archive.

A first overview reference is {it:SJ} 2(4):411{c -}427 (2002).

2 A request for new users
Note for new Stata users: Stata commands are not considered to be
functions.

{cmd:list} and {cmd:regress} are to be called Stata commands, not
functions, regardless of terminology elsewhere.

That's a polite request, not a command.

3 The big picture
Functions can be more useful than you think. Users often seek commands or
imagine they need programs when a few functions will crack the problem.

Arguments and results of functions can be variables, calculated
observation by observation. This is sometimes overlooked: people plan
loops over observations when {cmd:generate} or {cmd:replace} will do that
automatically.

Stata is not knee-deep in functions. The intent is rather to provide a
core of really useful functions. Often you need to combine functions to
get what you want. This simplicity is really a feature!

Stata functions, strict sense, cannot be written by users. You can write
functions in Mata. And you can write {help egen} functions.

Functions cannot be called by themselves.
You can use commands to assign or display their results.

4 Fuzzy boundaries
The definition of functions is clear, but some extensions seem in order
for this talk.

{help egen} will get some attention. {cmd:egen} has its own sense of
functions.

We will also touch on

    extended macro functions
    c-class results
    {cmd:_n} and {cmd:_N} as "pseudo-functions"

Mata functions and e-class and r-class results will not be touched upon,
except in passing. Covering those big topics is for other talks or
courses.

5 _variables
Let us deal briskly with {cmd:_n} and {cmd:_N}.

{cmd:_n} is the current observation number.

{cmd:_N} is the total number of observations in the dataset. So, don't use
any command to try to find this out.

Under {cmd:by:} {cmd:_n} and {cmd:_N} are corresponding counts within
groups defined by {cmd:by:}. This is often a quick and easy way to get
counts into variables.

    {cmd:. bysort id : gen npanel = _N}

If {help by:by:} is unfamiliar, there is a tutorial at
{it:SJ} 2(1):86{c -}102 (2002).

Don't forget {cmd:_pi} for 3.14159...

6 display and graph are your friends
If you are unclear what a function does, use {help display} on a few
examples.

    {cmd:. di log(10)}
    {cmd:2.3025851}

    {cmd:. di sqr(10)}
    {cmd:Unknown function sqr()}
    {cmd:r(133);}

    {cmd:. di sqrt(10)}
    {cmd:3.1622777}

{help twoway function} shows graphs of functions:
{it:SJ} 4(4):488{c -}489 (2004).

    {cmd:. twoway function clip(x,0.2,0.8)}    (ramp)
    {cmd:. twoway function chop(x,1), ra(0 10)}    (staircase)
    {cmd:. twoway function sin(x)/x , ra(0 `=20 * _pi')}    (damped sine)

7 Strategies and tactics
{it:Divide and conquer.} Often you need one function used repeatedly, or
two or more functions working together.

Hypergeometric probabilities:

    {cmd:comb(r, k) * comb(n-r, m-k) / comb(n, m)}

This is particularly common with string problems.

{it:Nest function calls.} Feed the results of one function to another,
e.g. {cmd:exp(rnormal())}. So, cut down on middle macros or middle
variables.
Beta function:

    {cmd:exp(lngamma(a) + lngamma(b) - lngamma(a + b))}

{it:Help yourself.} Spaces after commas and around operators often make
your code more readable.

{cmd:help fname()} gets you straight to help for {cmd:fname()}, when you
know it.

8 Numbers trapped in strings, and vice versa
{help real()} and {help string()} are the workhorse functions when you
know you want to change from one form to the other.

Always remember that {cmd:string()} can take a format argument too.

Some users employ {help destring} or {help tostring} with the {cmd:force}
option to change single variables. That is {err:backwards}:
{cmd:destring} and {cmd:tostring} are just elaborate wrappers for those
functions. If you know you want to force the conversion, you should call
the appropriate function directly.

9 Depending on conditions
{cmd:cond(}{it:a}{cmd:,} {it:b}{cmd:,} {it:c}{cmd:)} returns {it:b} if {it:a} is true (non-zero)
and {it:c} if {it:a} is false (zero).

Results can be either numeric or string.

See {it:SJ} 5(3):413{c -}420 (2005) for a tutorial.

Is year a leap year?

    {cmd:cond(mod(year, 400) == 0, 1,}
    {cmd:cond(mod(year, 100) == 0, 0,}
    {cmd:cond(mod(year, 4) == 0, 1,}
    {cmd:0)))}

If it's at all complicated, get into a text editor that checks matching
parentheses.

10 in list or in range?
{it:SJ} 6(4):593{c -}595 (2006)

{cmd:inlist(}{it:z}{cmd:,} {it:a}{cmd:,} {it:b}{cmd:,}...{cmd:)} returns 0 or 1

    1 if {it:z} {cmd:==} {it:a} {cmd:|} {it:z} {cmd:==} {it:b} {cmd:|} ...
    0 otherwise

Note: limits on number of arguments.

    {cmd:inlist(rep78, 3, 4, 5)}
    {cmd:inlist(1, a, b, c, d, e)}    !!!

{cmd:inrange(}{it:z}{cmd:,} {it:a}{cmd:,} {it:b}{cmd:)} returns 0 or 1

    1 if {it:a} {cmd:<=} {it:z} {cmd:<=} {it:b}
    0 otherwise
    special rules for missing arguments

    {cmd:inrange(}{it:integer}{cmd:, 0, 9)}
    {cmd:inrange(}{it:char}{cmd:, "a", "z")}
    {cmd:inrange(}{it:char}{cmd:, "A", "Z")}

Note: {cmd:if !inlist()} and {cmd:if !inrange()} (think: if not in list,
etc.)

11 Anything missing?
With a single numeric variable, {cmd:if x < .} is the cleanest way to
exclude missings: see P.A. Lachenbruch {it:STB} 9:9 (1992).

{cmd:if x != .} is fine so long as {cmd:.a} ... {cmd:.z} are not around.

For more complicated problems, turn to {help missing()}. See B. Rising
{it:SJ} 10(2):303{c -}304 (2010).

{cmd:missing(}{it:x1}{cmd:,} {it:x2}{cmd:,} ..., {it:xn}{cmd:)} returns 1 if any of its arguments is
missing and 0 otherwise. It applies to numeric and string arguments.

{cmd:!missing()} reverses results.

A statistical short-cut is to use {cmd:regress} and then {cmd:e(sample)}.

12 The search for the absolute
{cmd:abs(}{it:x}{cmd:)} is often overlooked when |{it:x}| is wanted.

People write

    {cmd:if t > 2 | t < -2}

when they could write more cleanly, and with less risk of error,

    {cmd:if abs(t) > 2}

{cmd:sign(}{it:x}{cmd:)} returns -1, 0, 1, missing for negative, zero, positive,
missing arguments.

{cmd:logit(}{it:x}{cmd:)} and {cmd:invlogit(}{it:x}{cmd:)} are often overlooked too.

    logit({it:x}) = ln[{it:x} / (1 - {it:x})]
    invlogit({it:x}) = exp({it:x}) / [1 + exp({it:x})]

13 Rounding up and down
{cmd:ceil(}{it:x}{cmd:)} rounds up always. Think "ceiling". {cmd:ceil(1.2)} is 2.

{cmd:ceil(10 * runiform())} is a neat way to get uniformly distributed
integers 1(1)10. Compare {cmd:1 + int(10 * runiform())}. Apply "for any
value of 10".

{cmd:floor(}{it:x}{cmd:)} rounds down always. {cmd:floor(1.8)} is 1.

{cmd:if x == floor(x)} is one of several tests of whether {cmd:x} is
integer.

{cmd:int(}{it:x}{cmd:)} rounds towards zero always, i.e. up for negative numbers.
{cmd:int(1.2)} is 1. {cmd:int(-1.2)} is -1.

{cmd:trunc(}{it:x}{cmd:)} is a synonym for {cmd:int(}{it:x}{cmd:)}.

{cmd:round(}{it:x}{cmd:)} rounds to the nearest integer.

{cmd:round(}{it:x}{cmd:,} {it:y}{cmd:)} rounds to the nearest multiple of {it:y}.

{err:Warning} {cmd:round(1.23, 0.1)} may look like 1.2, but it cannot be
1.2 exactly.
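
A quick check along these lines (a sketch only; the display formats are
illustrative choices, not from the original talk) separates rounding a
value from formatting its display:

    {cmd:. display %21.18f round(1.23, 0.1)}
    {cmd:. display %3.1f 1.23}

The first should show nonzero digits far to the right, because the result
is held as the nearest binary approximation to 1.2. The second shows 1.2
without changing any value.
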
If you want to show so many decimal places, use the right format, not
{cmd:round()}.

To find more: {stata search precision}. See also
{it:SJ} 3(4):446{c -}447 (2003) or W. Gould (passim) on precision.

14 Bin it!
{cmd:autocode()}, {cmd:irecode()}, {cmd:recode()} are functions for
binning a range into contiguous intervals. See also {cmd:egen}'s
{cmd:cut()} function.

Some people use {cmd:recode} for binning continuous variables.

Over-arching caution: you should check boundary rules carefully.

{cmd:2 * floor(myvar/2)} illustrates a simpler device. This gives bins of
width 2 such as [0,2), i.e. lower limits are inclusive.

{cmd:2 * ceil(myvar/2)} is an alternative. This gives bins of width 2
such as (0,2], i.e. upper limits are inclusive.

15 Going to extremes
    {cmd:max(}{it:x1}{cmd:,} {it:x2}{cmd:,} ..., {it:xn}{cmd:)}
    {cmd:min(}{it:x1}{cmd:,} {it:x2}{cmd:,} ..., {it:xn}{cmd:)}

A key detail is that missings are ignored unless all arguments are
missing. This is usually a feature.

There are work-arounds when you want any missing to trump non-missings.
One is

    {cmd:cond(missing(a, b), ., max(a, b))}

If you have lots of arguments, use a loop to get the m??imum.

If you have a variable, use {cmd:summarize, meanonly} to get the m??imum.

What about records, such as the maximum so far, and within panels too?

    {cmd:. gen record = .}
    {cmd:. bysort id (time) : replace record = max(record[_n-1], y)}

16 Any or all?
If {it:arg} is 1 (true) or 0 (false), then

    {cmd:min(}{it:arg}{cmd:) == 0}    some false
    {cmd:min(}{it:arg}{cmd:) == 1}    all true
    {cmd:max(}{it:arg}{cmd:) == 1}    some true
    {cmd:max(}{it:arg}{cmd:) == 0}    all false

which are often useful.

Using {cmd:egen}'s {cmd:min()} or {cmd:max()}, typically in conjunction
with {cmd:by:} or {cmd:by()}, summarises this by panels or other groups
(e.g. families).
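
As a sketch of that idea, assume hypothetical variables {cmd:family} (a
group identifier) and {cmd:female} (a 0/1 indicator with no missing
values); neither name comes from the original talk:

    {cmd:. egen anyfemale = max(female), by(family)}
    {cmd:. egen allfemale = min(female), by(family)}

{cmd:anyfemale} is 1 in families with at least one female member, and
{cmd:allfemale} is 1 only in families whose members are all female.
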
See {browse "http://www.stata.com/support/faqs/data/anyall.html":http://www.stata.com/support/faqs/data/anyall.html}

17 Modulus the maestro
{cmd:mod(}{it:x}{cmd:,} {it:y}{cmd:)} is a very versatile function.

By a standard abuse of terminology, {cmd:mod()} returns the remainder
(not the modulus as mathematicians know it). Such names have been common
in programming since the 1950s.

{cmd:mod()} received its own small song of praise in
{it:SJ} 7(1):143{c -}145 (2007).

That article omitted rotations, e.g.

    {cmd:mod(angle + 90, 360)}

Is the observation number odd?

    {cmd:if mod(_n, 2) == 1}
    or
    {cmd:if mod(_n, 2)}

18 Some sums
Three key points:

    {cmd:sum(}{it:x}{cmd:)} returns cumulative sums.

    {cmd:sum(}{it:x}{cmd:)} ignores missings (and returns 0 if all missing).

    {cmd:egen}'s {cmd:total()} function is more direct {c -} but less efficient {c -}
    to put group totals in a variable.

Make sure you know this two-step:

    {cmd:. bysort id : gen mysum = sum(myvar)}
    {cmd:. by id : replace mysum = mysum[_N]}

After the first command, the last observation in each group contains its
group total. This is what {cmd:egen}'s {cmd:total()} does too.

How many distinct values of {cmd:x} have been seen so far within panels?

    {cmd:. bysort id x (time) : gen distinct = _n == 1}
    {cmd:. bysort id (time) : replace distinct = sum(distinct)}

19 Stringing along
Many data management questions {c -} including clean-ups of data {c -} need
string functions, often in combination.

Note a key principle at the outset: Stata's string operations are utterly
literal.

Moral: many tests should be phrased in terms of one case and/or
consistent leading, trailing and internal spaces. You can do that in one
with say

    {cmd:lower(trim(itrim(myvar)))}

Remember: just as in algebra, every left parenthesis {cmd:(} is a promise
to write down its match {cmd:)} sooner or later.

(Notation is a branch of etiquette, not of logic.)
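
To see the clean-up combination {cmd:lower(trim(itrim()))} at work, try
it with {cmd:display} on a made-up string (the test string here is purely
illustrative):

    {cmd:. display lower(trim(itrim("  New    York  ")))}
    {cmd:new york}

Internal runs of spaces collapse to one space, leading and trailing
spaces go, and everything ends up in lower case.
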
My top four string functions {c -} which everyone should know {c -} are
{cmd:strpos()}, {cmd:substr()}, {cmd:subinstr()} and {cmd:length()}.

20 The best string quartet this side of Vienna: violins
{cmd:strpos(}{it:s1}{cmd:,} {it:s2}{cmd:)} tells you where {it:s2} occurs in {it:s1}, and 0 if it
does not occur. Think {cmd:str}ing {cmd:pos}ition.

Corollary: {cmd:if strpos(}{it:s1}{cmd:,} {it:s2}{cmd:) > 0} or {cmd:if strpos(}{it:s1}{cmd:,} {it:s2}{cmd:)}
is a true-or-false test of whether {it:s1} contains {it:s2}.

    {cmd:strpos("this", "is")}    is 3
    {cmd:strpos("this", "it")}    is 0
    {cmd:strpos("haystack", "needle")}    is 0

In older versions of Stata, this function was called {cmd:index()}.

Commonly, {it:s1} is a string variable name. Less commonly, {it:s2} is a
string variable name.

{cmd:substr(}{it:s}{cmd:,} {it:pos}{cmd:,} {it:len}{cmd:)} gives the substring of {it:s} starting at
position {it:pos} and of length {it:len}.

{it:pos} can be < 0 indicating position counted backwards from end.
{it:len} can be . indicating everything else.

    {cmd:substr("abcdef", 2, 3)}    is "bcd"
    {cmd:substr("abcdef", -3, 2)}    is "de"
    {cmd:substr("abcdef", 2, .)}    is "bcdef"

21 The best string quartet this side of Vienna: viola and cello
{cmd:subinstr(}{it:s1}{cmd:,} {it:s2}{cmd:,} {it:s3}{cmd:,} {it:n}{cmd:)} changes the first {it:n} occurrences in
{it:s1} of {it:s2} to {it:s3}.

Note: if {it:s3} is {cmd:""}, occurrences are deleted. This is the zap
function!

    {cmd:subinstr("this is this", "is", "X", 1)}    is "thX is this"
    {cmd:subinstr("this is this", "is", "X", 2)}    is "thX X this"
    {cmd:subinstr("this is this", "is", "X", .)}    is "thX X thX"

{cmd:length(}{it:s}{cmd:)} returns the length of {it:s}.

Remember: {it:s} can be a string variable name. You get the length of its
contents.

    {cmd:length("ab")}    is 2
    {cmd:length(myvar)}    could vary by observation

22 More advanced string theory
{cmd:char(}{it:n}{cmd:)} returns ASCII character {it:n}, so is one way of displaying
otherwise unprintable characters.
See {it:SJ} 4(1):95{c -}96 (2004). For a convenient display, download
{cmd:asciiplot} from {help ssc:SSC}.

Stata has a suite of regular expression functions: {cmd:regexm()},
{cmd:regexr()}, {cmd:regexs()}, {cmd:strmatch()}. They are not very well
documented, but start with
{browse "http://www.stata.com/support/faqs/data-management/regular-expressions/":http://www.stata.com/support/faqs/data-management/regular-expressions/}

However, people often dive into regex when the quartet would suffice.

For one wrapper for regex to extract multiple occurrences of substrings,
download {cmd:moss} from SSC.

To work backwards, e.g. to change the last occurrence of a substring,
consider reversing and finally reversing back with

    {cmd:reverse(}...{cmd:reverse(}...{cmd:)}...{cmd:)}

23 Counting occurrences of substrings
Let's switch to a problem rather than a function: counting (disjoint)
occurrences of substrings ({it:SJ} 11(2):318{c -}320).

For example: count how many {cmd:"X"}s there are in
{cmd:"OOOOXXXOOXXX"}. Many users store short histories (244 periods or
less) in string variables.

Here is one way:

    {cmd:length(myvar) - length(subinstr(myvar, "X", "", .))}

Take this in steps:

    Get the length of {cmd:myvar}.

    Get the length of {cmd:myvar} with all {cmd:"X"} zapped. (You don't have to
    do that, just find out what the length would be.)

    The difference is what you want.

This generalises to longer substrings: just remember also to divide by
length of substring, as you want to count occurrences.

Generalisation to regular expressions is trickier: see {cmd:moss} (SSC).

Remember that some operations on substrings are easier after
{help split}.

24 Removing the first word
There's more than one way to do it.

Words are separated by spaces. So, look for the first space.

    {cmd:trim(substr(myvar, strpos(myvar, " "), .))}

This works, perhaps fortuitously, if there is no space present, as
{cmd:strpos()} then returns 0 and {cmd:substr()} then returns {cmd:""}.

Use a dedicated function.
{cmd:word()} selects individual words.

    {cmd:trim(subinstr(myvar, word(myvar, 1), "", 1))}

In both cases, we applied {cmd:trim()} last.

Also, {cmd:egen}'s {cmd:ends()} function can do this with its {cmd:tail}
option.

25 Cleaning up species names (binominals)
The proper form is that genus is capitalised, but not species:
{it:Homo sapiens}, {it:Homo economicus}, {it:Troglodytes troglodytes}.

    {cmd:. gen species2 = upper(substr(species, 1, 1)) + lower(substr(species, 2, .))}

The function {cmd:proper()} capitalises each word, not what we want.

Suppose we want the first two words only, ignoring detail like
"(Linnaeus, 1758)".

    {cmd:. replace species2 = word(species2, 1) + " " + word(species2, 2)}

26 Filler apps
Stata lacks a function quite like rep("X", 80) to replicate strings.

    {cmd:local text : di _dup(80) "X"}
    {cmd:mata : st_local("text", 80 * "X")}

are two ways to do it.

With a supply of filler, you can add as much as you want with

    {cmd:substr("`text'", 1, len)}

Finally, you could always type it out yourself.

27 Avoid conflicts
{cmd:scalar(x)} and {cmd:matrix(X)} insist that you want the scalar named
{cmd:x} and the matrix named {cmd:X}, and not any variable with the same
name (or the same unambiguous abbreviation).

Variables, scalars and matrices share the same namespace, and the
variable interpretation always trumps the others.

Using {help tempname}s helps avoid the problem in another way.

See G.I. Kolev {it:SJ} 6(2):279{c -}280 (2006).

28 c-class citizens
There are many more than this list. See help {help creturn}.

    {cmd:c(current_date)}    e.g. "13 Apr 2011"
    {cmd:c(current_time)}    e.g. "09:47:13"
    {cmd:c(stata_version)}    version of Stata
    {cmd:c(version)}    version set by {help version} command
    {cmd:c(N)}    number of observations in dataset
    {cmd:c(k)}    number of variables in dataset
    {cmd:c(changed)}    0 if dataset not changed since last saved, 1 otherwise
    {cmd:c(seed)}    current set seed setting
    {cmd:c(pi)}    {cmd:_pi}
    {cmd:c(alpha)}    a b c ... z
    {cmd:c(ALPHA)}    A B C ... Z
    {cmd:c(Mons)}    Jan Feb Mar ... Dec
    {cmd:c(Months)}    January February March ... December
    {cmd:c(Wdays)}    Sun Mon Tue Wed Thu Fri Sat
    {cmd:c(Weekdays)}    Sunday Monday Tuesday Wednesday Thursday Friday Saturday

Results are often best invoked as {cmd:`c(name)'} rather than
{cmd:c(name)}. See also {it:SJ} 4(2):223 (2004).

29 Extended macro functions
These are documented at help {help extended_fcn} (if you can't remember
that, start at help {help macro}). They are mainly, but not exclusively,
useful to programmers.

They cover, among many other tasks,

    automating look-up of variable types, formats, variable and value
    labels, matrix stuff, constraints, etc.;

    displaying manipulations on the fly; and

    macro manipulation that would otherwise be difficult, say because of
    length limits.

    {cmd:. local type : type myvar}
    {cmd:. local format : format myvar}
    {cmd:. local label : variable label myvar}
    {cmd:. local text : display %3.1f r(mean)}
    (as for sensible rounding in graph annotation)

30 egen highlights
{help egen} is an unusual command. It is a wrapper for calling its own
"functions", but only one at a time. Also, {cmd:egen} functions cannot be
used outside {cmd:egen}. {cmd:egen} creates new variables (and nothing
else).

More positively, {cmd:egen} is a convenient ragbag for functions of less
importance, and it provides template code that users can imitate.

    {cmd:egen} [{it:type}] {it:newvar} = {it:fcn}{cmd:(}{it:arguments}{cmd:)} [{it:if}] [{it:in}] [{cmd:,} {it:options}]

{cmd:by:} is allowed with some {cmd:egen} functions (or a {cmd:by()}
option, currently undocumented).

The arguments can often be expressions, a point often overlooked.
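
For instance, a logical expression can be fed straight to {cmd:total()}
to count observations meeting a condition within groups. This is a sketch
assuming hypothetical variables {cmd:id} and {cmd:y}, with the new
variable name {cmd:npositive} made up for illustration:

    {cmd:. egen npositive = total(y > 0 & !missing(y)), by(id)}

No indicator variable needs to be created first. The {cmd:!missing(y)}
term is there because numeric missings count as larger than any number.
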
    {cmd:count(}{it:exp}{cmd:)}    counts non-missings
    {cmd:cut(}{it:varname}{cmd:)}    bins variables
    {cmd:group(}{it:varlist}{cmd:)}    a valuable workhorse for assigning identifiers 1 up
    {cmd:seq()}    integer sequences
    {cmd:tag(}{it:varlist}{cmd:)}    tags just one value in a homogeneous group

    {cmd:max(}{it:exp}{cmd:)}    key summary statistics
    {cmd:mean(}{it:exp}{cmd:)}
    {cmd:median(}{it:exp}{cmd:)}
    {cmd:min(}{it:exp}{cmd:)}
    {cmd:pctile(}{it:exp}{cmd:)}
    {cmd:rank(}{it:exp}{cmd:)}
    {cmd:sd(}{it:exp}{cmd:)}
    {cmd:total(}{it:exp}{cmd:)}

    {cmd:row}*{cmd:(}{it:varlist}{cmd:)}    various row (across variables, within observation) operations

31 Properties of the other members of any group
A basic recipe is

    {cmd:. egen total = total(myvar), by(id)}
    {cmd:. egen count = count(myvar), by(id)}
    {cmd:. gen meanothers =}
    {cmd:(total - cond(missing(myvar), 0, myvar)) / (count - !missing(myvar))}

See {browse "http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/":http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/}

32 Counting distinct values
    {cmd:. egen tag = tag(id myvar)}
    {cmd:. egen total = total(tag), by(id)}

Some people say "unique" when they mean "distinct". See also
{it:SJ} 8(4):557{c -}568 (2008).

33 User-written egen functions
{cmd:. findit egen} points to locations.

{cmd:egenmore} (SSC) is the largest single collection.

34 Exercise
1. This works in Stata:

    {cmd:. egen gmean = mean(ln(y)), by(id)}
    {cmd:. replace gmean = exp(gmean)}

2. This doesn't work in Stata:

    {cmd:. egen gmean = exp(mean(ln(y))), by(id)}

3. This does work in Mata:

    {cmd:exp(mean(ln(y)))}

Explain. Is Mata easier to learn than Stata?

35 Acknowledgments
Helpful comments from

    Stephen Jenkins
    Roger Newson