Fun and fluency with functions



  Nicholas J. Cox, Durham University



  1 What are functions?



  Functions, strict sense, take zero or more arguments and return single
  results. They are documented in [D] functions and at help functions.



  Examples are runiform(), ln(42), strpos("Stata", "S").



  The () syntax is generic. Think if you like of an open mouth expecting
  to be fed.



  There are functions you know you need and functions you don't know you
  want. The aim of this talk is to tell you more about the latter than the
  former. It includes only some highlights, not the complete tour.



  If you need date, trigonometric, hyperbolic, gamma, density functions,
  and so forth, it is mostly just a matter of finding out the syntax.



  Henceforth, SJ = Stata Journal, STB = Stata Technical Bulletin,
  SSC = SSC archive.



  A first overview reference is SJ 2(4):411-427 (2002).



  2 A request for new users



  Note for new Stata users:



  Stata commands are not considered to be functions.



  list and regress are to be called Stata commands, not functions,
  regardless of terminology elsewhere.



  That's a polite request, not a command.



  3 The big picture



  Functions can be more useful than you think. Users often seek commands
  or imagine they need programs when a few functions will crack the
  problem.



  Arguments and results of functions can be variables, calculated
  observation by observation.



    This is sometimes overlooked: people plan loops over observations
    when generate or replace will do that automatically.
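
    A minimal sketch, assuming the auto data shipped with Stata; the names
    lprice and lprice2 are just for illustration. One generate statement
    applies ln() to every observation:

    . sysuse auto, clear
    . gen lprice = ln(price)

    The explicit loop that people sometimes plan gives the same values with
    more typing and more time:

    . gen lprice2 = .
    . forvalues i = 1/`=_N' {
          quietly replace lprice2 = ln(price[`i']) in `i'
      }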



  Stata is not knee-deep in functions. The intent is rather to provide a
  core of really useful functions. Often you need to combine functions to
  get what you want. This simplicity is really a feature!



  Stata functions, strict sense, cannot be written by users.
  You can write functions in Mata.
  And you can write egen functions.



  Functions cannot be called by themselves.
  You can use commands to assign or display their results.
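
  As a sketch of what that means in practice: typing sqrt(2) on its own is
  not a command, but each of these puts the result somewhere. The names
  root2, root and x are hypothetical; x stands for any numeric variable.

    . display sqrt(2)
    . scalar root2 = sqrt(2)
    . local root2 = sqrt(2)
    . gen double root = sqrt(x)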



  4 Fuzzy boundaries



  The definition of functions is clear, but some extensions seem in order
  for this talk.



  egen will get some attention. egen has its own sense of functions.



  We will also touch on



        extended macro functions



        c-class results



        _n and _N as "pseudo-functions"



  Mata functions and e-class and r-class results will not be touched upon,
  except in passing. Covering those big topics is for other talks or courses.



  5 _variables



  Let us deal briskly with _n and _N.



  _n is the current observation number.



  _N is the total number of observations in the dataset.



  So, don't use any command to try to find this out.



  Under by: _n and _N are corresponding counts within groups defined by
  by:. This is often a quick and easy way to get counts into variables.



  . bysort id : gen npanel = _N
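
  In the same spirit, a sketch assuming panel data with identifier id and a
  time variable time (the new variable names are only for illustration):

    . bysort id (time) : gen obsno = _n
    . bysort id (time) : gen byte first = _n == 1
    . bysort id (time) : gen byte last = _n == _N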



  If by: is unfamiliar, there is a tutorial at SJ 2(1):86-102 (2002).



  Don't forget _pi for 3.14159...



  6 display and graph are your friends



  If you are unclear what a function does, use display on a few examples.



  . di log(10)
  2.3025851



  . di sqr(10)
  Unknown function sqr()
  r(133);



  . di sqrt(10)
  3.1622777



  twoway function shows graphs of functions: SJ 4(4):488-489 (2004).



  . twoway function clip(x,0.2,0.8)                 (ramp)
  . twoway function chop(x,1), ra(0 10)             (staircase)
  . twoway function sin(x)/x , ra(0 `=20 * _pi')    (damped sine)



  7 Strategies and tactics



  Divide and conquer. Often you need one function used repeatedly, or
  two or more functions working together.



    Hypergeometric probabilities: comb(r, k) * comb(n-r, m-k) / comb(n, m)



  This is particularly common with string problems.



  Nest function calls. Feed the results of one function to another,
  e.g. exp(rnormal()). So, cut down on intermediate macros or variables.



    Beta function: exp(lngamma(a) + lngamma(b) - lngamma(a + b))
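
    A quick check of that recipe, with a = 2 and b = 3 chosen purely for
    illustration. The result should agree with 1! 2! / 4! = 1/12:

    . di exp(lngamma(2) + lngamma(3) - lngamma(5))
    . di 1/12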



  Help yourself. Spaces after commas and around operators often make your
  code more readable.



  help fname() gets you straight to help for fname(), when you know it.



  8 Numbers trapped in strings, and vice versa



  real() and string() are the workhorse functions when you know you want
  to change from one form to the other.



    Always remember that string() can take a format argument too.



  Some users employ destring or tostring with the force option
  to change single variables.



  That is backwards: destring and tostring are just elaborate wrappers for
  those functions. If you know you want to force the conversion, you
  should call the appropriate function directly.
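
  A minimal sketch of the direct route. The variable names strvar, numvar
  and strvar2 are hypothetical, and the format is only one possible choice.

    . di real("42") + 1
    43
    . di real("forty-two")
    .
    . di string(0, "%td")
    01jan1960

    . gen double numvar = real(strvar)
    . gen strvar2 = string(numvar, "%12.0g")

  It is worth looking at whatever could not be converted:

    . count if missing(numvar) & !missing(strvar)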



  9 Depending on conditions



  cond(a, b, c) returns b if a is true (non-zero) and
                        c if a is false (zero).



  Results can be either numeric or string.



  See SJ 5(3):413-420 (2005) for a tutorial.



  Is year a leap year?



    cond(mod(year, 400) == 0, 1,
    cond(mod(year, 100) == 0, 0,
    cond(mod(year, 4)   == 0, 1,
                                0)))
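
  Here is a sketch with a string result, assuming the auto data shipped with
  Stata; the cut-off 4 and the name level are arbitrary. A numeric missing
  counts as larger than any number, so rep78 >= 4 is true when rep78 is
  missing. That is why missings are caught first.

    . sysuse auto, clear
    . gen str4 level = cond(missing(rep78), "", cond(rep78 >= 4, "high", "low"))
    . tabulate level, missing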



  If it's at all complicated, get into a text editor that checks
  matching parentheses.



  10 in list or in range?   SJ 6(4):593-595 (2006)



  inlist(z, a, b,...) returns 0 or 1



    1 if z == a | z == b | ...
    0 otherwise



    Note: limits on number of arguments.



    inlist(rep78, 3, 4, 5)



    inlist(1, a, b, c, d, e)            !!!



  inrange(z, a, b) returns 0 or 1



    1 if a <= z <= b
    0 otherwise
    special rules for missing arguments



    inrange(integer, 0, 9)      
    inrange(char, "a", "z")     
    inrange(char, "A", "Z")     



  Note: negate with if !inlist() and if !inrange() (think: if not in list, etc.)
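
  A sketch of these in use, assuming the auto data shipped with Stata:

    . sysuse auto, clear
    . count if inlist(rep78, 3, 4, 5)
    . count if inrange(mpg, 20, 30)
    . list make rep78 if !inlist(rep78, 3, 4, 5)

  The last command also lists observations with rep78 missing, as missing is
  not in the list.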



  11 Anything missing?



  With a single numeric variable, if x < . is the cleanest way to
  exclude missings: see P.A. Lachenbruch STB 9:9 (1992).



  if x != . is fine so long as .a ... .z are not around.



  For more complicated problems, turn to missing().
  See B. Rising SJ 10(2):303-304 (2010).



  missing(x1, x2, ..., xn) returns
    1 if any of its arguments is missing and
    0 otherwise.



  It applies to numeric and string arguments.



  !missing() reverses results.



  A statistical short-cut is to use regress and then e(sample).
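
  Both approaches can be sketched with the auto data shipped with Stata; the
  choice of variables and the names complete and used are only for
  illustration.

    . sysuse auto, clear
    . count if missing(rep78, price, mpg)
    . gen byte complete = !missing(rep78, price, mpg)

    . regress price mpg rep78
    . gen byte used = e(sample)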



  12 The search for the absolute



  abs(x) is often overlooked when |x| is wanted. People write



    if t > 2 | t < -2



  when they could write more cleanly, and with less risk of error,



    if abs(t) > 2



  sign(x) returns -1, 0, 1, missing for negative, zero, positive, missing
  arguments.



  logit(x) and invlogit(x) are often overlooked too.



    logit(x) = ln[x / (1 - x)]
    invlogit(x) = exp(x) / [1 + exp(x)]



  13 Rounding up and down



  ceil(x) rounds up always. Think "ceiling". ceil(1.2) is 2.



    ceil(10 * runiform()) is a neat way to get uniformly distributed
    integers 1(1)10. Compare 1 + int(10 * runiform()).
    Apply "for any value of 10".



  floor(x) rounds down always. floor(1.8) is 1.



    if x == floor(x) is one of several tests of whether x is integer.



  int(x) rounds towards zero always, i.e. up for negative numbers.



    int(1.2) is 1. int(-1.2) is -1.



  trunc(x) is a synonym for int(x).
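
  The differences show up with negative arguments, as display makes clear:

    . di floor(-1.2)
    -2

    . di ceil(-1.2)
    -1

    . di int(-1.2)
    -1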



  round(x) rounds to the nearest integer.
  round(x, y) rounds to the nearest multiple of y.



  Warning
    round(1.23, 0.1) may look like 1.2, but it cannot be 1.2 exactly.
    If you want to show so many decimal places, use the right format,
    not round(). To find more: search precision.



    See also SJ 3(4):446-447 (2003) or W. Gould (passim) on precision.
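
    A sketch of the point of that warning. A format with many decimal
    places shows what round() really returned, while a display format does
    the job if one decimal place is all you want to see. The variable name
    myvar is hypothetical.

    . di %21.18f round(1.23, 0.1)

    . di %3.1f 1.23
    1.2

    . format myvar %3.1f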



  14 Bin it!



  autocode(), irecode(), recode() are functions for binning a range into
  contiguous intervals.



  See also egen's cut() function.



  Some people use recode for binning continuous variables.



  Over-arching caution: you should check boundary rules carefully.



  2 * floor(myvar/2) illustrates a simpler device.
    This gives bins of width 2 such as [0,2), i.e. lower limits are inclusive.



  2 * ceil(myvar/2) is an alternative.
    This gives bins of width 2 such as (0,2], i.e. upper limits are inclusive.
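
  The same device, sketched with the auto data shipped with Stata and a
  width of 5 chosen just for illustration:

    . sysuse auto, clear
    . gen mpgbin = 5 * floor(mpg/5)
    . tabulate mpgbin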



  15 Going to extremes



  max(x1, x2, ..., xn)



  min(x1, x2, ..., xn)



  A key detail is that missings are ignored unless all arguments are
  missing. This is usually a feature.



  There are work-arounds when you want any missing to trump non-missings.
  One is



  cond(missing(a, b), ., max(a, b))
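
  A few examples with display show the difference:

    . di max(1, ., 3)
    3

    . di max(., .)
    .

    . di cond(missing(1, .), ., max(1, .))
    .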



  If you have lots of arguments, use a loop to get the m??imum.



  If you have a variable, use summarize, meanonly to get the m??imum.



  What about records, such as the maximum so far, and within panels too?



    . gen record = .
    . bysort id (time) : replace record = max(record[_n-1], y)



  16 Any or all?



  If arg is 1 (true) or 0 (false), then



  min(arg) == 0  some false
  min(arg) == 1  all true
  max(arg) == 1  some true
  max(arg) == 0  all false



  which are often useful.



  Using egen's min() or max(), typically in conjunction with by: or by(),
  summarises this by panels or other groups (e.g. families).
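
  A minimal sketch, assuming an indicator variable arg that is 0 or 1 and a
  group identifier id; the new variable names are hypothetical.

    . egen byte anytrue = max(arg), by(id)
    . egen byte alltrue = min(arg), by(id)

  As with the functions max() and min(), missing values of arg are ignored
  unless all values in a group are missing.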



  See http://www.stata.com/support/faqs/data-management/create-variable-recording/



  17 Modulus the maestro      



  mod(x, y) is a very versatile function.



  By a standard abuse of terminology, mod() returns the
  remainder (not the modulus as mathematicians know it).
  Such names have been common in programming since the 1950s.



  mod() received its own small song of praise in SJ 7(1):143-145 (2007).



  That article omitted rotations, e.g. mod(angle + 90, 360)



  Is the observation number odd? if mod(_n, 2) == 1 or if mod(_n, 2)
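
  Some examples (the variable name odd is just for illustration):

    . di mod(17, 5)
    2

    . di mod(350 + 90, 360)
    80

    . gen byte odd = mod(_n, 2)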



  18 Some sums



  Three key points:



    sum(x) returns cumulative sums.



    sum(x) ignores missings (and returns 0 if all missing).



    egen's total() function is more direct - but less efficient - to put
    group totals in a variable.



  Make sure you know this two-step:



    . bysort id : gen mysum = sum(myvar)
    . by id : replace mysum = mysum[_N]



  After the first command, the last observation in each group contains
  its group total. This is what egen's total() does too.



  How many distinct values of x have been seen so far within panels?



    . bysort id x (time) : gen distinct = _n == 1
    . bysort id (time) : replace distinct = sum(distinct)



  19 Stringing along



  Many data management questions - including clean-ups of data - need
  string functions, often in combination.



  Note a key principle at the outset: Stata's string operations are utterly
  literal.



  Moral: many tests should be phrased in terms of one case and/or consistent
  leading, trailing and internal spaces. You can do that in one with say



  lower(trim(itrim(myvar)))
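
  For example, with a made-up string:

    . di lower(trim(itrim("  Durham    UNIVERSITY ")))
    durham university

  In more recent versions of Stata, these functions are documented as
  strlower(), strtrim() and stritrim(); as far as I know the older names
  continue to work.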



  Remember: just as in algebra, every left parenthesis ( is a promise to
  write down its match ) sooner or later.



    (Notation is a branch of etiquette, not of logic.)



  My top four string functions - which everyone should know - are
  strpos(), substr(), subinstr() and length().



  20 The best string quartet this side of Vienna: violins



  strpos(s1, s2) tells you where s2 occurs in s1,
  and 0 if it does not occur. Think string position.



    Corollary: if strpos(s1, s2) > 0 or if strpos(s1, s2)
    is a true-or-false test of whether s1 contains s2.



    strpos("this", "is")         is 3
    strpos("this", "it")         is 0
    strpos("haystack", "needle") is 0



    In older versions of Stata, this function was called index().



    Commonly, s1 is a string variable name.
    Less commonly, s2 is a string variable name.



  substr(s, pos, len) gives the substring of s
  starting at position pos and of length len.



    pos can be < 0 indicating position counted backwards from end.
    len can be . indicating everything else.



    substr("abcdef", 2, 3)  is "bcd"
    substr("abcdef", -3, 2) is "de"
    substr("abcdef", 2, .)  is "bcdef"
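
  The two work well together. Here is a sketch that extracts whatever comes
  before the first comma; the variable names name and surname are
  hypothetical. The if qualifier leaves surname empty when there is no comma.

    . di substr("Cox, Nicholas", 1, strpos("Cox, Nicholas", ",") - 1)
    Cox

    . gen surname = substr(name, 1, strpos(name, ",") - 1) if strpos(name, ",")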



  21 The best string quartet this side of Vienna: viola and cello



  subinstr(s1, s2, s3, n) changes the first n occurrences in s1 of s2 to s3.



    Note: if s3 is "", occurrences are deleted. This is the zap function!



    subinstr("this is this", "is", "X", 1) is "thX is this"
    subinstr("this is this", "is", "X", 2) is "thX X this"
    subinstr("this is this", "is", "X", .) is "thX X thX"



  length(s) returns the length of s.



    Remember: s can be a string variable name. You get the length of its
    contents.



    length("ab") is 2
    length(myvar) could vary by observation



  22 More advanced string theory



  char(n) returns ASCII character n, so is one way of displaying otherwise
  unprintable characters. See SJ 4(1):95-96 (2004).



  For a convenient display, download asciiplot from SSC.



  Stata has a suite of regular expression functions:
  regexm(), regexr(), regexs(). See also strmatch() for wildcard matching.
  They are not very well documented, but start with
  http://www.stata.com/support/faqs/data-management/regular-expressions/



  However, people often dive into regex when the quartet would suffice.



  For one wrapper for regex to extract multiple occurrences of
  substrings, download moss from SSC.



  To work backwards, e.g. to change the last occurrence of a substring,
  consider reversing and finally reversing back with
  reverse(...reverse(...)...)
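
  A sketch of that idea, changing only the last "is" to "X". Each string is
  reversed going in, and the result is reversed coming out. In more recent
  versions of Stata the function is documented as strreverse(); this sketch
  assumes the older name reverse() is still accepted.

    . di reverse(subinstr(reverse("this is this"), reverse("is"), reverse("X"), 1))
    this is thX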



  23 Counting occurrences of substrings



  Let's switch to a problem rather than a function: counting (disjoint)
  occurrences of substrings: see SJ 11(2):318-320 (2011).



  For example: count how many "X"s there are in "OOOOXXXOOXXX".



  Many users store short histories (244 periods or fewer) in string
  variables. Here is one way:



    length(myvar) - length(subinstr(myvar, "X", "", .))



  Take this in steps:



    Get the length of myvar.



    Get the length of myvar with all "X" zapped. (You don't have to do
    that, just find out what the length would be.)



    The difference is what you want.



  This generalises to longer substrings: just remember also to divide by
  length of substring, as you want to count occurrences.
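
  Both cases can be checked with display on the example string, first for
  "X" and then for the longer substring "XX":

    . di length("OOOOXXXOOXXX") - length(subinstr("OOOOXXXOOXXX", "X", "", .))
    6

    . di (length("OOOOXXXOOXXX") - length(subinstr("OOOOXXXOOXXX", "XX", "", .))) / length("XX")
    2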



  Generalisation to regular expressions is trickier: see moss (SSC).



  Remember that some operations on substrings are easier after split.



  24 Removing the first word



  There's more than one way to do it.



  Words are separated by spaces. So, look for the first space.



    trim(substr(myvar, strpos(myvar, " "), .))



  This works, perhaps fortuitously, if there is no space present, as
  strpos() then returns 0 and substr() then returns "".



  Use a dedicated function. word() selects individual words.



    trim(subinstr(myvar, word(myvar, 1), "", 1))



  In both cases, we applied trim() last.
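
  Both recipes, tried on a literal string rather than a variable:

    . di trim(substr("Homo sapiens sapiens", strpos("Homo sapiens sapiens", " "), .))
    sapiens sapiens

    . di trim(subinstr("Homo sapiens sapiens", word("Homo sapiens sapiens", 1), "", 1))
    sapiens sapiens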



  Also, egen's ends() function can do this with its tail option.



  25 Cleaning up species names (binominals)



  The proper form is that genus is capitalised, but not species:
  Homo sapiens, Homo economicus, Troglodytes troglodytes.



    . gen species2 = upper(substr(species, 1, 1)) +
                     lower(substr(species, 2, .))



  The function proper() capitalises each word, not what we want.



  Suppose we want the first two words only, ignoring detail like
  "(Linnaeus, 1758)".



   . replace species2 = word(species2, 1) + " " + word(species2, 2)



  26 Filler apps



  Stata lacks a function quite like rep("X", 80) to replicate strings.



    local text : di _dup(80) "X"



    mata : st_local("text", 80 * "X")



  are two ways to do it.



  With a supply of filler, you can add as much as you want with



    substr("`text'", 1, len)



  Finally, you could always type it out yourself.



  27 Avoid conflicts



  scalar(x) and matrix(X) insist that you want the scalar named x
  and the matrix named X, and not any variable with the same name
  (or the same unambiguous abbreviation).



  Variables, scalars and matrices share the same namespace, and the
  variable interpretation always trumps the others.
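
  A sketch of the conflict, assuming the auto data shipped with Stata, which
  contain a variable price. The value 42 is arbitrary.

    . sysuse auto, clear
    . scalar price = 42

    . di price
    (shows the value of the variable price in the first observation)

    . di scalar(price)
    42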



  Using tempnames helps avoid the problem in another way.



  See G.I. Kolev SJ 6(2):279-280 (2006).



  28 c-class citizens 



  There are many more than are listed here. See help creturn.



  c(current_date)    e.g. "13 Apr 2011"
  c(current_time)    e.g. "09:47:13"
  c(stata_version)   version of Stata
  c(version)         version set by version command
  c(N)               number of observations in dataset
  c(k)               number of variables in dataset
  c(changed)         0 if dataset not changed since last saved, 1 otherwise
  c(seed)            current set seed setting
  c(pi)              _pi
  c(alpha)           a b c ... z
  c(ALPHA)           A B C ... Z
  c(Mons)            Jan Feb Mar ... Dec
  c(Months)          January February March ... December
  c(Wdays)           Sun Mon Tue Wed Thu Fri Sat
  c(Weekdays)        Sunday Monday Tuesday Wednesday Thursday Friday Saturday



  Results are often best invoked as `c(name)' rather than c(name).
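
  A sketch of why: macro-style quotation substitutes the contents before the
  command is run, which is what you need inside strings and for looping over
  lists. The loop variable m is just an illustrative name.

    . di "`c(current_date)' `c(current_time)'"

    . foreach m in `c(Mons)' {
          di "`m'"
      }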



  See also SJ 4(2):223 (2004).



  29 Extended macro functions



  These are documented at help extended_fcn
  (if you can't remember that, start at help macro).



  They are mainly, but not exclusively, useful to programmers.



  They cover, among many other tasks,



    automating look-up of variable types, formats,
    variable and value labels, matrix stuff, constraints, etc.;



    displaying manipulations on the fly; and



    macro manipulation that would otherwise be difficult,
    say because of length limits.



  . local type : type myvar
  . local format : format myvar
  . local label : variable label myvar
  . local text : display %3.1f r(mean)
    (as for sensible rounding in graph annotation)



  30 egen highlights



  egen is an unusual command. It is a wrapper for calling its own
  "functions", but only one at a time. Also, egen functions cannot be
  used outside egen. egen creates new variables (and nothing else).



  More positively, egen is a convenient ragbag for functions of less
  importance, and it provides template code that users can imitate.



  egen [type] newvar = fcn(arguments) [if] [in] [, options]



  by: is allowed with some egen functions (or a by() option, currently
  undocumented).



  The arguments can often be expressions, a point often overlooked.
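
  A sketch of expressions as arguments, assuming the auto data shipped with
  Stata; the threshold 6000 and the new variable names are arbitrary.

    . sysuse auto, clear
    . egen nhigh = total(price > 6000), by(foreign)
    . egen meanlog = mean(ln(price)), by(foreign)

  Note that price > 6000 would also be true for any missing prices, so with
  other data you might need total(price > 6000 & price < .).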



  count(exp)      counts non-missings
  cut(varname)    bins variables
  group(varlist)  a valuable workhorse for assigning identifiers 1 up
  seq()           integer sequences
  tag(varlist)    tags just one observation in each distinct group



  max(exp)        key summary statistics
  mean(exp)
  median(exp)
  min(exp)
  pctile(exp)
  rank(exp)
  sd(exp)
  total(exp)



  row*(varlist)   various row (within observation, across variables) operations



  31 Properties of the other members of any group



  A basic recipe is



  . egen total = total(myvar), by(id)
  . egen count = count(myvar), by(id)



  . gen meanothers =
        (total - cond(missing(myvar), 0, myvar)) / (count - !missing(myvar))



  See http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/



  32 Counting distinct values



  . egen tag = tag(id myvar)
  . egen total = total(tag), by(id)



  Some people say "unique" when they mean "distinct".



  See also SJ 8(4):557-568 (2008).



  33 User-written egen functions



  . findit egen



  points to locations.



  egenmore (SSC) is the largest single collection.



  34 Exercise



  1. This works in Stata:
        . egen gmean = mean(ln(y)), by(id)
        . replace gmean = exp(gmean)



  2. This doesn't work in Stata:
        . egen gmean = exp(mean(ln(y))), by(id)



  3. This does work in Mata:
        exp(mean(ln(y)))



  Explain. Is Mata easier to learn than Stata? 



  35 Acknowledgments



  Helpful comments from
     Stephen Jenkins
     Roger Newson