Fun and fluency with functions

Nicholas J. Cox, Durham University

1 What are functions?

Functions, in the strict sense, take zero or more arguments and return single results. They are documented in [D] functions and at help functions.

Examples are runiform(), ln(42), strpos("Stata", "S").

The () syntax is generic. Think if you like of an open mouth expecting to be fed.

There are functions you know you need and functions you don't know you want. The aim of this talk is to tell you more about the latter than the former. It includes only some highlights, not the complete tour.

If you need date, trigonometric, hyperbolic, gamma, density functions, and so forth, it is mostly just a matter of finding out the syntax.

Henceforth, SJ = Stata Journal, STB = Stata Technical Bulletin, SSC = SSC archive.

A first overview reference is SJ 2(4):411-427 (2002).

2 A request for new users

Note for new Stata users:

Stata commands are not considered to be functions.

list and regress are to be called Stata commands, not functions, regardless of terminology elsewhere.

That's a polite request, not a command.

3 The big picture

Functions can be more useful than you think. Users often seek commands or imagine they need programs when a few functions will crack the problem.

Arguments and results of functions can be variables, calculated observation by observation.

This is sometimes overlooked: people plan loops over observations when generate or replace will do that automatically.
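A minimal sketch of that point, assuming a made-up variable x in a toy dataset of five observations: generate applies each function to every observation without any explicit loop.

```stata
clear
set obs 5
generate x = _n                 // hypothetical variable: 1, 2, 3, 4, 5
generate rootx = sqrt(x)        // sqrt() is evaluated for every observation
generate lnx = ln(x)            // and so is ln()
list
```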

Stata is not knee-deep in functions. The intent is rather to provide a core of really useful functions. Often you need to combine functions to get what you want. This simplicity is really a feature!

Stata functions, in the strict sense, cannot be written by users. You can write functions in Mata. And you can write egen functions.

Functions cannot be called by themselves. You can use commands to assign or display their results.

4 Fuzzy boundaries

The definition of functions is clear, but some extensions seem in order for this talk.

egen will get some attention. egen has its own sense of functions.

We will also touch on

extended macro functions

c-class results

_n and _N as "pseudo-functions"

Mata functions and e-class and r-class results will not be touched upon, except in passing. Covering those big topics is for other talks or courses.

5 _variables

Let us deal briskly with _n and _N.

_n is the current observation number.

_N is the total number of observations in the dataset.

So, don't use any command to try to find this out.

Under by: _n and _N are corresponding counts within groups defined by by:. This is often a quick and easy way to get counts into variables.

. bysort id : gen npanel = _N
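As an illustration only, here is a toy panel with hypothetical variables id and time, showing what _n and _N return with and without by:.

```stata
clear
input id time
1 1
1 2
1 3
2 1
2 2
end
bysort id (time) : generate seq = _n       // 1, 2, 3 then 1, 2
by id : generate npanel = _N               // 3, 3, 3 then 2, 2
generate obsno = _n                        // without by: 1 to 5
list, sepby(id)
```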

If by: is unfamiliar, there is a tutorial at SJ 2(1):86-102 (2002).

Don't forget _pi for 3.14159...

6 display and graph are your friends

If you are unclear what a function does, use display on a few examples.

. di log(10)
2.3025851

. di sqr(10)
Unknown function sqr()
r(133);

. di sqrt(10)
3.1622777

twoway function shows graphs of functions: SJ 4(4):488-489 (2004).

. twoway function clip(x,0.2,0.8)                  (ramp)
. twoway function chop(x,1), ra(0 10)              (staircase)
. twoway function sin(x)/x , ra(0 `=20 * _pi')     (damped sine)

7 Strategies and tactics

Divide and conquer. Often you need one function used repeatedly, or two or more functions working together.

Hypergeometric probabilities: comb(r, k) * comb(n-r, m-k) / comb(n, m)

This is particularly common with string problems.

Nest function calls. Feed the results of one function to another, e.g. exp(rnormal()). So, cut down on middle macros or middle variables.

Beta function: exp(lngamma(a) + lngamma(b) - lngamma(a + b))
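Both formulas can be checked with display. The numbers here are hypothetical, chosen so that the answers are easy to verify by hand: the hypergeometric probability should be 6 * 6 / 120 = 0.3, and B(2, 3) should be 1!2!/4! = 1/12.

```stata
* hypergeometric probability with n = 10, r = 4, m = 3, k = 2
display comb(4, 2) * comb(10 - 4, 3 - 2) / comb(10, 3)

* beta function B(2, 3), compared with 1/12
display exp(lngamma(2) + lngamma(3) - lngamma(2 + 3))
display 1/12
```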

Help yourself. Spaces after commas and around operators often make your code more readable.

help fname() gets you straight to help for fname(), when you know it.

8 Numbers trapped in strings, and vice versa

real() and string() are the workhorse functions when you know you want to change from one form to the other.

Always remember that string() can take a format argument too.
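A few display examples, with made-up values, sketch what the two functions return.

```stata
display real("42") + 1                  // a number trapped in a string, now numeric
display real("forty-two")               // not a number, so missing (.)
display string(42) + "nd Street"        // a number converted to a string
display string(3.14159, "%4.2f")        // with a format argument: 3.14
```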

Some users employ destring or tostring with the force option to change single variables.

That is backwards: destring and tostring are just elaborate wrappers for those functions. If you know you want to force the conversion, you should call the appropriate function directly.

9 Depending on conditions

cond(a, b, c) returns b if a is true (non-zero) and c if a is false (zero).

Results can be either numeric or string.

See SJ 5(3):413-420 (2005) for a tutorial.

Is year a leap year?

cond(mod(year, 400) == 0, 1, cond(mod(year, 100) == 0, 0, cond(mod(year, 4) == 0, 1, 0)))

If it's at all complicated, get into a text editor that checks matching parentheses.
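The leap year test can be tried on a handful of years. This is a sketch for a do-file (the /// continuation lines do not work interactively); the new variable name leap is arbitrary.

```stata
clear
input year
1900
2000
2011
2012
end
generate leap = cond(mod(year, 400) == 0, 1, ///
                cond(mod(year, 100) == 0, 0, ///
                cond(mod(year, 4) == 0, 1, 0)))
list
```

Only 2000 and 2012 should be flagged.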

10 in list or in range? SJ 6(4):593-595 (2006)

inlist(z, a, b,...) returns 0 or 1

1 if z == a | z == b | ...
0 otherwise

Note: limits on number of arguments.

inlist(rep78, 3, 4, 5)

inlist(1, a, b, c, d, e) !!!

inrange(z, a, b) returns 0 or 1

1 if a <= z <= b
0 otherwise
special rules for missing arguments

inrange(integer, 0, 9)
inrange(char, "a", "z")
inrange(char, "A", "Z")

Note: if !inlist() and if !inrange() (think: if not in list, etc.)
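Following the advice to use display on a few examples, here is a sketch with literal arguments.

```stata
display inlist(4, 3, 4, 5)              // 1: 4 is in the list
display inlist(2, 3, 4, 5)              // 0
display inlist("b", "a", "b", "c")      // 1: strings work too
display inrange(5, 0, 9)                // 1
display inrange(., 0, 9)                // 0: a missing value is not known to be in range
display inrange("m", "a", "z")          // 1: a lower-case letter
display !inrange("M", "a", "z")         // 1: not a lower-case letter
```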

11 Anything missing?

With a single numeric variable, if x < . is the cleanest way to exclude missings: see P.A. Lachenbruch STB 9:9 (1992).

if x != . is fine so long as .a ... .z are not around.

For more complicated problems, turn to missing(). See B. Rising SJ 10(2):303-304 (2010).

missing(x1, x2, ..., xn) returns 1 if any of its arguments is missing and 0 otherwise.

It applies to numeric and string arguments.

!missing() reverses results.
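A short sketch of how these tests behave, including the extended missing value .a:

```stata
display missing(1, ., 3)          // 1: at least one argument is missing
display missing(1, 2, 3)          // 0
display missing("")               // 1: the empty string counts as missing
display missing(.a)               // 1: extended missing values are caught
display (.a != .)                 // 1: so .a would slip through if x != .
display (.a < .)                  // 0: if x < . excludes every kind of missing
```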

A statistical short-cut is to use regress and then e(sample).

12 The search for the absolute

abs(x) is often overlooked when |x| is wanted. People write

if t > 2 | t < -2

when they could write more cleanly, and with less risk of error,

if abs(t) > 2

sign(x) returns -1, 0, 1, missing for negative, zero, positive, missing arguments.

logit(x) and invlogit(x) are often overlooked too.

logit(x) = ln[x / (1 - x)]
invlogit(x) = exp(x) / [1 + exp(x)]
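These are easily checked with display on simple, made-up values:

```stata
display abs(-2.5) > 2                   // 1
display sign(-2.5)                      // -1
display logit(0.5)                      // 0
display invlogit(0)                     // .5
display invlogit(logit(0.3))            // back to .3, within rounding
```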

13 Rounding up and down

ceil(x) rounds up always. Think "ceiling". ceil(1.2) is 2.

ceil(10 * runiform()) is a neat way to get uniformly distributed integers 1(1)10. Compare 1 + int(10 * runiform()). Apply "for any value of 10".

floor(x) rounds down always. floor(1.8) is 1.

if x == floor(x) is one of several tests of whether x is integer.

int(x) rounds towards zero always, i.e. up for negative numbers.

int(1.2) is 1. int(-1.2) is -1.

trunc(x) is a synonym for int(x).

round(x) rounds to the nearest integer. round(x, y) rounds to the nearest multiple of y.

Warning: round(1.23, 0.1) may look like 1.2, but it cannot be 1.2 exactly, because 1.2 has no exact binary representation. If you want to show so many decimal places, use the right format, not round(). To find more: search precision.
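A sketch comparing the rounding functions on a few literal values. The last two lines illustrate the warning: a long format exposes what round() really returns, and a short format is the way to control what is displayed.

```stata
display ceil(1.2)                       // 2
display floor(1.8)                      // 1
display int(-1.2)                       // -1: towards zero
display floor(-1.2)                     // -2: always down
display round(2.6)                      // 3
display round(17, 5)                    // 15: nearest multiple of 5
display %21.18f round(1.23, 0.1)        // the stored value is not exactly 1.2
display %3.1f 1.23                      // for display, use a format instead
```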

See also SJ 3(4):446-447 (2003) or W. Gould (passim) on precision.

14 Bin it!

autocode(), irecode(), recode() are functions for binning a range into contiguous intervals.

See also egen's cut() function.

Some people use recode for binning continuous variables.

Over-arching caution: you should check boundary rules carefully.

2 * floor(myvar/2) illustrates a simpler device. This gives bins of width 2 such as [0,2), i.e. lower limits are inclusive.

2 * ceil(myvar/2) is an alternative. This gives bins of width 2 such as (0,2], i.e. upper limits are inclusive.
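The two boundary rules can be compared on a toy variable. The values are made up to sit on and between bin limits; the names lower and upper are arbitrary.

```stata
clear
input myvar
0
1.5
2
3.9
4
end
generate lower = 2 * floor(myvar/2)     // bins [0,2), [2,4), ... labelled by lower limit
generate upper = 2 * ceil(myvar/2)      // bins (0,2], (2,4], ... labelled by upper limit
list
```

Note how the value 2 goes into the bin labelled 2 under both rules, but that bin is [2,4) in the first case and (0,2] in the second.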

15 Going to extremes

max(x1, x2, ..., xn)

min(x1, x2, ..., xn)

A key detail is that missings are ignored unless all arguments are missing. This is usually a feature.

There are work-arounds when you want any missing to trump non-missings. One is

cond(missing(a, b), ., max(a, b))
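A sketch of the default behaviour and of the work-around, using literal arguments:

```stata
display max(1, ., 3)                            // 3: the missing is ignored
display min(1, ., 3)                            // 1
display max(., .)                               // .: all arguments are missing
display cond(missing(1, .), ., max(1, .))       // .: any missing now trumps
```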

If you have lots of arguments, use a loop to get the m??imum.

If you have a variable, use summarize, meanonly to get the m??imum.

What about records, such as the maximum so far, and within panels too?

. gen record = .
. bysort id (time) : replace record = max(record[_n-1], y)
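Here is the record calculation on a toy panel with made-up values. It works because replace proceeds observation by observation, so record[_n-1] has already been updated, and because max() ignores the missing value met at the start of each panel.

```stata
clear
input id time y
1 1 3
1 2 1
1 3 5
2 1 4
2 2 6
2 3 2
end
generate record = .
bysort id (time) : replace record = max(record[_n-1], y)
list, sepby(id)
```

The records should be 3, 3, 5 for the first panel and 4, 6, 6 for the second.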

16 Any or all?

If arg is 1 (true) or 0 (false), then

min(arg) == 0    some false
min(arg) == 1    all true
max(arg) == 1    some true
max(arg) == 0    all false

which are often useful.

Using egen's min() or max(), typically in conjunction with by: or by(), summarises this by panels or other groups (e.g. families).
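A sketch with hypothetical families and an indicator female (1 or 0); the new variable names are arbitrary.

```stata
clear
input family female
1 1
1 0
2 1
2 1
3 0
3 0
end
egen anyfemale = max(female), by(family)    // 1 if any member is female
egen allfemale = min(female), by(family)    // 1 if all members are female
list, sepby(family)
```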

See http://www.stata.com/support/faqs/data-management/create-variable-recording/

17 Modulus the maestro

mod(x, y) is a very versatile function.

By a standard abuse of terminology, mod() returns the remainder (not the modulus as mathematicians know it). Such names have been common in programming since the 1950s.

mod() received its own small song of praise in SJ 7(1):143-145 (2007).

That article omitted rotations, e.g. mod(angle + 90, 360)

Is the observation number odd? if mod(_n, 2) == 1 or if mod(_n, 2)
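Some examples with literal arguments, as a sketch of remainders and rotations:

```stata
display mod(17, 5)                  // 2: the remainder on dividing 17 by 5
display mod(350 + 90, 360)          // 80: rotate a bearing of 350 by 90 degrees
display mod(-10, 360)               // 350: negative angles wrap round too
display mod(2012, 4) == 0           // 1: 2012 is divisible by 4
```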

18 Some sums

Three key points:

sum(x) returns cumulative sums.

sum(x) ignores missings (and returns 0 if all missing).

egen's total() function is more direct - but less efficient - to put group totals in a variable.

Make sure you know this two-step:

. bysort id : gen mysum = sum(myvar)
. by id : replace mysum = mysum[_N]

After the first command, the last observation in each group contains its group total. This is what egen's total() does too.
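The two-step and egen's total() can be compared on a toy dataset (made-up values, including one missing). The name mytotal is arbitrary.

```stata
clear
input id myvar
1 2
1 .
1 5
2 1
2 4
end
bysort id : generate mysum = sum(myvar)     // running sum within id
by id : replace mysum = mysum[_N]           // copy the last value to all
egen mytotal = total(myvar), by(id)         // the same result in one step
list, sepby(id)
```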

How many distinct values of x have been seen so far within panels?

. bysort id x (time) : gen distinct = _n == 1
. bysort id (time) : replace distinct = sum(distinct)
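A sketch of the distinct-so-far calculation on a toy panel with made-up values: the first command tags the first occurrence of each value of x within each panel, and the second counts the tags cumulatively in time order.

```stata
clear
input id time x
1 1 5
1 2 7
1 3 5
1 4 9
2 1 3
2 2 3
end
bysort id x (time) : generate distinct = _n == 1        // tag first occurrence of each x
bysort id (time) : replace distinct = sum(distinct)     // count tags so far
list, sepby(id)
```

The counts should run 1, 2, 2, 3 in the first panel and 1, 1 in the second.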

19 Stringing along

Many data management questions - including clean-ups of data - need string functions, often in combination.

Note a key principle at the outset: Stata's string operations are utterly literal.

Moral: many tests should be phrased in terms of one case and/or consistent leading, trailing and internal spaces. You can do that in one with say

lower(trim(itrim(myvar)))

Remember: just as in algebra, every left parenthesis ( is a promise to write down its match ) sooner or later.

(Notation is a branch of etiquette, not of logic.)

My top four string functions - which everyone should know - are strpos(), substr(), subinstr() and length().

20 The best string quartet this side of Vienna: violins

strpos(s1, s2) tells you where s2 occurs in s1, and 0 if it does not occur. Think string position.

Corollary: if strpos(s1, s2) > 0 or if strpos(s1, s2) is a true-or-false test of whether s1 contains s2.

strpos("this", "is") is 3
strpos("this", "it") is 0
strpos("haystack", "needle") is 0

In older versions of Stata, this function was called index().

Commonly, s1 is a string variable name. Less commonly, s2 is a string variable name.

substr(s, pos, len) gives the substring of s starting at position pos and of length len.

pos can be < 0 indicating position counted backwards from end. len can be . indicating everything else.

substr("abcdef", 2, 3) is "bcd"
substr("abcdef", -3, 2) is "de"
substr("abcdef", 2, .) is "bcdef"

21 The best string quartet this side of Vienna: viola and cello

subinstr(s1, s2, s3, n) changes the first n occurrences in s1 of s2 to s3.

Note: if s3 is "", occurrences are deleted. This is the zap function!

subinstr("this is this", "is", "X", 1) is "thX is this"
subinstr("this is this", "is", "X", 2) is "thX X this"
subinstr("this is this", "is", "X", .) is "thX X thX"

length(s) returns the length of s.

Remember: s can be a string variable name. You get the length of its contents.

length("ab") is 2
length(myvar) could vary by observation
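The quartet usually plays together. As a sketch, here is a hypothetical name in "surname, forename" form being taken apart; the local macro s is only there to save typing.

```stata
local s "Cox, Nicholas"                                 // hypothetical example string
display strpos("`s'", ",")                              // 4: position of the comma
display substr("`s'", 1, strpos("`s'", ",") - 1)        // Cox: everything before it
display trim(substr("`s'", strpos("`s'", ",") + 1, .))  // Nicholas: everything after it
display subinstr("`s'", ",", "", .)                     // Cox Nicholas: comma zapped
display length("`s'")                                   // 13
```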

22 More advanced string theory

char(n) returns ASCII character n, so is one way of displaying otherwise unprintable characters. See SJ 4(1):95-96 (2004).

For a convenient display, download asciiplot from SSC.

Stata has a suite of regular expression functions: regexm(), regexr(), regexs(). There is also strmatch(), which matches simpler wildcard patterns using * and ?. They are not very well documented, but start with http://www.stata.com/support/faqs/data-management/regular-expressions/

However, people often dive into regex when the quartet would suffice.

For one wrapper for regex to extract multiple occurrences of substrings, download moss from SSC.

To work backwards, e.g. to change the last occurrence of a substring consider reversing and finally reversing back with reverse(...reverse(...)...)

23 Counting occurrences of substrings

Let's switch to a problem rather than a function: counting (disjoint) occurrences of substrings: SJ 11(2):318-320 (2011).

For example: count how many "X"s there are in "OOOOXXXOOXXX".

Many users store short histories (244 periods or fewer) in string variables. Here is one way to count:

length(myvar) - length(subinstr(myvar, "X", "", .))

Take this in steps:

Get the length of myvar.

Get the length of myvar with all "X" zapped. (You don't have to do that, just find out what the length would be.)

The difference is what you want.

This generalises to longer substrings: just remember also to divide by length of substring, as you want to count occurrences.
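A sketch using the example string above, held in a local macro named history for convenience. The second calculation counts disjoint occurrences of the two-character substring "XX".

```stata
local history "OOOOXXXOOXXX"

* single characters: 12 - 6 = 6
display length("`history'") - length(subinstr("`history'", "X", "", .))

* longer substrings: divide by the length of the substring, here (12 - 8) / 2 = 2
display (length("`history'") - length(subinstr("`history'", "XX", "", .))) / length("XX")
```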

Generalisation to regular expressions is trickier: see moss (SSC).

Remember that some operations on substrings are easier after split.

24 Removing the first word

There's more than one way to do it.

Words are separated by spaces. So, look for the first space.

trim(substr(myvar, strpos(myvar, " "), .))

This works, perhaps fortuitously, if there is no space present, as strpos() then returns 0 and substr() then returns "".

Use a dedicated function. word() selects individual words.

trim(subinstr(myvar, word(myvar, 1), "", 1))

In both cases, we applied trim() last.

Also, egen's ends() function can do this with its tail option.

25 Cleaning up species names (binominals)

The proper form is that genus is capitalised, but not species: Homo sapiens, Homo economicus, Troglodytes troglodytes.

. gen species2 = upper(substr(species, 1, 1)) + lower(substr(species, 2, .))

The function proper() capitalises each word, not what we want.

Suppose we want the first two words only, ignoring detail like "(Linnaeus, 1758)".

. replace species2 = word(species2, 1) + " " + word(species2, 2)

26 Filler apps

Stata lacks a function quite like rep("X", 80) to replicate strings.

local text : di _dup(80) "X"

mata : st_local("text", 80 * "X")

are two ways to do it.

With a supply of filler, you can add as much as you want with

substr("`text'", 1, len)

Finally, you could always type it out yourself.

27 Avoid conflicts

scalar(x) and matrix(X) insist that you want the scalar named x and the matrix named X, and not any variable with the same name (or the same unambiguous abbreviation).

Variables, scalars and matrices share the same namespace, and the variable interpretation always trumps the others.

Using tempnames helps avoid the problem in another way.
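A sketch of the conflict and both ways round it, assuming a toy dataset in which a variable and a scalar are both deliberately named x:

```stata
clear
set obs 3
generate x = _n             // a variable named x
scalar x = 42               // a scalar with the same name
display x                   // the variable wins: this shows x[1], which is 1
display scalar(x)           // 42: scalar() insists on the scalar

tempname total              // a temporary name that will not clash with a variable
scalar `total' = 10
display scalar(`total')
```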

See G.I. Kolev SJ 6(2):279-280 (2006).

28 c-class citizens

There are many more than this list. See help creturn.

c(current_date)    e.g. "13 Apr 2011"
c(current_time)    e.g. "09:47:13"
c(stata_version)   version of Stata
c(version)         version set by version command
c(N)               number of observations in dataset
c(k)               number of variables in dataset
c(changed)         0 if dataset not changed since last saved, 1 otherwise
c(seed)            current set seed setting
c(pi)              _pi
c(alpha)           a b c ... z
c(ALPHA)           A B C ... Z
c(Mons)            Jan Feb Mar ... Dec
c(Months)          January February March ... December
c(Wdays)           Sun Mon Tue Wed Thu Fri Sat
c(Weekdays)        Sunday Monday Tuesday Wednesday Thursday Friday Saturday

Results are often best invoked as `c(name)' rather than c(name).
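A sketch of both ways of invoking c-class results. The macro form `c(name)' is what you need whenever the result is to be treated as text, as in looping over the letters.

```stata
display c(N)                            // number of observations now in memory
display "`c(current_date)'"             // today's date as a string
display c(pi) == _pi                    // 1

* loop over the lower-case letters
foreach letter in `c(alpha)' {
    display "`letter'" _continue
}
display
display "`: word count `c(alpha)''"     // 26
```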

See also SJ 4(2):223 (2004).

29 Extended macro functions

These are documented at help extended_fcn (if you can't remember that, start at help macro).

They are mainly, but not exclusively, useful to programmers.

They cover, among many other tasks,

automating look-up of variable types, formats, variable and value labels, matrix stuff, constraints, etc.;

displaying manipulations on the fly; and

macro manipulation that would otherwise be difficult, say because of length limits.

. local type : type myvar
. local format : format myvar
. local label : variable label myvar
. local text : display %3.1f r(mean)

(the last as for sensible rounding in graph annotation)

30 egen highlights

egen is an unusual command. It is a wrapper for calling its own "functions", but only one at a time. Also, egen functions cannot be used outside egen. egen creates new variables (and nothing else).

More positively, egen is a convenient ragbag for functions of less importance, and it provides template code that users can imitate.

egen [type] newvar = fcn(arguments) [if] [in] [, options]

by: is allowed with some egen functions (or a by() option, currently undocumented).

The arguments can often be expressions, a point often overlooked.

count(exp)        counts non-missings
cut(varname)      bins variables
group(varlist)    a valuable workhorse for assigning identifiers 1 up
seq()             integer sequences
tag(varlist)      tags just one value in a homogeneous group

key summary statistics: max(exp), mean(exp), median(exp), min(exp), pctile(exp), rank(exp), sd(exp), total(exp)

row*(varlist)     various row operations (within observations, across variables)

31 Properties of the other members of any group

A basic recipe is

. egen total = total(myvar), by(id)
. egen count = count(myvar), by(id)

. gen meanothers = (total - cond(missing(myvar), 0, myvar)) / (count - !missing(myvar))
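The recipe can be checked on a toy dataset with made-up values, including one missing. For an observation with missing myvar, nothing is subtracted from the total or the count, so the result is the mean of all the other (non-missing) values in the group.

```stata
clear
input id myvar
1 2
1 4
1 .
1 6
2 5
2 7
end
egen total = total(myvar), by(id)
egen count = count(myvar), by(id)
generate meanothers = (total - cond(missing(myvar), 0, myvar)) / (count - !missing(myvar))
list, sepby(id)
```

In group 1 the results should be 5, 4, 4, 3; in group 2 they should be 7 and 5.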

See http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/

32 Counting distinct values

. egen tag = tag(id myvar)
. egen total = total(tag), by(id)

Some people say "unique" when they mean "distinct".

See also SJ 8(4):557-568 (2008).

33 User-written egen functions

. findit egen

points to locations.

egenmore (SSC) is the largest single collection.

34 Exercise

1. This works in Stata:

. egen gmean = mean(ln(y)), by(id)
. replace gmean = exp(gmean)

2. This doesn't work in Stata:

. egen gmean = exp(mean(ln(y))), by(id)

3. This does work in Mata:

exp(mean(ln(y)))

Explain. Is Mata easier to learn than Stata?

35 Acknowledgments

Helpful comments from Stephen Jenkins and Roger Newson.