ZLAvian

Introduction

The package ZLAvian tests for patterns consistent with Zipf’s Law of Abbreviation (ZLA) in animal communication following the methods described in Lewis et al. (2023) and Gilman et al. (2023).

library(ZLAvian)

testZLA

This function measures and tests the statistical significance of the concordance between note duration and frequency of use in a sample of animal communication represented in a dataframe that must include columns with the following names and information:

A column “note” (factor/character) that describes the type to which each note, phrase, or syllable in the sample has been assigned;
A column “duration” (numeric) that describes the duration of each note, phrase, or syllable; and
A column “ID” (factor) that identifies the individual animal that produced each note, phrase, or syllable.

Other columns in the dataframe are ignored in the analysis. Youngblood (2024) observed that the column duration might alternatively include data on any other numerical measure that estimates the effort involved in producing a note type.
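As a minimal sketch of the expected input, the toy data frame below has the three required columns. The object name toy_songs and all values in it are invented for illustration and are not part of the package.

```r
# Hypothetical toy data: three birds, a few note types each.
toy_songs <- data.frame(
  note     = c("a", "a", "b", "c", "a", "b", "b", "c", "a", "c"),
  duration = c(0.12, 0.11, 0.25, 0.40, 0.13, 0.27, 0.24, 0.38, 0.10, 0.41),
  ID       = factor(c("bird1", "bird1", "bird1", "bird1",
                      "bird2", "bird2", "bird2", "bird2",
                      "bird3", "bird3")),
  stringsAsFactors = FALSE
)

# Check that the columns described above are present with suitable types.
required <- c("note", "duration", "ID")
has_required <- all(required %in% names(toy_songs))
stopifnot(has_required,
          is.numeric(toy_songs$duration),
          is.factor(toy_songs$ID))
```

A check like this before calling testZLA can catch misnamed columns early, which is cheaper than discovering the problem after a long permutation run.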

testZLA computes the mean concordance (i.e., Kendall’s tau) between note duration and frequency of use within individuals, averages across all individuals in the data set, and compares this to the expectation under the null hypothesis that note duration and frequency of use are unrelated. The null distribution is computed by permutation while constraining for the observed similarity of note repertoires among individuals. This controls for the possibility that individuals in the population learn their repertoires from others. The significance test is one-tailed, so p-values close to 1 suggest evidence for a positive concordance, contrary to ZLA. See Lewis et al. (2023) for the formal computation of the null distribution and Gilman et al. (2023) for discussion.
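The within-individual concordance can be illustrated with base R alone. The sketch below is not the package's implementation: it uses invented toy data, weights individuals equally, and omits the permutation test entirely. The helper name tau_one is hypothetical.

```r
# Hypothetical toy data (not from the package): two birds, three note types.
toy <- data.frame(
  note     = c("a", "a", "a", "b", "b", "c",
               "a", "a", "b", "b", "b", "c"),
  duration = c(0.10, 0.12, 0.11, 0.20, 0.22, 0.40,
               0.30, 0.32, 0.10, 0.12, 0.11, 0.20),
  ID       = rep(c("bird1", "bird2"), each = 6)
)

# For one individual: Kendall's tau between how often each note type is used
# and the mean log duration of that note type.
tau_one <- function(d) {
  freq         <- tapply(d$duration, d$note, length)
  mean_log_dur <- tapply(log(d$duration), d$note, mean)
  cor(as.numeric(freq), as.numeric(mean_log_dur), method = "kendall")
}

# One tau per individual, then an equal-weight average across individuals.
taus     <- sapply(split(toy, toy$ID), tau_one)
mean_tau <- mean(taus)
```

In this toy data bird1 uses its shortest note type most often and its longest least often, so its tau is -1; bird2 has a mixed ordering and a tau of -1/3. A negative mean is the direction consistent with ZLA, but only the permutation test in testZLA says whether it is more negative than expected by chance.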

Users can control the following parameters in testZLA:

minimum: the minimum number of times a note type must appear in the data set to be included in the analysis. All note types that appear fewer than this number of times are removed from the sample before analysis. minimum must be a positive integer. The default value is 1.
null: the number of permutations used to estimate the null distribution. This must be a positive integer 99 or greater. The default value is 999.
est: takes values “mixed,” “mean,” or “median.” If est = “mixed” then the expected logged duration for each note type in the population is computed as the intercept from an intercept-only mixed effects model (fit using the lmer() function of lme4) that includes a random effect of individual ID. This computes a weighted mean across individuals, and accords greater weights to individuals that produce the note type more frequently. If est = “mean” then the expected logged duration for each note type in the population is computed as the mean of the means for the individuals, with each individual weighted equally. If est = “median” then the expected logged duration for each note type within individuals is taken to be the median logged duration of the note type when produced by that individual, and the expected logged duration for each note type in the population is taken to be the median of the medians for the individuals that produced that note type. The expected durations for note types are used in the permutation algorithm. Estimation using the “mixed” approach is more precise: it gives greater weight to means based on more notes, because we can estimate those means more accurately. However, estimation using the “mean” approach is faster.
cores: The permutation process in testZLA is computationally expensive. To make simulating the null distribution faster, testZLA allows users to assign the task to multiple cores (i.e., parallelization). cores must be an integer between 1 and the number of cores available on the user’s machine, inclusive. Users can find the number of cores on their machines using the function detectCores() in the package parallel. The default value is 2.
transform: Takes values “log” or “none.” Indicates how duration data should be transformed prior to analysis. Defaults to “log.” Gilman and colleagues (2023) argue that log transformation is often appropriate for duration data, but other measures might be better analysed as untransformed data.
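The effect of some of these parameters can be sketched in base R. The code below is an illustration under stated assumptions, not the package's internal code: the toy data are invented, the filtering step mirrors the description of minimum, and the two estimates mirror the descriptions of est = "mean" and est = "median" after a log transform. The "mixed" option is not reproduced here.

```r
library(parallel)  # ships with R; provides detectCores()

# Hypothetical toy data (not from the package).
toy <- data.frame(
  note     = c("a", "a", "a", "b", "b", "c", "a", "b", "b"),
  duration = c(0.10, 0.12, 0.11, 0.20, 0.22, 0.40, 0.14, 0.18, 0.20),
  ID       = c(rep("bird1", 6), rep("bird2", 3))
)

# minimum: drop note types that occur fewer than `minimum` times overall.
minimum <- 2
counts  <- table(toy$note)
kept    <- toy[toy$note %in% names(counts)[counts >= minimum], ]

# transform = "log": work with logged durations.
kept$log_dur <- log(kept$duration)

# est = "mean": mean per individual, then equal-weight mean across individuals.
per_bird_mean <- aggregate(log_dur ~ note + ID, data = kept, FUN = mean)
est_mean      <- tapply(per_bird_mean$log_dur, per_bird_mean$note, mean)

# est = "median": median per individual, then median across individuals.
per_bird_median <- aggregate(log_dur ~ note + ID, data = kept, FUN = median)
est_median      <- tapply(per_bird_median$log_dur, per_bird_median$note, median)

# cores: never ask for more cores than the machine reports; fall back to 1
# if detectCores() cannot determine the number.
available <- detectCores()
n_cores   <- if (is.na(available)) 1L else min(2L, available)
```

Here note type "c" appears only once, so it is removed when minimum is 2. The value in n_cores could then be passed as the cores argument of testZLA.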

data(testdata, package = "ZLAvian")

test.ZLA.output = testZLA(data, minimum = 1, null = 999, est = "mixed", cores = 2)
#>                          tau         p
#> individual level -0.03745038 0.4110000
#> population level -0.15000000 0.2088536

testZLA prints a table that reports concordances (tau) and p-values at the individual and population levels. Results at the individual level are obtained using the method described in Lewis et al. (2023) and Gilman et al. (2023). Results at the population level report the concordance between note type duration and frequency of use in the full dataset, without considering which individuals produced which notes. Population-level concordances may be problematic when studying ZLA in animal communication (see Gilman et al. 2023 for discussion) but have been widely used to study ZLA in human languages.

Further information can be extracted from the function:

$stats: a matrix that reports Kendall’s tau and the p-value associated with Kendall’s tau computed at both the population and individual levels.
$unweighted: in stats, the population mean value of Kendall’s tau is computed with the value of tau for each individual weighted by its inverse variance. The inverse variance depends on the individual’s repertoire size. In unweighted, Kendall’s tau and the p-value associated with tau are computed with tau for each individual weighted equally regardless of repertoire size.
$overview: a matrix that reports the total number of notes, total number of individuals, total number of note types, mean number of notes per individual, and mean number of note types produced by each individual in the dataset.
$shannon: a matrix that reports the Shannon diversity of note classes in the population and the mean Shannon diversity of note classes used by individuals in the dataset.
$plotObject: a list containing data used by the plotZLA function to produce a web plot illustrating the within-individual concordance in the data set.
$thresholds: a matrix that reports the 90% inclusion interval for the null distribution for the mean value of Kendall’s tau in the population, both with and without weighting individual taus by their inverse variances. The lower bound represents the least negative concordance that would be inferred to be evidence for ZLA, and is thus a measure of the power of the analysis.
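To make the quantities in $shannon concrete, the sketch below computes Shannon diversity for the whole sample and for each individual in base R. It assumes natural logarithms, which may not match the base used by the package, and the toy data and the helper name shannon are invented for illustration.

```r
# Hypothetical toy data (not from the package).
toy <- data.frame(
  note = c("a", "a", "b", "c", "a", "a", "a", "b"),
  ID   = rep(c("bird1", "bird2"), each = 4)
)

# Shannon diversity H = -sum(p * log(p)) over the note types present in x.
# Natural log is an assumption made for this sketch.
shannon <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  p <- p[p > 0]  # ignore note types that do not occur in x
  -sum(p * log(p))
}

# Diversity of note types in the whole sample.
population_H <- shannon(toy$note)

# Diversity within each individual, then the mean across individuals.
individual_H      <- sapply(split(toy$note, toy$ID), shannon)
mean_individual_H <- mean(individual_H)
```

Comparing population_H with mean_individual_H gives a rough sense of how much of the population's repertoire diversity is shared within each individual.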

plotZLA

plotZLA uses the output of testZLA to create a web plot illustrating the concordance between note class duration and frequency of use within individuals in the population.

Users can control the following parameters in plotZLA:

title: a title for the web plot.
ylab: a label for the y-axis of the web plot. Defaults to “duration (s).”
x.scale: takes values “log” or “linear.” Indicates how the x-axis should be scaled. Defaults to “log.”
y.base: takes values 2 or 10. Controls tick mark positions on the y-axis. When set to 2, tick marks indicate that values differ by a factor of 2. When set to 10, tick marks indicate that values differ by a factor of 10. If data was not log transformed in the analysis being illustrated, then the y-axis is linear and this argument is ignored. Defaults to 2.

store <- testZLA(data, minimum = 1, null = 999, est = "mixed", cores = 2)
#>                          tau         p
#> individual level -0.03745038 0.4250000
#> population level -0.15000000 0.2088536

plotZLA(store, ylab = "duration (ms)", x.scale = "linear")

In the figure produced by plotZLA, each point represents a note or phrase type in the population repertoire. Note types are joined by a line if at least one individual produces both note types. The weight of the line is proportional to the number of individuals that produce both note types. The color of the lines indicates whether there is a positive (blue) or negative (red) concordance between the duration and frequency of use of the joined note types. Negative concordances are consistent with Zipf’s law of abbreviation. Shades between blue and red indicate that the concordance is positive in some individuals and negative in others. For example, this can happen if some individuals use the note types more frequently than others, such that the rank order of frequency of use varies among individuals. Grey crosses centered on each point show the longest and shortest durations of the note type (vertical) and the highest and lowest frequencies of use (horizontal) in the population.

References

Gilman, R. T., Durrant, C. D., Malpas, L., and Lewis, R. N. (2023) Does Zipf’s law of abbreviation shape birdsong? bioRxiv (doi.org/10.1101/2023.12.06.569773).

Lewis, R. N., Kwong, A., Soma, M., de Kort, S. R., and Gilman, R. T. (2023) Java sparrow song conforms to Mezerath’s law but not to Zipf’s law of abbreviation. bioRxiv (doi.org/10.1101/2023.12.13.571437).

Youngblood, M. (2024) Language-like efficiency and structure in house finch song. Proceedings of the Royal Society B - Biological Sciences, 291:20240250 (doi.org/10.1098/rspb.2024.0250).