Posts Tagged ‘linguistics’

Visiting UCLA


I shall be spending next week (4th-9th March) at the UCLA Department of Linguistics, where I shall give five hours of talks on Statistics for Historical Linguistics: a two-hour review of the field (Thursday morning), a two-hour presentation of my own DPhil work (pdf) (Friday afternoon), and a one-hour tutorial on TraitLab (Friday afternoon?). Abstracts:

1. Statistical methods in Historical Linguistics (Thursday morning)
Recent advances in our understanding of language change, in statistical methodology, and in computational power, along with an increasing wealth of available data, have allowed significant progress in statistical modelling of language change, and quantitative methods are gaining traction in Historical Linguistics. Models have been developed for the change through time of vocabulary, morpho-syntactic and phonetic traits. I shall present a review of these models (from a statistician’s point of view), starting with Morris Swadesh’s failed attempts at glottochronology, then looking at some models developed in the last decade. In parallel, I shall provide brief insights into statistical tools such as Bayesian statistics and Markov Chain Monte Carlo, in order to show how to use these effectively for linguistic applications.
2. A phylogenetic model of language diversification (Friday afternoon)
Language diversification is a random process similar in many ways to biological evolution. We model the diversification of so-called “core” lexical data by a stochastic process on a phylogenetic tree. We initially focus on the Indo-European language family. The age of the most recent common ancestor of these languages is of particular interest and issues of dating ancient languages have been subject to controversy. We use Markov Chain Monte Carlo to estimate the tree topology, internal node ages and model parameters. Our model includes several aspects specific to language diversification, such as rate heterogeneity and the data registration process, and we show that lexical borrowing does not bias our estimates. We show the robustness of our model through extensive validation and analyse two independent data sets to estimate the age of Proto-Indo-European. We then analyse a data set of Semitic languages, and show an extension of our model to explore whether languages evolve in “punctuational bursts”. Finally, we revisit an analysis of several small data sets by Bergsland & Vogt (1962).
Joint work with Geoff Nicholls

3. Tutorial and practical: TraitLab, a package for phylogenies of linguistic and cultural traits
In this tutorial, I shall present how to use the TraitLab package, which was initially developed specifically for the modelling of core vocabulary change through time, and guide interested attendees through an analysis of a simple data set. TraitLab is a software package for simulating, fitting and analysing tree-like binary data under a stochastic Dollo model of evolution. It handles “catastrophic” rate heterogeneity and missing data. The core of the package is a Markov chain Monte Carlo (MCMC) sampling algorithm that enables the user to sample from the Bayesian joint posterior distributions for tree topologies, clade and root ages, and the trait loss and catastrophe rates for a given data set. Data can be simulated according to the fitted Dollo model or according to a number of generalized models that allow for borrowing (horizontal transfer) of traits, heterogeneity in the trait loss rate and biases in the data collection process. Both the raw data and the output of MCMC runs can be inspected using a number of useful graphical and analytical tools provided within the package. TraitLab is freely available and runs within the Matlab computing environment.

Attendees who wish to use TraitLab during the practical should have a computer with Matlab installed.

TraitLab was developed jointly with Geoff Nicholls and David Welch.
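
For readers who have not met a stochastic Dollo model before, here is a minimal sketch in Python of how binary trait data can be simulated on a fixed tree. It is not TraitLab code and does not use TraitLab's interface (TraitLab runs in Matlab). The tree, the rates and every function name below are hypothetical, and catastrophes, missing data and borrowing are left out. The only assumptions are the usual ones for this kind of model: each trait is born exactly once, at a constant rate along each branch, and is then lost independently in each lineage at a constant rate.

```python
import itertools
import math
import random


def event_times(rate, length, rng):
    """Event times of a Poisson process with the given rate on [0, length)."""
    times = []
    t = rng.expovariate(rate)
    while t < length:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def simulate_dollo(root, birth_rate, death_rate, seed=0):
    """Simulate traits on a tree; return {leaf name: set of trait ids}.

    A node is a tuple (name, branch_length, children); leaves have no children.
    """
    rng = random.Random(seed)
    ids = itertools.count()
    leaves = {}

    # Traits at the root: Poisson(birth_rate / death_rate), the equilibrium
    # number of traits. The count of events of a rate-r Poisson process on
    # the unit interval is Poisson(r).
    n_root = len(event_times(birth_rate / death_rate, 1.0, rng))
    root_traits = {next(ids) for _ in range(n_root)}

    def evolve(node, inherited):
        name, length, children = node
        # Each inherited trait survives the branch with probability exp(-mu * t).
        present = {trait for trait in inherited
                   if rng.random() < math.exp(-death_rate * length)}
        # New traits are born along the branch and must survive to its end.
        for birth_time in event_times(birth_rate, length, rng):
            trait = next(ids)  # a fresh id: every trait is born exactly once
            if rng.random() < math.exp(-death_rate * (length - birth_time)):
                present.add(trait)
        if children:
            for child in children:
                evolve(child, present)
        else:
            leaves[name] = present

    evolve(root, root_traits)
    return leaves


# Hypothetical three-language tree, branch lengths in years.
tree = ("root", 0.0, [
    ("AB", 1000.0, [("A", 500.0, []), ("B", 500.0, [])]),
    ("C", 1500.0, []),
])
data = simulate_dollo(tree, birth_rate=0.02, death_rate=0.0002, seed=1)
for name in sorted(data):
    print(name, len(data[name]), "traits")
print("shared by A and B:", len(data["A"] & data["B"]))
print("shared by A and C:", len(data["A"] & data["C"]))
```

Since A and B separated more recently than either did from C, they should tend to share more traits with each other than with C. That signal is what a phylogenetic analysis uses when it works in the other direction, from the data back to the tree.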

Interview in La Recherche


La Recherche, one of the leading French popular science magazines, has a monthly piece in which they present a mathematician’s work in two pages. I had the honour and great pleasure of being featured in this month’s issue. Philippe Pajot interviewed me for over an hour about my work on models of language diversification and did a pretty good job of summarizing it in a few hundred words, with a presentation of the linguistic question, the data, and a mention of the issues of validation and error bars. There is also a brief attempt at explaining MCMC.

A scan is below, for French-speaking readers.

Pour la Science article on Indo-European expansion


Phylogenetic models of language diversification seem to be popular these days in French popular science magazines. Of the leading publications, La Recherche will feature an interview with yours truly in March, and Pour la Science has an 8-page cover story on the subject in the current issue.

Popularizer Ruth Berger looks at the expansion of the Indo-Europeans from genetic and linguistic points of view, trying to reconcile them and to decide between the Kurgan (horsemen) and Anatolian (farmers) possible origins of Indo-European expansion. For the linguistics half, she looks at phylogenetic models to infer genealogies and dates, but skips the methodology and directly reproduces trees by Gray & Atkinson (2003) and Atkinson et al. (2005).

It is a shame that the method is presented as a black box. Given the length of the article, it would have been possible to give a general idea of how dates are inferred: the ages of parts of the tree are known, and this information is used to estimate rates of change and other ages. Instead, the author suggests that the rates are already known [whence?] and are fed to the black box, which magically outputs a tree and dates. There is barely anything about the uncertainty of the estimates, and nothing about validation. I also have trouble understanding the points made at the end about Linear A and the attempt to merge the Anatolian and Kurgan hypotheses.
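
The calibration idea can be made concrete with a minimal sketch in Python. It assumes a deliberately crude model, far simpler than those of Gray & Atkinson or Atkinson et al.: each word is lost independently at a constant rate mu in each lineage, so that two languages which split t years ago are expected to share a fraction exp(-2 mu t) of their cognates. The function names and all the numbers are invented for illustration.

```python
import math


def estimate_loss_rate(shared_fraction, split_age):
    """Loss rate per year implied by a pair of languages of known split age.

    Toy assumption: expected shared fraction = exp(-2 * mu * split_age).
    """
    return -math.log(shared_fraction) / (2.0 * split_age)


def estimate_split_age(shared_fraction, loss_rate):
    """Split age implied by a shared fraction, given an estimated loss rate."""
    return -math.log(shared_fraction) / (2.0 * loss_rate)


# Step 1: a calibration pair whose split age is known (hypothetical numbers):
# split 2000 years ago, 60% of cognates still shared.
mu = estimate_loss_rate(shared_fraction=0.60, split_age=2000.0)

# Step 2: use that rate to date a pair of unknown age sharing 35% (hypothetical).
age = estimate_split_age(shared_fraction=0.35, loss_rate=mu)

print(f"estimated loss rate: {mu:.6f} per year")
print(f"estimated split age: {age:.0f} years")
```

The point is the direction of the inference: the rate is not fed in from outside but is itself estimated from the parts of the tree whose ages are known. A real analysis would also replace these two point estimates with a distribution over trees, rates and ages, which is where the error bars come from.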

This issue is number 400 of Pour la Science. According to the editor-in-chief, they decided to celebrate with an issue on “theories and models”. Indo-European expansion is one of their examples, along with pieces on Grand Unification and on gene transfers. Uncertainty and validation are major parts of any decent modelling endeavour, and it is a shame that they did not seize the opportunity to educate their readership about these issues.

I suppose it is hardly surprising that I am disappointed with a popular science paper on a topic related to my PhD…

Editorial on phylogenetic models of language change


The March edition of the newsletter of the Centre de Recherche en Économie et Statistique is now available online (in French). The main feature of the newsletter is a non-technical introduction to phylogenetic models of language change, and specifically the issue of dating Proto-Indo-European. This is inspired by our paper with Geoff Nicholls and corresponds to work done during my DPhil.