I shall be spending next week (4th-9th March) at the UCLA Department of Linguistics, where I shall give five hours of talks on Statistics for Historical Linguistics: a two-hour review of the field (Thursday morning), a two-hour presentation of my own DPhil work (Friday afternoon), and a one-hour tutorial on TraitLab (Friday afternoon?). Abstracts:
1. Statistical methods in Historical Linguistics (Thursday morning)
Recent advances in our understanding of language change, in statistical methodology, and in computational power, along with an increasing wealth of available data, have allowed significant progress in statistical modelling of language change, and quantitative methods are gaining traction in Historical Linguistics. Models have been developed for the change through time of vocabulary, morpho-syntactic and phonetic traits. I shall present a review of these models (from a statistician’s point of view), starting with Morris Swadesh’s failed attempts at glottochronology, then looking at some models developed in the last decade. In parallel, I shall provide brief insights into statistical tools such as Bayesian statistics and Markov Chain Monte Carlo, in order to show how to use these effectively for linguistic applications.
2. A phylogenetic model of language diversification (Friday afternoon)
Language diversification is a random process similar in many ways to biological evolution. We model the diversification of so-called “core” lexical data by a stochastic process on a phylogenetic tree. We initially focus on the Indo-European language family. The age of the most recent common ancestor of these languages is of particular interest, and the dating of ancient languages has been subject to controversy. We use Markov Chain Monte Carlo to estimate the tree topology, internal node ages and model parameters. Our model includes several aspects specific to language diversification, such as rate heterogeneity and the data registration process, and we show that lexical borrowing does not bias our estimates. We show the robustness of our model through extensive validation and analyse two independent data sets to estimate the age of Proto-Indo-European. We then analyse a data set of Semitic languages, and show an extension of our model to explore whether languages evolve in “punctuational bursts”. Finally, we revisit an analysis of several small data sets by Bergsland & Vogt (1962).
Joint work with Geoff Nicholls
3. Tutorial and practical: TraitLab, a package for phylogenies of linguistic and cultural traits
In this tutorial, I shall show how to use the TraitLab package, which was initially developed specifically for the modelling of core vocabulary change through time, and guide interested attendees through an analysis of a simple data set. TraitLab is a software package for simulating, fitting and analysing tree-like binary data under a stochastic Dollo model of evolution. It handles “catastrophic” rate heterogeneity and missing data. The core of the package is a Markov chain Monte Carlo (MCMC) sampling algorithm that enables the user to sample from the Bayesian joint posterior distributions for tree topologies, clade and root ages, and the trait loss and catastrophe rates for a given data set. Data can be simulated according to the fitted Dollo model or according to a number of generalized models that allow for borrowing (horizontal transfer) of traits, heterogeneity in the trait loss rate and biases in the data collection process. Both the raw data and the output of MCMC runs can be inspected using a number of useful graphical and analytical tools provided within the package. TraitLab is freely available and runs within the Matlab computing environment.
Attendees who wish to use TraitLab during the practical should have a computer with Matlab installed.
TraitLab was developed jointly with Geoff Nicholls and David Welch.
Below are the slides of my presentation earlier today at the Institut Jean Nicod (ENS Ulm). The atmosphere was very friendly and we had some pleasant discussions afterwards; some of the questions about the Statistics were surprisingly acute coming from non-specialists, for example questions on prior selection.
I shall be giving a talk at the SIGMA seminar of the Institut Jean Nicod (Ecole Normale Supérieure) tomorrow.
Topic: “Phylogenetic models of language diversification”
Where: IJN/LSCP seminar room, 29 rue d’Ulm, Paris
When: Wednesday October 6th, 11:00-12:30
The slides will be available on Slideshare tomorrow.
The Greek Stochastics Beta’ conference is ongoing. Apart from the trips to the beaches at the end of the day, one of the highlights so far was Garrett Hellenthal’s talk on “A new statistical method to identify and date population admixture events using dense genetic variation data”.
The authors have studied human DNA from various populations around the world to detect admixture (basically, when one population receives genetic material from another population, as might happen for example after an invasion). They look at the 53 populations from the Human Genome Diversity Project and detect between which populations there has been admixture.
What is impressive is that by looking at recombination, they are able to date the admixture events. For example, looking at South American populations, they reconstruct admixture from Europeans dating back to the conquistadores. They also find a strong signal in several places around Central Asia for admixture from Mongolia shortly after the time of Genghis Khan.
The method can go back about 150 generations (that is, to around 2000 B.C.). This will have fascinating implications for our understanding of recent human history.
As part of the Paris-Montagne festival on the importance of making mistakes in scientific research, I attended a screening last night of Bonjour les morses (Hello walruses). Antonio Fischetti followed CNRS bioacousticians Thierry Aubin and Isabelle Charrier on an expedition to study walrus communication in the Arctic.
The film is unique in that the expedition failed rather miserably: after spending two weeks stuck at the base due to bad weather, the team was finally able to go out. On the second day, they were stranded on an iceberg because the ice had moved. They spent 60 hours waiting for rescue, with no shelter and close to no food. The team members remained surprisingly calm, considering that there was a non-zero probability that they would all freeze to death. They were eventually rescued by a helicopter, but had to abandon all their equipment.
There cannot be many other documentary films on castaways, for obvious reasons. But the main interest of the film was to show that science does not always work as planned; it also served as a reminder that field research is not a piece of cake. Another screening, followed by a discussion with Aubin and Charrier, will take place this Saturday at 5.30pm at the École Normale Supérieure in Paris.
Omiros Papaspiliopoulos gave a talk yesterday at the Big MC seminar on “Making black boxes out of black boxes – the Bernoulli Factory problem and its applications”, inspired by this paper with Latuszynski et al.
The problem is very easy to state: suppose you have a p-coin (i.e. a coin which has probability p of landing on Heads), where p is unknown. How can you use that coin to simulate an f(p)-coin, where f is some function? You can toss the coin as many times as you need.
If you choose the constant function f(p) = 1/2, the problem is quite easy, and from there, you can get any other constant function. The class of functions for which a solution exists is quite large and includes, for example, . However, amazingly, there is no algorithm which works for . This statement provoked some disbelief in the audience, and several of us tried (and failed) to prove Omiros wrong while he was giving details of the proof, which is quite technical.
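To make the easy constant case concrete, here is a minimal Python sketch. It assumes the constant function in question is f(p) = 1/2, which can be obtained with von Neumann's classic trick: toss the p-coin twice and keep the first result only when the two tosses differ. The second function then builds a c-coin for any known constant c by comparing fair bits with the binary expansion of c. The function names (`make_p_coin`, `fair_coin`, `constant_coin`) are my own for illustration and are not taken from the talk or the paper.

```python
import random


def make_p_coin(p, rng):
    """Return a function simulating a p-coin (True = Heads).

    In the real problem p is unknown; it is fixed here only to test the sketch.
    """
    return lambda: rng.random() < p


def fair_coin(p_coin):
    """Von Neumann's trick: a 1/2-coin from a p-coin, for any 0 < p < 1.

    Heads-Tails and Tails-Heads both have probability p(1 - p), so keeping
    the first toss of a discordant pair gives a fair result.
    """
    while True:
        first, second = p_coin(), p_coin()
        if first != second:
            return first


def constant_coin(c, p_coin):
    """Simulate a c-coin for a known constant c in [0, 1] using fair bits only.

    Fair bits are read as the binary digits of a uniform U, and compared
    with the binary digits of c. At the first digit where they differ,
    U < c exactly when the digit of c is 1, so we return Heads in that case.
    """
    while True:
        c *= 2
        c_bit = c >= 1  # next binary digit of c
        if c_bit:
            c -= 1
        if fair_coin(p_coin) != c_bit:
            return c_bit


if __name__ == "__main__":
    rng = random.Random(42)
    coin = make_p_coin(0.2, rng)  # the algorithms never look at 0.2 directly
    n = 10000
    print(sum(fair_coin(coin) for _ in range(n)) / n)
    print(sum(constant_coin(0.3, coin) for _ in range(n)) / n)
```

Neither function needs to know p, which is the whole point: the number of tosses is random, but the output probability is exact. The interesting cases are of course the non-constant functions f, which this sketch does not address.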
The second part of the seminar was a nice presentation by Pierre Jacob on Andrieu et al.’s paper on Particle Markov Chain Monte Carlo which was read at the RSS last autumn.