Archive for the ‘Research’ Category

Corcoran medal

17/10/2013 weeks ago, I had the great honour of receiving the 2012 Corcoran memorial medal and prize for my doctoral dissertation. It is awarded by Oxford’s Department of Statistics in memory of Stephen Corcoran, a student who died in 1996 before having time to complete his DPhil. Being a Statistics prize, there is smoothing in the award process: it is awarded every two years, to a DPhil which was completed in the last four years (i.e. between October 2008 and October 2012 in my case). The ceremony was part of the Department’s 25th anniversary celebrations.

Nils Lid Hjort gave a lecture on his “confidence distributions”, a way to represent uncertainty in the non-Bayesian framework. Although he gave examples where his representation seems to work best, I wondered how this could extend to cases where the parameter is not unidimensional.

Chris Yau received the 2010 Corcoran prize and gave a short talk on applications of HMMs togenetic data; he was unlucky to have his 15-minute talk interrupted by a fire alarm (but that allowed me to wonder at how calmly efficient the British are at evacuating in such situations). Luckily, my own talk suffered no such interruption.

Peter Donnelly demonstrated once again his amazing lecturing skills, with a highly informative talk on statistical inference of the history of the UK using genetic data.

All in all, a very enjoyable afternoon, which was followed by a lovely dinner at Somerville College, with several speeches on the past, present and future of Statistics at Oxford.

Thanks again to the Corcoran committe, especially Steffen Lauritzen, for selecting me as the prize winner!

MCMSki 4 session accepted


I juste received news from the organizers of MCMSki 4 that my proposed session on “Advances in Monte Carlo motivated by applications” has been accepted; it will be the first session I chair at a conference.

Three young and talented speakers have accepted to take part in this session, and I am looking forward to hearing what they have to say:

Alexis Muir-Watt will talk about PMCMC advances for the doubly-intractable problem of estimating a partial order, with application to the social order between 12th century bishops.

Lawrence Murray will also discuss PMCMC, for applications in the environmental sciences including marine biogeochemistry, soil carbon modelling and hurricane tracking.

Simon Barthelmé will discuss using quasi-Kronecker matrices to speed up MCMC for functional ANOVA problems, with an application in neurosciences.

All these speakers have been dealing with challenging data sets and models. These applications have led to methodological advances, and the session will at the same time showcase the variety of applications and the way they help BayesComp methodology progress.

MCMSki 4 will be held in Chamonix, January 6-8 2014.

Visiting UCLA


I shall be spending next week (4th-9th March) at the UCLA Department of Linguistics, where I shall give five hours of talks on Statistics for Historical Linguistics: a two-hour review of the field (Thursday morning), a two-hour presentation of my own DPhil work (pdf) (Friday afternoon), and a one-hour tutorial on TraitLab (Friday afternoon?). Abstracts:

1. Statistical methods in Historical Linguistics (Thursday morning)
Recent advances in our understanding of language change, in statistical methodology, and in computational power, along with an increasing wealth of available data, have allowed significant progress in statistical modelling of language change, and quantitative methods are gaining traction in Historical Linguistics. Models have been developed for the change through time of vocabulary, morpho-syntactic and phonetic traits. I shall present a review of these models (from a statistician’s point of view), starting with Morris Swadesh’s failed attempts at glottochronology, then looking at some models developed in the last decade. In parallel, I shall provide brief insights into statistical tools such as Bayesian statistics and Markov Chain Monte Carlo, in order to show how to use these effectively for linguistic applications.
2. A phylogenetic model of language diversification (Friday afternoon)
Language diversification is a random process similar in many ways to biological evolution. We model the diversification of so-called “core” lexical data by a stochastic process on a phylogenetic tree. We initially focus on the Indo-European language family. The age of the most recent common ancestor of these languages is of particular interest and issues of dating ancient languages have been subject to controversy. We use Markov Chain Monte Carlo to estimate the tree topology, internal node ages and model parameters. Our model includes several aspects specific to language diversification, such as rate heterogeneity and the data registration process, and we show that lexical borrowing does not bias our estimates. We show the robustness of our model through extensive validation and analyse two independent data sets to estimates the age of Proto-Indo-European. We then analyse a data set of Semitic languages, and show an extension of our model to explore whether languages evolve in “punctuational bursts”. Finally, we revisit an analysis of several small data sets by Bergsland & Vogt (1962).
Joint work with Geoff Nicholls

3. Tutorial and practical: TraitLab, a package for phylogenies of linguistic and cultural traits
In this tutorial, I shall present how to use the TraitLab package, which was initially developed specifically for the modelling of core vocabulary change through time, and guide interested attendants through an analysis of a simple data set. TraitLab is a software package for simulating, fitting and analysing tree-like binary data under a stochastic Dollo model of evolution. It handles “catastrophic” rate heterogeneity and missing data. The core of the package is a Markov chain Monte Carlo (MCMC) sampling algorithm that enables the user to sample from the Bayesian joint posterior distributions for tree topologies, clade and root ages, and the trait loss and catastrophe rates for a given data set. Data can be simulated according to the fitted Dollo model or according to a number of generalized models that allow for borrowing (horizontal transfer) of traits, heterogeneity in the trait loss rate and biases in the data collection  process. Both the raw data and the output of MCMC runs can be inspected using a number of useful graphical and analytical tools  provided within the package. TraitLab is freely available and runs within the Matlab computing environment.

Attendants who wish to use TraitLab during the practical should have a computer with Matlab installed.

TraitLab was developed jointly with Geoff Nicholls and David Welch.

Accepted: Wang-Landau Flat Histogram


Excellent news to start the week-end: our paper with Pierre Jacob The Wang-Landau Algorithm Reaches the Flat Histogram in Finite Time has been accepted for publication in Annals of Applied Probability.

For details, see this blog post or read the paper on arXiv.

Copyfraud at the CNRS


A CNRS unit, the Institut de l’information scientifique et technique (INIST), is performing blatant copyright misuse of academic articles.

Here is the short version of the problem as I understand it: INIST suscribes to many journals so that CNRS scientists may easily access papers, which is of course perfectly reasonable. However, INIST then uses the fact that it has access to these papers to sell them to non-CNRS readers. This seems to be done without authorisation or remuneration of the original journals and authors. Even papers which are open-access are sold for prices between 15€ and 60€ apiece, and there is no link to the free version on the journal’s website.

INIST has appealed a judgement declaring this illegal.

We are used to copyfraud in general, and to restrictions to open access in particular. But that a government unit is involved is particularly scandalous.

Much more detail is given on the blog “Droits d’auteur” (post 1, post 2) in French. There is also an online petition.

Pierre Jacob’s viva


Pierre Jacob defended his PhD on Monday, which was also the first time I was on the examiner bench at a PhD viva (much less stressful than being the candidate!). Unsurprisingly, the viva went very well, with laudatory reports. His thesis is a collection of 5(!) papers he co-wrote, including Free energy SMC, SMC², Block independent Methopolis-Hastings, Parallel Adaptive Wang-Landau (my personal favourite), and our collaboration on the Wang-Landau Flat Histogram.

I have been impressed not only by the quality of Pierre’s work, but also by his ability to start collaborations with many different researchers all around the world (co-authors in 6 countries already). He is now moving to Singapore for a postdoc with Ajay Jasra and has started a handful of very interesting projects, so there is much more to be expected!

The Wang-Landau algorithm reaches the flat histogram in finite time.


MCMC practitioners may be familiar with the Wang-Landau algorithm, which is widely used in Physics. This algorithm divides the sample space into “boxes”. Given a target distribution, the algorithm then samples proportionally to the target in each box, while aiming at spending a pre-defined proportion of the sample in each box. (Usually these predefined proportions are uniform.)

This strategy can help move faster between modes of a distribution, by forcing the sample to visit often the space between modes.

The most sophisticated versions of this algorithm combine a decreasing stochastic schedule and the so-called flat histogram criterion: whenever the proportions of the sample in each box are close enough to the desired frequencies, the stochastic schedule decreases. A decreasing schedule is necessary for diminishing adaptation to hold.

Until now, it was unknown whether the flat histogram is necessarily reached in finite time, and hence whether the schedule ever starts decreasing.

Pierre Jacob and I just submitted and arXived a proof that the flat histogram is reached in finite time under some conditions, and may never be reached in other cases.

Savage award honourable mention!


The results for the 2010 Savage award, bestowed by the International Society for Bayesian Analysis for PhD theses in Bayesian statistics, just came in. I am very pleased and honoured to have won the honourable mention in the Applied Methodology category! I shall be presenting the work (Phylogenetic Models of Language Diversification) at JSM in Miami in August.

I am very grateful to the award committee, and am as always thankful for the fantastic supervision I received from Geoff Nicholls.

This is a good year for French statistics, since Julien Cornebise won the award in the Theory category! Congratulations to the other winners, Ricardo Lemos and Daniel Williamson.

Interview in La Recherche


La Recherche, one of the leading French popular science magazines, has a monthly piece in which they present a Mathematician’s work in two pages. I had the honour and great pleasure of being featured in this month’s issue. Philippe Pajot interviewed me for over an hour about my work on models of language diversification and did a pretty good job at summarizing it in a few hundred words, with a presentation of the linguistic question, the data, and a mention of the issues of validation and error bars. There is also a brief attempt at explaining MCMC.

A scan is below, for French-speaking readers.

Pour la Science article on Indo-European expansion


Phylogenetic models of language diversification seem to be popular these days in French popular science magazines. Of the leading publications, La Recherche will feature an interview with yours truly in March, and Pour la Science has an 8-page cover story on the subject in the current issue.

Popularizer Ruth Berger looks at the expansion of the Indo-Europeans from genetic and linguistic points of view, trying to reconcile them and to decide between the Kurgan (horsemen) and Anatolian (farmers) possible origins of Indo-European expansion. For the linguistics half, she looks at phylogenetic models to infer genealogies and dates, but skips the methodology and reproduces directly trees by Gray & Atkinson (2003) and Atkinson et al. (2005).

It is a shame that the method is presented as a black box. Given the length of the article, it would have been possible to give a general idea of how dates are inferred: the ages of parts of the tree are known, and this information is used to estimate rates of change and other ages. Instead, the author suggests that the rates are already known [whence?] and are fed to the black box, which magically outputs a tree and dates.  There is barely anything about the uncertainty of the estimates, and nothing about validation. I also have trouble understanding the points made at the end about Linear A and the attempt to merge the Anatolian and Kurgan hypotheses.

This issue is number 400 of Pour la Science. According to the editor-in-chief, they decided to celebrate with an issue on “theories and models”. Indo-European expansion is one of their examples, along with pieces on Grand Unification and on gene transfers. Uncertainty and validation are major parts of any decent modelling endeavour, and it is a shame that they did not seize the opportunity to educate their readership about these issues.

I suppose it is hardly surprising that I am disappointed with a popular science paper on a topic related to my PhD…