## MCMSki 4 session accepted

11/04/2013

I juste received news from the organizers of MCMSki 4 that my proposed session on “Advances in Monte Carlo motivated by applications” has been accepted; it will be the first session I chair at a conference.

Three young and talented speakers have accepted to take part in this session, and I am looking forward to hearing what they have to say:

Alexis Muir-Watt will talk about PMCMC advances for the doubly-intractable problem of estimating a partial order, with application to the social order between 12th century bishops.

Lawrence Murray will also discuss PMCMC, for applications in the environmental sciences including marine biogeochemistry, soil carbon modelling and hurricane tracking.

Simon Barthelmé will discuss using quasi-Kronecker matrices to speed up MCMC for functional ANOVA problems, with an application in neurosciences.

All these speakers have been dealing with challenging data sets and models. These applications have led to methodological advances, and the session will at the same time showcase the variety of applications and the way they help BayesComp methodology progress.

MCMSki 4 will be held in Chamonix, January 6-8 2014.

## Visiting UCLA

28/02/2013

I shall be spending next week (4th-9th March) at the UCLA Department of Linguistics, where I shall give five hours of talks on Statistics for Historical Linguistics: a two-hour review of the field (Thursday morning), a two-hour presentation of my own DPhil work (pdf) (Friday afternoon), and a one-hour tutorial on TraitLab (Friday afternoon?). Abstracts:

1. Statistical methods in Historical Linguistics (Thursday morning)
Recent advances in our understanding of language change, in statistical methodology, and in computational power, along with an increasing wealth of available data, have allowed significant progress in statistical modelling of language change, and quantitative methods are gaining traction in Historical Linguistics. Models have been developed for the change through time of vocabulary, morpho-syntactic and phonetic traits. I shall present a review of these models (from a statistician’s point of view), starting with Morris Swadesh’s failed attempts at glottochronology, then looking at some models developed in the last decade. In parallel, I shall provide brief insights into statistical tools such as Bayesian statistics and Markov Chain Monte Carlo, in order to show how to use these effectively for linguistic applications.
2. A phylogenetic model of language diversification (Friday afternoon)
Language diversification is a random process similar in many ways to biological evolution. We model the diversification of so-called “core” lexical data by a stochastic process on a phylogenetic tree. We initially focus on the Indo-European language family. The age of the most recent common ancestor of these languages is of particular interest and issues of dating ancient languages have been subject to controversy. We use Markov Chain Monte Carlo to estimate the tree topology, internal node ages and model parameters. Our model includes several aspects specific to language diversification, such as rate heterogeneity and the data registration process, and we show that lexical borrowing does not bias our estimates. We show the robustness of our model through extensive validation and analyse two independent data sets to estimates the age of Proto-Indo-European. We then analyse a data set of Semitic languages, and show an extension of our model to explore whether languages evolve in “punctuational bursts”. Finally, we revisit an analysis of several small data sets by Bergsland & Vogt (1962).
Joint work with Geoff Nicholls

3. Tutorial and practical: TraitLab, a package for phylogenies of linguistic and cultural traits
In this tutorial, I shall present how to use the TraitLab package, which was initially developed specifically for the modelling of core vocabulary change through time, and guide interested attendants through an analysis of a simple data set. TraitLab is a software package for simulating, fitting and analysing tree-like binary data under a stochastic Dollo model of evolution. It handles “catastrophic” rate heterogeneity and missing data. The core of the package is a Markov chain Monte Carlo (MCMC) sampling algorithm that enables the user to sample from the Bayesian joint posterior distributions for tree topologies, clade and root ages, and the trait loss and catastrophe rates for a given data set. Data can be simulated according to the fitted Dollo model or according to a number of generalized models that allow for borrowing (horizontal transfer) of traits, heterogeneity in the trait loss rate and biases in the data collection  process. Both the raw data and the output of MCMC runs can be inspected using a number of useful graphical and analytical tools  provided within the package. TraitLab is freely available and runs within the Matlab computing environment.

Attendants who wish to use TraitLab during the practical should have a computer with Matlab installed.

TraitLab was developed jointly with Geoff Nicholls and David Welch.

## Bayesian model choice for the Poisson model

01/02/2013

Following Arthur Charpentier‘s example, I am going to try to post occasionally on material covered during my courses, in the hope that it might be useful to my students, but also to others.

In the second practical of the Bayesian Case Studies course, we looked at Bayesian model choice and basic Monte Carlo methods, looking at data about the number of oral questions asked by French deputies (Members of Parliament). We wanted to know whether male and female deputies behave differently in this matter. If we model the number of questions $Y_i$ asked by a deputy by a Poisson random variable and denote $Z_i$ their gender (1 for men, 2 for women), we thus need to choose between the following two models. (See the course website for the question sheet and other material; the slides are below.)

Model 1: $Y_i\sim Poisson(\lambda); \quad \lambda \sim \Gamma(a,b)$

Model 2: $Y_i|Z_i=k\sim Poisson(\lambda_k); \quad \lambda_1\sim \Gamma(a,b); \quad \lambda_2\sim \Gamma(a,b)$

 > deputes=read.csv2('http://www.ceremade.dauphine.fr/~ryder/CaseStudies/deputes.csv')
> attach(deputes)#Q1
> summary(questions_orales)
> hist(questions_orales,breaks=(0:6-.5))
> (n=length(questions_orales))
> (nh=sum(sexe=="H"))
> (nf=sum(sexe=="F"))
> (qtot=sum(questions_orales))
> (qh=sum(questions_orales[sexe=="H"]))
> (qf=sum(questions_orales[sexe=="F"]))

Since the Gamma prior is a conjugate prior for the Poisson model, we get a Gamma distribution as our posterior. Using a $\Gamma(2,2)$ as prior, here is a comparison of the prior and the posteriors for the two models (in the bottom plot, $\lambda_1$ is in black and $\lambda_2$ in red.

Two observations here: First, the posterior is much more peaked than the prior, which is expected since the number of data points is large. Second, in model 2, it seems that there may be a difference between men and women, with women asking slightly less questions in Parliament.To verify this, we compute a Bayes factor to compare the two models. In this case, the Bayes factor is available in closed form, but an initial computation does not work because $n$ is too large:

> BF_analytical=function(a,b,y1,n1,y2,n2){
+   m1=b^a/gamma(a)*gamma(a+y1+y2)/(b+n1+n2)^(a+y1+y2)
+   m2=b^(2*a)/gamma(a)^2*gamma(a+y1)/(b+n1)^(a+y1)*gamma(a+y2)/(b+n2)^(a+y2)
+   return(m2/m1)
+ }
> BF_analytical(2,2,qh,nh,qf,nf)
[1] NaN
Warning messages:
1: In BF_analytical(2, 2, qh, nh, qf, nf) :
value out of range in 'gammafn'
2: In BF_analytical(2, 2, qh, nh, qf, nf) :
value out of range in 'gammafn'

However, we are able to compute the Bayes factor analytically by using the log-scale instead.

> BF_analytical2=function(a,b,y1,n1,y2,n2){
+   m1=a*log(b)-lgamma(a)+lgamma(a+y1+y2)-(a+y1+y2)*log(b+n1+n2)
+   m2=2*a*log(b)-2*lgamma(a)+lgamma(a+y1)-(a+y1)*log(b+n1)+lgamma(a+y2)-(a+y2)*log(b+n2)
+   return(exp(m2-m1))
+ }
> BF_analytical2(2,2,qh,nh,qf,nf)
[1] 0.1875569
> log10(BF_analytical2(2,2,qh,nh,qf,nf))
[1] -0.726867

Using Jeffrey’s scale of evidence, this corresponds to “substantial” evidence in favour of model 1 (i.e. no difference between men and women).

Even if we had not been able to calculate the Bayes factor exactly, we could have approximated it using Monte Carlo. We first create two functions to calculate the likelihood under model 1 and model 2, and start with vanilla Monte Carlo.

> lkd.model1=function(y,n,lambda){  return(exp(-n*lambda+y*log(lambda))) }
> lkd.model2=function(y1,n1,y2,n2,lambda1,lambda2){  return(exp(-n1*lambda1+y1*log(lambda1)-n2*lambda2+y2*log(lambda2))) }

> BF_MC=function(a,b,y1,n1,y2,n2,M){
+  lambda1=rgamma(M,a,b)
+  m1=cumsum(lkd.model1(y1+y2,n1+n2,lambda1))/(1:M)
+ lambda2.1=rgamma(M,a,b)
+ lambda2.2=rgamma(M,a,b)
+  m2=cumsum(lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2))/(1:M)
+  return(m2/m1)
+ }

> M=100000
> BF_MC(2,2,qh,nh,qf,nf,M)[M]
[1] 0.1776931
> plot(100:M,BF_MC(2,2,qh,nh,qf,nf,M)[100:M], type="l")

The last line creates a plot of how our Monte Carlo estimate of the Bayes factor changes as the number of iterations $M$ increases. It is useful to check convergence.

Here, it is clear that the method converges after a few tens of thousands iterations. Our estimate for $M=10^5$ is close to the true value (two decimal places precision). Next, we try the harmonic mean method:

> BF_HM=function(a,b,y1,n1,y2,n2,M){
+   lambda1=rgamma(M,a+y1+y2,b+n1+n2)
+   m1=(1:M)/(cumsum(1/lkd.model1(y1+y2,n1+n2,lambda1)))
+  lambda2.1=rgamma(M,a+y1,b+n1)
+  lambda2.2=rgamma(M,a+y2,b+n2)
+  m2=(1:M)/cumsum(1/lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2))
+   return(m2/m1)
+ }

> BF_HM(2,2,qh,nh,qf,nf,M)[M]
[1] 0.2451013
> plot(100:M,BF_HM(2,2,qh,nh,qf,nf,M)[100:M], type="l")


We observe that this does not appear to converge. It is known that in general, the harmonic mean method is unreliable.

We finally move on to importance sampling. In importance sampling, we want to estimate $E[h(X)]$ where $X\sim f$ by using simulations from an alternate density $g$. This will be most efficient when $g$ is close to $h\times f$. In this case, we therefore need $g$ to be similar to the posterior distribution, so we take a normal density with mean and variance the same as the posterior distribution:

> BF_IS=function(a,b,y1,n1,y2,n2,M){
+  mean1=(a+y1+y2)/(b+n1+n2)
+  mean2.1=(a+y1)/(b+n1)
+  mean2.2=(a+y2)/(b+n2)

+  sigma1=sqrt((a+y1+y2)/(b+n1+n2)^2)
+  sigma2.1=sqrt((a+y1)/(b+n1)^2)
+  sigma2.2=sqrt((a+y2)/(b+n2)^2)

+  lambda1=rnorm(M,mean1,sigma1)
+  m1=cumsum( lkd.model1(y1+y2,n1+n2,lambda1) * dgamma(lambda1,a, b) / dnorm(lambda1,mean1,sigma1))/(1:M)

+  lambda2.1=rnorm(M,mean2.1, sigma2.1)
+  lambda2.2=rnorm(M,mean2.2, sigma2.2)
+  m2=cumsum( lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2) * dgamma(lambda2.1,a, b) * dgamma(lambda2.2,a,b) /
+    (dnorm(lambda2.1,mean2.1,sigma2.1)*dnorm(lambda2.2,mean2.2,sigma2.2))) / (1:M)

+  return(m2/m1)
+ }

> BF_IS(2,2,qh,nh,qf,nf,M)[M]
[1] 0.1875875
> plot(100:M,BF_IS(2,2,qh,nh,qf,nf,M)[100:M], type="l")

This method converges. Note that the y-scale is much tighter than for vanilla Monte Carlo: convergence is in fact much faster, and we get a more precise estimate than by using vanilla Monte Carlo. This is more obvious if we look at both methods on the same graph.

> plot(100:M,BF_MC(2,2,qh,nh,qf,nf,M)[100:M], type="l")
> lines(100:M,BF_IS(2,2,qh,nh,qf,nf,M)[100:M], col="red")


It is much more visible now that importance sampling is a clear winner.

With this computation of the Bayes factor, we choose model 1: a single parameter can be used to model the behaviour of men and women.

23/01/2013

I make many resolutions, and keep hardly any on the long term, which I think is very common. For about two years, I have been using Beeminder, a tool developed by the recently mentioned Dan Reeves, which is fantastic for number-inclined people (which includes most readers of this blog, presumably).

For example, I have a stack of academic papers to read which keeps growing. I have set the (modest) goal to read at least 2 of those papers per week on average, i.e. that I read 104 papers in a year; others might want to lose 10 kg by July, or get their number of unread e-mails back to 0 by next month. Usually, such resolutions have the following effect: during the first couple of weeks, I will indeed read 2 papers/week; in week 3, I’ll read only 1, but think that I’ll catch up the next week, which I never do; by week 4, I’ve got so much catching up to do that I might as well give up on the resolution.

Beeminder helps check that I stay on track constantly. On any given day, I must be on the “yellow brick road” which leads to my goal, and which allows some leeway based on the variance of data entered up until now. A big goal in the far future is thus transformed into a small goal in the near future. In the graph below, the number at the top left is the number of days I have before I must read another paper to avoid derailing.

If you need an extra incentive, you can pledge money, which is only charged in you fail at your goal (you are allowed to fail your goal once for free; if you fail and want to start again, you need to promise to pay should you fail a second time).

This is all explained in further detail on Beeminder’s blog. Apart from academic papers to read, I am reading how much I swim, boring admin tasks, and several other goals I would not keep otherwise, and I have found Beeminder to be a great way to achieve these goals. Give it a try!

## Accepted: Wang-Landau Flat Histogram

14/12/2012

Excellent news to start the week-end: our paper with Pierre Jacob The Wang-Landau Algorithm Reaches the Flat Histogram in Finite Time has been accepted for publication in Annals of Applied Probability.

For details, see this blog post or read the paper on arXiv.

## Contigency exigency

14/12/2012

The SIGecom Exchanges, an engineering and economics journal on e-commerce, run a puzzle in each issue, written by Daniel Reeves (more on Dan’s awesomeness soon). The December 2011 puzzle asked to find a scheme to pay a lawyer when there is uncertainty in the fruits of their labour, while still taking into account the amount of labour (see here for the complete puzzle).

After 6 months without a solution, Dan offered a bounty for this puzzle, where the award would be a realization of a uniform draw between 0 and 500 dollars (realization given by geohashing). After some extra nerdery to choose the winner of the bounty between Arthur Breitman and myself, I ended up winning \$65.14, and my solution appears in the latest issue of the SIGecom Exchanges.

This was great fun; other journals should pick up the idea!

## Copyfraud at the CNRS

02/11/2012

A CNRS unit, the Institut de l’information scientifique et technique (INIST), is performing blatant copyright misuse of academic articles.

Here is the short version of the problem as I understand it: INIST suscribes to many journals so that CNRS scientists may easily access papers, which is of course perfectly reasonable. However, INIST then uses the fact that it has access to these papers to sell them to non-CNRS readers. This seems to be done without authorisation or remuneration of the original journals and authors. Even papers which are open-access are sold for prices between 15€ and 60€ apiece, and there is no link to the free version on the journal’s website.

INIST has appealed a judgement declaring this illegal.

We are used to copyfraud in general, and to restrictions to open access in particular. But that a government unit is involved is particularly scandalous.

Much more detail is given on the blog “Droits d’auteur” (post 1, post 2) in French. There is also an online petition.

## Pierre Jacob’s viva

07/09/2012

Pierre Jacob defended his PhD on Monday, which was also the first time I was on the examiner bench at a PhD viva (much less stressful than being the candidate!). Unsurprisingly, the viva went very well, with laudatory reports. His thesis is a collection of 5(!) papers he co-wrote, including Free energy SMC, SMC², Block independent Methopolis-Hastings, Parallel Adaptive Wang-Landau (my personal favourite), and our collaboration on the Wang-Landau Flat Histogram.

I have been impressed not only by the quality of Pierre’s work, but also by his ability to start collaborations with many different researchers all around the world (co-authors in 6 countries already). He is now moving to Singapore for a postdoc with Ajay Jasra and has started a handful of very interesting projects, so there is much more to be expected!

## PLoS scientific paper gets also published on Wikipedia

21/05/2012

Wikipedia is awesome, and we all rely on it on a day-to-day basis. For basic statistical topics, I usually find it very reliable. For more advanced topics, however, coverage is often sketchy or even non-existent. The article on ABC is exceedingly short; the one on the Wang-Landau algorithm is rather odd; the on on quantitative comparative linguistics is essentially a long list.

I often get annoyed when people complain about Wikipedia, for two reasons: 1. Although imperfect, it is pretty good, and often better than other more widely accepted general knowledge encyclopedias; 2. If you find a problem, you should fix it rather than complain.

Although I am far from being the most active contributor on Wikipedia, I edit it quite a lot, and am an administrator of Wikimedia Commons. But I have hardly done any contributions in Statistics, my area of expertise. The reason is simple: when I do Statistics, it is for my job, do I either do research or prepare lectures. Editing Wikipedia is a hobby, so I want it to be separate from my job.

But this is sub-optimal: Wikipedia could be improved if more specialists were to participate. Also, and I should try not to be too depressed by this, anything I write on Wikipedia will probably be read thousands time more than the scientific papers I write. One could thus argue that it would be an efficient use of my time for me and other statisticians to edit Wikipedia on Statistics topics. One could even go further and state that it should be one of our duties.

There are several obstacles here. The first obstacle is that Wikipedia’s model is that it can be edited by anyone; other encyclopedias which attempt to have articles written by specialists are nowhere near as successful (e.g. Citizendium). Some kind of assurance that an expert’s edits are given more importance than a layman’s could incentivize more experts to participate. I actually believe that in the large majority of cases, an expert’s contribution would be recognized as such and that no one would try to deconstruct it (most exceptions would be in topics subject to hot debate either in the scientific community or in the media). A second obstacle is that editing Wikipedia is not recognized in the career path of scientists: I might want scientists in general to participate in Wikipedia; but I won’t do it, because I really need to finish these three papers and grant proposals (a kind of NIMBY issue). A third obstacle is that scientists simply cannot be bothered to edit Wikipedia.

PLoS Computational Biology has started an initiative to address these obstacles. In a nutshell: scientists write a review paper. It gets published in the journal, and the scientists get a publication on their CV. At the same time, the paper gets copied over to Wikipedia. The initial version is marked as being written by specialists, but anyone can edit it (e.g. improve the wording, add relevant links, etc.). Everyone wins. The first such article is on Circular permutation in proteins.

Obviously, scientists should also edit Wikipedia even when they don’t get a publication out of it. But such initiatives will help get scientists involved. Explaining scientific topics to the general public is one of the duties of scientists, and Wikipedia is one of the best tools to do this.

## The Wang-Landau algorithm reaches the flat histogram in finite time.

19/10/2011

MCMC practitioners may be familiar with the Wang-Landau algorithm, which is widely used in Physics. This algorithm divides the sample space into “boxes”. Given a target distribution, the algorithm then samples proportionally to the target in each box, while aiming at spending a pre-defined proportion of the sample in each box. (Usually these predefined proportions are uniform.)

This strategy can help move faster between modes of a distribution, by forcing the sample to visit often the space between modes.

The most sophisticated versions of this algorithm combine a decreasing stochastic schedule and the so-called flat histogram criterion: whenever the proportions of the sample in each box are close enough to the desired frequencies, the stochastic schedule decreases. A decreasing schedule is necessary for diminishing adaptation to hold.

Until now, it was unknown whether the flat histogram is necessarily reached in finite time, and hence whether the schedule ever starts decreasing.

Pierre Jacob and I just submitted and arXived a proof that the flat histogram is reached in finite time under some conditions, and may never be reached in other cases.