Bayesian model choice for the Poisson model


Following Arthur Charpentier‘s example, I am going to try to post occasionally on material covered during my courses, in the hope that it might be useful to my students, but also to others.

In the second practical of the Bayesian Case Studies course, we looked at Bayesian model choice and basic Monte Carlo methods, looking at data about the number of oral questions asked by French deputies (Members of Parliament). We wanted to know whether male and female deputies behave differently in this matter. If we model the number of questions Y_i asked by a deputy by a Poisson random variable and denote Z_i their gender (1 for men, 2 for women), we thus need to choose between the following two models. (See the course website for the question sheet and other material; the slides are below.)

Slides2 from Robin Ryder

Model 1: Y_i\sim Poisson(\lambda); \quad \lambda \sim \Gamma(a,b)

Model 2: Y_i|Z_i=k\sim Poisson(\lambda_k); \quad \lambda_1\sim \Gamma(a,b); \quad \lambda_2\sim \Gamma(a,b)

First download the data and set some variables:

 > deputes=read.csv2('')
 > attach(deputes)#Q1
 > summary(questions_orales)
 > hist(questions_orales,breaks=(0:6-.5))
 > (n=length(questions_orales))
 > (nh=sum(sexe=="H"))
 > (nf=sum(sexe=="F"))
 > (qtot=sum(questions_orales))
 > (qh=sum(questions_orales[sexe=="H"]))
 > (qf=sum(questions_orales[sexe=="F"]))

Since the Gamma prior is a conjugate prior for the Poisson model, we get a Gamma distribution as our posterior. Using a \Gamma(2,2) as prior, here is a comparison of the prior and the posteriors for the two models (in the bottom plot, \lambda_1 is in black and \lambda_2 in red.


Two observations here: First, the posterior is much more peaked than the prior, which is expected since the number of data points is large. Second, in model 2, it seems that there may be a difference between men and women, with women asking slightly less questions in Parliament.To verify this, we compute a Bayes factor to compare the two models. In this case, the Bayes factor is available in closed form, but an initial computation does not work because n is too large:

> BF_analytical=function(a,b,y1,n1,y2,n2){
+   m1=b^a/gamma(a)*gamma(a+y1+y2)/(b+n1+n2)^(a+y1+y2)
+   m2=b^(2*a)/gamma(a)^2*gamma(a+y1)/(b+n1)^(a+y1)*gamma(a+y2)/(b+n2)^(a+y2)
+   return(m2/m1)
+ }
> BF_analytical(2,2,qh,nh,qf,nf)
[1] NaN
Warning messages:
1: In BF_analytical(2, 2, qh, nh, qf, nf) :
  value out of range in 'gammafn'
2: In BF_analytical(2, 2, qh, nh, qf, nf) :
  value out of range in 'gammafn'

However, we are able to compute the Bayes factor analytically by using the log-scale instead.

> BF_analytical2=function(a,b,y1,n1,y2,n2){
+   m1=a*log(b)-lgamma(a)+lgamma(a+y1+y2)-(a+y1+y2)*log(b+n1+n2)
+   m2=2*a*log(b)-2*lgamma(a)+lgamma(a+y1)-(a+y1)*log(b+n1)+lgamma(a+y2)-(a+y2)*log(b+n2)
+   return(exp(m2-m1))
+ }
> BF_analytical2(2,2,qh,nh,qf,nf)
[1] 0.1875569
> log10(BF_analytical2(2,2,qh,nh,qf,nf))
[1] -0.726867

Using Jeffrey’s scale of evidence, this corresponds to “substantial” evidence in favour of model 1 (i.e. no difference between men and women).

Even if we had not been able to calculate the Bayes factor exactly, we could have approximated it using Monte Carlo. We first create two functions to calculate the likelihood under model 1 and model 2, and start with vanilla Monte Carlo.

> lkd.model1=function(y,n,lambda){  return(exp(-n*lambda+y*log(lambda))) }
> lkd.model2=function(y1,n1,y2,n2,lambda1,lambda2){  return(exp(-n1*lambda1+y1*log(lambda1)-n2*lambda2+y2*log(lambda2))) }

> BF_MC=function(a,b,y1,n1,y2,n2,M){
 +  lambda1=rgamma(M,a,b)
 +  m1=cumsum(lkd.model1(y1+y2,n1+n2,lambda1))/(1:M)
 + lambda2.1=rgamma(M,a,b)
 + lambda2.2=rgamma(M,a,b)
 +  m2=cumsum(lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2))/(1:M)
 +  return(m2/m1)
 + }

> M=100000
> BF_MC(2,2,qh,nh,qf,nf,M)[M]
[1] 0.1776931
> plot(100:M,BF_MC(2,2,qh,nh,qf,nf,M)[100:M], type="l")

The last line creates a plot of how our Monte Carlo estimate of the Bayes factor changes as the number of iterations M increases. It is useful to check convergence.vanilla

Here, it is clear that the method converges after a few tens of thousands iterations. Our estimate for M=10^5 is close to the true value (two decimal places precision). Next, we try the harmonic mean method:

> BF_HM=function(a,b,y1,n1,y2,n2,M){
+   lambda1=rgamma(M,a+y1+y2,b+n1+n2)
+   m1=(1:M)/(cumsum(1/lkd.model1(y1+y2,n1+n2,lambda1)))
 +  lambda2.1=rgamma(M,a+y1,b+n1)
 +  lambda2.2=rgamma(M,a+y2,b+n2)
 +  m2=(1:M)/cumsum(1/lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2))
+   return(m2/m1)  
+ }

> BF_HM(2,2,qh,nh,qf,nf,M)[M]
[1] 0.2451013
> plot(100:M,BF_HM(2,2,qh,nh,qf,nf,M)[100:M], type="l")

We observe that this does not appear to converge. It is known that in general, the harmonic mean method is unreliable.

We finally move on to importance sampling. In importance sampling, we want to estimate E[h(X)] where X\sim f by using simulations from an alternate density g. This will be most efficient when g is close to $h\times f$. In this case, we therefore need g to be similar to the posterior distribution, so we take a normal density with mean and variance the same as the posterior distribution:

> BF_IS=function(a,b,y1,n1,y2,n2,M){
+  mean1=(a+y1+y2)/(b+n1+n2)
+  mean2.1=(a+y1)/(b+n1)
+  mean2.2=(a+y2)/(b+n2)

+  sigma1=sqrt((a+y1+y2)/(b+n1+n2)^2)
+  sigma2.1=sqrt((a+y1)/(b+n1)^2)
+  sigma2.2=sqrt((a+y2)/(b+n2)^2)

+  lambda1=rnorm(M,mean1,sigma1)
+  m1=cumsum( lkd.model1(y1+y2,n1+n2,lambda1) * dgamma(lambda1,a, b) / dnorm(lambda1,mean1,sigma1))/(1:M)

+  lambda2.1=rnorm(M,mean2.1, sigma2.1)
+  lambda2.2=rnorm(M,mean2.2, sigma2.2)
+  m2=cumsum( lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2) * dgamma(lambda2.1,a, b) * dgamma(lambda2.2,a,b) / 
+    (dnorm(lambda2.1,mean2.1,sigma2.1)*dnorm(lambda2.2,mean2.2,sigma2.2))) / (1:M)

+  return(m2/m1)
+ }

> BF_IS(2,2,qh,nh,qf,nf,M)[M]
[1] 0.1875875
> plot(100:M,BF_IS(2,2,qh,nh,qf,nf,M)[100:M], type="l")

ISThis method converges. Note that the y-scale is much tighter than for vanilla Monte Carlo: convergence is in fact much faster, and we get a more precise estimate than by using vanilla Monte Carlo. This is more obvious if we look at both methods on the same graph.

> plot(100:M,BF_MC(2,2,qh,nh,qf,nf,M)[100:M], type="l") 
> lines(100:M,BF_IS(2,2,qh,nh,qf,nf,M)[100:M], col="red")

It is much more visible now that importance sampling is a clear winner.

With this computation of the Bayes factor, we choose model 1: a single parameter can be used to model the behaviour of men and women.

Beeminder will help you keep your resolutions


I make many resolutions, and keep hardly any on the long term, which I think is very common. For about two years, I have been using Beeminder, a tool developed by the recently mentioned Dan Reeves, which is fantastic for number-inclined people (which includes most readers of this blog, presumably).

For example, I have a stack of academic papers to read which keeps growing. I have set the (modest) goal to read at least 2 of those papers per week on average, i.e. that I read 104 papers in a year; others might want to lose 10 kg by July, or get their number of unread e-mails back to 0 by next month. Usually, such resolutions have the following effect: during the first couple of weeks, I will indeed read 2 papers/week; in week 3, I’ll read only 1, but think that I’ll catch up the next week, which I never do; by week 4, I’ve got so much catching up to do that I might as well give up on the resolution.

Beeminder helps check that I stay on track constantly. On any given day, I must be on the “yellow brick road” which leads to my goal, and which allows some leeway based on the variance of data entered up until now. A big goal in the far future is thus transformed into a small goal in the near future. In the graph below, the number at the top left is the number of days I have before I must read another paper to avoid derailing.

If you need an extra incentive, you can pledge money, which is only charged in you fail at your goal (you are allowed to fail your goal once for free; if you fail and want to start again, you need to promise to pay should you fail a second time).

This is all explained in further detail on Beeminder’s blog. Apart from academic papers to read, I am reading how much I swim, boring admin tasks, and several other goals I would not keep otherwise, and I have found Beeminder to be a great way to achieve these goals. Give it a try!

Accepted: Wang-Landau Flat Histogram


Excellent news to start the week-end: our paper with Pierre Jacob The Wang-Landau Algorithm Reaches the Flat Histogram in Finite Time has been accepted for publication in Annals of Applied Probability.

For details, see this blog post or read the paper on arXiv.

Contigency exigency


The SIGecom Exchanges, an engineering and economics journal on e-commerce, run a puzzle in each issue, written by Daniel Reeves (more on Dan’s awesomeness soon). The December 2011 puzzle asked to find a scheme to pay a lawyer when there is uncertainty in the fruits of their labour, while still taking into account the amount of labour (see here for the complete puzzle).

After 6 months without a solution, Dan offered a bounty for this puzzle, where the award would be a realization of a uniform draw between 0 and 500 dollars (realization given by geohashing). After some extra nerdery to choose the winner of the bounty between Arthur Breitman and myself, I ended up winning $65.14, and my solution appears in the latest issue of the SIGecom Exchanges.

This was great fun; other journals should pick up the idea!

Copyfraud at the CNRS


A CNRS unit, the Institut de l’information scientifique et technique (INIST), is performing blatant copyright misuse of academic articles.

Here is the short version of the problem as I understand it: INIST suscribes to many journals so that CNRS scientists may easily access papers, which is of course perfectly reasonable. However, INIST then uses the fact that it has access to these papers to sell them to non-CNRS readers. This seems to be done without authorisation or remuneration of the original journals and authors. Even papers which are open-access are sold for prices between 15€ and 60€ apiece, and there is no link to the free version on the journal’s website.

INIST has appealed a judgement declaring this illegal.

We are used to copyfraud in general, and to restrictions to open access in particular. But that a government unit is involved is particularly scandalous.

Much more detail is given on the blog “Droits d’auteur” (post 1, post 2) in French. There is also an online petition.

Pierre Jacob’s viva


Pierre Jacob defended his PhD on Monday, which was also the first time I was on the examiner bench at a PhD viva (much less stressful than being the candidate!). Unsurprisingly, the viva went very well, with laudatory reports. His thesis is a collection of 5(!) papers he co-wrote, including Free energy SMC, SMC², Block independent Methopolis-Hastings, Parallel Adaptive Wang-Landau (my personal favourite), and our collaboration on the Wang-Landau Flat Histogram.

I have been impressed not only by the quality of Pierre’s work, but also by his ability to start collaborations with many different researchers all around the world (co-authors in 6 countries already). He is now moving to Singapore for a postdoc with Ajay Jasra and has started a handful of very interesting projects, so there is much more to be expected!

PLoS scientific paper gets also published on Wikipedia


Wikipedia is awesome, and we all rely on it on a day-to-day basis. For basic statistical topics, I usually find it very reliable. For more advanced topics, however, coverage is often sketchy or even non-existent. The article on ABC is exceedingly short; the one on the Wang-Landau algorithm is rather odd; the on on quantitative comparative linguistics is essentially a long list.

I often get annoyed when people complain about Wikipedia, for two reasons: 1. Although imperfect, it is pretty good, and often better than other more widely accepted general knowledge encyclopedias; 2. If you find a problem, you should fix it rather than complain.

Although I am far from being the most active contributor on Wikipedia, I edit it quite a lot, and am an administrator of Wikimedia Commons. But I have hardly done any contributions in Statistics, my area of expertise. The reason is simple: when I do Statistics, it is for my job, do I either do research or prepare lectures. Editing Wikipedia is a hobby, so I want it to be separate from my job.

But this is sub-optimal: Wikipedia could be improved if more specialists were to participate. Also, and I should try not to be too depressed by this, anything I write on Wikipedia will probably be read thousands time more than the scientific papers I write. One could thus argue that it would be an efficient use of my time for me and other statisticians to edit Wikipedia on Statistics topics. One could even go further and state that it should be one of our duties.

There are several obstacles here. The first obstacle is that Wikipedia’s model is that it can be edited by anyone; other encyclopedias which attempt to have articles written by specialists are nowhere near as successful (e.g. Citizendium). Some kind of assurance that an expert’s edits are given more importance than a layman’s could incentivize more experts to participate. I actually believe that in the large majority of cases, an expert’s contribution would be recognized as such and that no one would try to deconstruct it (most exceptions would be in topics subject to hot debate either in the scientific community or in the media). A second obstacle is that editing Wikipedia is not recognized in the career path of scientists: I might want scientists in general to participate in Wikipedia; but I won’t do it, because I really need to finish these three papers and grant proposals (a kind of NIMBY issue). A third obstacle is that scientists simply cannot be bothered to edit Wikipedia.

PLoS Computational Biology has started an initiative to address these obstacles. In a nutshell: scientists write a review paper. It gets published in the journal, and the scientists get a publication on their CV. At the same time, the paper gets copied over to Wikipedia. The initial version is marked as being written by specialists, but anyone can edit it (e.g. improve the wording, add relevant links, etc.). Everyone wins. The first such article is on Circular permutation in proteins.

Obviously, scientists should also edit Wikipedia even when they don’t get a publication out of it. But such initiatives will help get scientists involved. Explaining scientific topics to the general public is one of the duties of scientists, and Wikipedia is one of the best tools to do this.

The Wang-Landau algorithm reaches the flat histogram in finite time.


MCMC practitioners may be familiar with the Wang-Landau algorithm, which is widely used in Physics. This algorithm divides the sample space into “boxes”. Given a target distribution, the algorithm then samples proportionally to the target in each box, while aiming at spending a pre-defined proportion of the sample in each box. (Usually these predefined proportions are uniform.)

This strategy can help move faster between modes of a distribution, by forcing the sample to visit often the space between modes.

The most sophisticated versions of this algorithm combine a decreasing stochastic schedule and the so-called flat histogram criterion: whenever the proportions of the sample in each box are close enough to the desired frequencies, the stochastic schedule decreases. A decreasing schedule is necessary for diminishing adaptation to hold.

Until now, it was unknown whether the flat histogram is necessarily reached in finite time, and hence whether the schedule ever starts decreasing.

Pierre Jacob and I just submitted and arXived a proof that the flat histogram is reached in finite time under some conditions, and may never be reached in other cases.

New position


As of this month, I am no longer a postdoc, but an assistant professor (maître de conférences) at Université Paris Dauphine.

My new contact details are:

Robin Ryder
Bureau C608
CEREMADE − Université Paris Dauphine
Place du Maréchal de Lattre de Tassigny
75116 Paris
Phone: (+33) 1 44 05 46 71

My e-mail is unchanged.

Greek Stochastics γ


The Greek Stochastics γ conference was held last week in Crete and focused this year on MCMC methods (last year was of statistics for biology). The schedule allowed little time for presentations from participants, but there were several “short courses”. In particular, Gareth Roberts gave a rather illuminating explanation on convergence of Gibbs samplers, in which each step of the sampler is viewed as a projection in a functional space.

In a Gibbs sampler with two steps (1. update X given \theta; 2. update \theta given X), step 1 is a projection (recall that P is a projection iff P^2=P, and it is clear than performing step 1 twice in a row is the same as performing it only once); the same is true of step 2. The spaces we are projecting on are ugly, but if you view them as lines, it is easy to see that the successive iterations converge to the intersection of the two lines, which corresponds to a fixed point, as shown in this very ugly figure. The convergence is faster if the two lines are  close to orthogonal, which corresponds to correlation close to 0 in the Gibbs sampler. Apparently, this line of thought was initiated by Yael Amit. I find it very helpful.

Another fascinating talk was given by David Spiegelhalter (the “general interest” talk), on communicating in Statistics. David studies the question “Why do people find probability unintuitive and difficult?”; he suggests that the answer in “because probability in unintuitive and difficult”. He gave a startling example: in a phone survey, if you ask which is the greater risk between 1/100, 1/10 and 1/1000, about 25-30% of people give the wrong answer. Media reports often use such notation, with the numerator fixed to 1. The message would be better understood if probabilities were given with a fixed denominator (1/100, 10/100, 0.1/100). David had many other suggestions on explaining risk and uncertainty to non-statisticians, all very useful.