Reproducing the kidney cancer example from BDA


This is an attempt at reproducing the analysis of Section 2.7 of Bayesian Data Analysis, 3rd edition (Gelman et al.), on kidney cancer rates in the USA in the 1980s. I have done my best to clean the data from the original. Andrew wrote a blog post to “disillusion [us] about the reproducibility of textbook analysis”, in which he refers to this example. This might then be an attempt at reillusionment…

The cleaner data are on GitHub, as is the RMarkDown of this analysis.

d = read.csv("KidneyCancerClean.csv", skip=4)

In the data, the columns dc and dc.2 correspond (I think) to the death counts due to kidney cancer in each county of the USA, respectively in 1908-84 and 1985-89. The columns pop and pop.2 are some measure of the population in the counties. It is not clear to me what the other columns represent.

Simple model

Let n_j be the population on county j, and K_j the number of kidney cancer deaths in that county between 1980 and 1989. A simple model is K_j\sim Poisson(\theta_j n_j) where \theta_j is the unknown parameter of interest, representing the incidence of kidney cancer in that county. The maximum likelihood estimator is \hat\theta_j=\frac{K_j}{n_j}.

d$dct = d$dc + d$dc.2
d$popm = (d$pop + d$pop.2) / 2
d$thetahat = d$dct / d$popm

In particular, the original question is to understand these two maps, which show the counties in the first and last decile for kidney cancer deaths.

q = quantile(d$thetahat, c(.1, .9))
d$cancerlow = d$thetahat <= q[1] d$cancerhigh = d$thetahat >= q[2]
plot_usmap("counties", data=d, values="cancerhigh") +
  scale_fill_discrete(h.start = 200, 
                      name = "Large rate of kidney cancer deaths") 

plot of chunk unnamed-chunk-4

plot_usmap("counties", data=d, values="cancerlow") +
  scale_fill_discrete(h.start = 200, 
                      name = "Low rate of kidney cancer deaths") 

plot of chunk unnamed-chunk-4

These maps are suprising, because the counties with the highest kidney cancer death rate, and those with the lowest, are somewhat similar: mostly counties in the middle of the map.

(Also, note that the data for Alaska are missing. You can hide Alaska on the maps by adding the parameter include = statepop$full[-2] to calls to plot_usmap.)

The reason for this pattern (as explained in BDA3) is that these are counties with a low population. Indeed, a typical value for \hat\theta_j is around 0.0001. Take a county with a population of 1000. It is likely to have no kidney cancer deaths, giving \hat\theta_j=0 and putting it in the first decile. But if it happens to have a single death, the estimated rate jumps to \hat\theta_j=0.001 (10 times the average rate), putting it in the last decile.

This is hinted at in this histogram of the (\theta_j):

ggplot(data=d, aes(d$thetahat)) + 
  geom_histogram(bins=30, fill="lightblue") + 
  labs(x="Estimated kidney cancer death rate (maximum likelihood)", 
       y="Number of counties") +
  xlim(c(-1e-5, 5e-4))

plot of chunk unnamed-chunk-5

Bayesian approach

If you have ever followed a Bayesian modelling course, you are probably screaming that this calls for a hierarchical model. I agree (and I’m pretty sure the authors of BDA do as well), but here is a more basic Bayesian approach. Take a common \Gamma(\alpha, \beta) distribution for all the (\theta_j); I’ll go for \alpha=15 and \beta = 200\ 000, which is slightly vaguer than the prior used in BDA. Obviously, you should try various values of the prior parameters to check their influence.

The prior is conjugate, so the posterior is \theta_j|K_j \sim \Gamma(\alpha + K_j, \beta + n_j). For small counties, the posterior will be extremely close to the prior; for larger counties, the likelihood will take over.

It is usually a shame to use only point estimates, but here it will be sufficient: let us compute the posterior mean of \theta_j. Because the prior has a strong impact on counties with low population, the histogram looks very different:

alpha = 15
beta = 2e5
d$thetabayes = (alpha + d$dct) / (beta + d$pop)
ggplot(data=d, aes(d$thetabayes)) + 
  geom_histogram(bins=30, fill="lightblue") + 
  labs(x="Estimated kidney cancer death rate (posterior mean)", 
       y="Number of counties") +
  xlim(c(-1e-5, 5e-4))

plot of chunk unnamed-chunk-6

And the maps of counties in the first and last decile are now much easier to distinguish; for instance, Florida and New England are heavily represented in the last decile. The counties represented here are mostly populated counties: these are counties for which we have reason to believe that they are on the lower or higher end for kidney cancer death rates.

qb = quantile(d$thetabayes, c(.1, .9))
d$bayeslow = d$thetabayes <= qb[1] d$bayeshigh = d$thetabayes >= qb[2]
plot_usmap("counties", data=d, values="bayeslow") +
    h.start = 200, 
    name = "Low kidney cancer death rate (Bayesian inference)")  

plot of chunk unnamed-chunk-7

plot_usmap("counties", data=d, values="bayeshigh") +
    h.start = 200, 
    name = "High kidney cancer death rate (Bayesian inference)")  

plot of chunk unnamed-chunk-7

An important caveat: I am not an expert on cancer rates (and I expect some of the vocabulary I used is ill-chosen), nor do I claim that the data here are correct (from what I understand, many adjustments need to be made, but they are not detailed in BDA, which explains why the maps are slightly different). I am merely posting this as a reproducible example where the naïve frequentist and Bayesian estimators differ appreciably, because they handle sample size in different ways. I have found this example to be useful in introductory Bayesian courses, as the difference is easy to grasp for students who are new to Bayesian inference.

Black on black tooltips in Firefox with Kubuntu


I use Firefox on Kubuntu, and for a long time I had an issue with the tooltips: the characters were printed in black on a black background (a slightly different shade of black, but still very difficult to read).

I used to have a solution with Stylish, but it broke in Firefox 57 (Firefox Quantum). Here is a solution which works now, for anyone else with the same issue.

  • Navigate to ~/.mozilla/firefox/
  • Find your Firefox profile: a folder with a name like 1rsnaite.default
  • Navigate to ~/.mozilla/firefox/1rsnaite.default/chrome/ or whatnot (you might need to create the chrome/ folder)
  • Using your favourite text editor, open the file ~/.mozilla/firefox/1rsnaite.default/chrome/userChrome.css (creating it if necessary)
  • In this file, put the following code:
  • /* AGENT_SHEET */
    @namespace xul url(;
     color:#FFFFFF !important;
  • Save and restart Firefox.
  • If you have several profiles, repeat for the other profiles.

I am not an expert at these things; if this does not work for you, I won’t be able to help you any better than Google.

I used the following sites to find this solution:

Post-doctoral position in Paris: Statistical modelling for Historical Linguistics


A postdoc position is open, to come work with me and several Linguists at École Normale Supérieure, on questions related to Statistical modelling for the history of human languages and for monkey communication systems.

See the detailed announcement.

Deadline for application is 23 August.

Lecturer position in Statistics at Dauphine


An associate professor (“Maître de conférences”) position in Applied or Computational Statistics is expected to be open at Université Paris-Dauphine. The recruitment process will mostly take place during the spring, for an appointment date of 1 September 2017.

However, candidates must first go through the national “qualification”. This process should not be problematic, but is held much earlier in the year: you need to sign up by 25 October (next week!), then send some documents by December. Unfortunately, the committee cannot consider applications from candidates who do not hold the “qualification”.

If you need help with the process, feel free to contact me.

David Cox is the inaugural recipient of the International Prize in Statistics


David Cox was announced today as the inaugural recipient of the International Prize in Statistics.

My first foray into Statistics was an analysis of Cox models I did for my undegraduate thesis at ENS in 2005. I had no idea back then that David Cox was still alive and active; in my mind, he was a historical figure, on par with other great mathematicians who gave their names to objects of study — Euler, Galois, Lebesgue…

When I arrived at Oxford a few months later, I was amazed to meet him, and to see that he was still very active, both as a researcher and as the organizer of events for doctoral students.

David Cox is the perfect choice as the first person to receive this prize. I hope that the inauguration of this prize will help show the public that Statistics require complex and innovative methods, that have been tackled by some exceptional minds, and should not be seen as a “sub-science” compared to other more “noble” sciences.

MCMSki 4


I am attending the MCMSki 4 conference for the next 3 days; I guess I’ll see many of you there!

I am organizing a session on Wednesday morning on Advances in Monte Carlo motivated by applications; I’m looking forward to hearing the talks of Alexis Muir-Watt, Simon Barthelmé, Lawrence Murray and Rémi Bardenet during that session, as well as the rest of the very strong programme.

I’ll also be part of the jury for the best poster prize; there are many promising abstracts.

Corcoran medal

17/10/2013 weeks ago, I had the great honour of receiving the 2012 Corcoran memorial medal and prize for my doctoral dissertation. It is awarded by Oxford’s Department of Statistics in memory of Stephen Corcoran, a student who died in 1996 before having time to complete his DPhil. Being a Statistics prize, there is smoothing in the award process: it is awarded every two years, to a DPhil which was completed in the last four years (i.e. between October 2008 and October 2012 in my case). The ceremony was part of the Department’s 25th anniversary celebrations.

Nils Lid Hjort gave a lecture on his “confidence distributions”, a way to represent uncertainty in the non-Bayesian framework. Although he gave examples where his representation seems to work best, I wondered how this could extend to cases where the parameter is not unidimensional.

Chris Yau received the 2010 Corcoran prize and gave a short talk on applications of HMMs togenetic data; he was unlucky to have his 15-minute talk interrupted by a fire alarm (but that allowed me to wonder at how calmly efficient the British are at evacuating in such situations). Luckily, my own talk suffered no such interruption.

Peter Donnelly demonstrated once again his amazing lecturing skills, with a highly informative talk on statistical inference of the history of the UK using genetic data.

All in all, a very enjoyable afternoon, which was followed by a lovely dinner at Somerville College, with several speeches on the past, present and future of Statistics at Oxford.

Thanks again to the Corcoran committe, especially Steffen Lauritzen, for selecting me as the prize winner!

Expect some blog posts in French


This is just a warning that from now on, a small proportion of my blog posts will be in French. I’ll use French for posts which I think will appear primarily to French speakers: either posts for students of courses that I give in French, or posts on the French higher educational system which would be of little interest to people outside of France. I guess this is similar to what Arthur Charpentier does.

In particular, I’ll keep posting in English for anything related to my research topics.

PhD acknowledgement


Acknowledgement from an anonymous doctoral dissertation in the University Microforms International database:

If I had a dime for every time my wife threatened to divorce me during the past three years, I would be wealthy and not have to take a postdoctoral position which will only make me a little less poor and will keep me away from home and in the lab even more than graduate school and all because my committee read this manuscript and said that the only alternative to signing the approval to this dissertation was to give me a job mowing the grass on campus but the Physical Plant would not hire me on account of they said I was over-educated and needed to improve my dexterity skills like picking my nose while driving a tractor-mower over poor defenseless squirrels that were eating the nuts they stole from the medical students’ lunches on Tuesday afternoon following the Biochemistry quiz which they all did not pass and blamed on me because they said a tutor was supposed to come with a 30-day money-back guarantee and I am supposed to thank someone for all this?!!

(From a UMI press release, quoted in The Whole Library Handbook 2, 1995)

Source: Futility Closet via Arthur Charpentier.

Unicode in LaTeX


The way I type LaTeX has changed significantly in the past couple of months. I now type most of my math formulae in unicode, which makes the source code much more readable.

A few months ago, I might have written

$\lambda/\mu=\kappa/\nu \Rightarrow \exists \Theta,\forall i, \sum_{j\in\mathbb{N}} E[D_{i,j}]=\Theta$

to display

\lambda/\mu=\kappa/\nu \Rightarrow \exists \Theta,\forall i, \sum_{j\in\mathbb{N}} E[D_{i,j}]=\Theta.

Now, to type the same equation, my LaTeX source code looks like this:

$λ/μ=κ/ν ⇒ ∃Θ,∀i, ∑_{j∈ℕ} E[D_{i,j}]=Θ$

which produces exactly the same output. The source code is much easier to read; it is also slightly easier to type. Here is how the magic works:

  • In the preamble, add
  • A number of special characters (including all Greek letters) were already easily available to me because I use a bépo keyboard (if you are a French speaker, you should try it out); otherwise, all characters are available using any keyboard to users of a Unix-like OS thanks to this great .XCompose file. For example, to get ℕ, use the keys Compose+|+N (pretty intuitive, and faster than typing \mathbb{N}). To get ∃, use Compose+E+E; to get ∈, use Compose+i+n, and so on.
  • There are two issues with this solution: first, the unicode symbol α maps to \textalpha instead of \alpha; second, the blackboard letters map to \mathbbm instead of \mathbb. This can lead to errors, but I wrote this file which solves the issue by including in the preamble:

This is useful for LaTeX, but also for all other places where you might want to type math: thanks to this .XCompose file, typing math in a blog post, tweet or e-mail becomes easy (for example, this is the last blog post where I will use WordPress’ $latex syntax). And if there ever is a LaTeX formula that you cannot access from your keyboard, you can use a website such as which converts LaTeX source code to unicode.

I originally heard about this on Christopher Olah‘s blog.