Reproducing the kidney cancer example from BDA

This is an attempt at reproducing the analysis of Section 2.7 of Bayesian Data Analysis, 3rd edition (Gelman et al.), on kidney cancer rates in the USA in the 1980s. I have done my best to clean the data from the original. Andrew wrote a blog post to “disillusion [us] about the reproducibility of textbook analysis”, in which he refers to this example. This might then be an attempt at reillusionment…

The cleaner data are on GitHub, as is the RMarkDown of this analysis.

d = read.csv("KidneyCancerClean.csv", skip=4)

In the data, the columns dc and dc.2 correspond (I think) to the death counts due to kidney cancer in each county of the USA, respectively in 1908-84 and 1985-89. The columns pop and pop.2 are some measure of the population in the counties. It is not clear to me what the other columns represent.

Simple model

Let n_j be the population on county j, and K_j the number of kidney cancer deaths in that county between 1980 and 1989. A simple model is K_j\sim Poisson(\theta_j n_j) where \theta_j is the unknown parameter of interest, representing the incidence of kidney cancer in that county. The maximum likelihood estimator is \hat\theta_j=\frac{K_j}{n_j}.

d$dct = d$dc + d$dc.2
d$popm = (d$pop + d$pop.2) / 2
d$thetahat = d$dct / d$popm

In particular, the original question is to understand these two maps, which show the counties in the first and last decile for kidney cancer deaths.

q = quantile(d$thetahat, c(.1, .9))
d$cancerlow = d$thetahat <= q[1] d$cancerhigh = d$thetahat >= q[2]
plot_usmap("counties", data=d, values="cancerhigh") +
  scale_fill_discrete(h.start = 200, 
                      name = "Large rate of kidney cancer deaths") 

plot of chunk unnamed-chunk-4

plot_usmap("counties", data=d, values="cancerlow") +
  scale_fill_discrete(h.start = 200, 
                      name = "Low rate of kidney cancer deaths") 

plot of chunk unnamed-chunk-4

These maps are suprising, because the counties with the highest kidney cancer death rate, and those with the lowest, are somewhat similar: mostly counties in the middle of the map.

(Also, note that the data for Alaska are missing. You can hide Alaska on the maps by adding the parameter include = statepop$full[-2] to calls to plot_usmap.)

The reason for this pattern (as explained in BDA3) is that these are counties with a low population. Indeed, a typical value for \hat\theta_j is around 0.0001. Take a county with a population of 1000. It is likely to have no kidney cancer deaths, giving \hat\theta_j=0 and putting it in the first decile. But if it happens to have a single death, the estimated rate jumps to \hat\theta_j=0.001 (10 times the average rate), putting it in the last decile.

This is hinted at in this histogram of the (\theta_j):

ggplot(data=d, aes(d$thetahat)) + 
  geom_histogram(bins=30, fill="lightblue") + 
  labs(x="Estimated kidney cancer death rate (maximum likelihood)", 
       y="Number of counties") +
  xlim(c(-1e-5, 5e-4))

plot of chunk unnamed-chunk-5

Bayesian approach

If you have ever followed a Bayesian modelling course, you are probably screaming that this calls for a hierarchical model. I agree (and I’m pretty sure the authors of BDA do as well), but here is a more basic Bayesian approach. Take a common \Gamma(\alpha, \beta) distribution for all the (\theta_j); I’ll go for \alpha=15 and \beta = 200\ 000, which is slightly vaguer than the prior used in BDA. Obviously, you should try various values of the prior parameters to check their influence.

The prior is conjugate, so the posterior is \theta_j|K_j \sim \Gamma(\alpha + K_j, \beta + n_j). For small counties, the posterior will be extremely close to the prior; for larger counties, the likelihood will take over.

It is usually a shame to use only point estimates, but here it will be sufficient: let us compute the posterior mean of \theta_j. Because the prior has a strong impact on counties with low population, the histogram looks very different:

alpha = 15
beta = 2e5
d$thetabayes = (alpha + d$dct) / (beta + d$pop)
ggplot(data=d, aes(d$thetabayes)) + 
  geom_histogram(bins=30, fill="lightblue") + 
  labs(x="Estimated kidney cancer death rate (posterior mean)", 
       y="Number of counties") +
  xlim(c(-1e-5, 5e-4))

plot of chunk unnamed-chunk-6

And the maps of counties in the first and last decile are now much easier to distinguish; for instance, Florida and New England are heavily represented in the last decile. The counties represented here are mostly populated counties: these are counties for which we have reason to believe that they are on the lower or higher end for kidney cancer death rates.

qb = quantile(d$thetabayes, c(.1, .9))
d$bayeslow = d$thetabayes <= qb[1] d$bayeshigh = d$thetabayes >= qb[2]
plot_usmap("counties", data=d, values="bayeslow") +
    h.start = 200, 
    name = "Low kidney cancer death rate (Bayesian inference)")  

plot of chunk unnamed-chunk-7

plot_usmap("counties", data=d, values="bayeshigh") +
    h.start = 200, 
    name = "High kidney cancer death rate (Bayesian inference)")  

plot of chunk unnamed-chunk-7

An important caveat: I am not an expert on cancer rates (and I expect some of the vocabulary I used is ill-chosen), nor do I claim that the data here are correct (from what I understand, many adjustments need to be made, but they are not detailed in BDA, which explains why the maps are slightly different). I am merely posting this as a reproducible example where the naïve frequentist and Bayesian estimators differ appreciably, because they handle sample size in different ways. I have found this example to be useful in introductory Bayesian courses, as the difference is easy to grasp for students who are new to Bayesian inference.

Late homework

My Bayesian Case Studies students had to hand in their final report last week; as I expected, some handed it in after the deadline.

I know several colleagues whose official position is that if the report is handed in late, the student fails the course. This is fine if the report is very late, but it seems a bit drastic for a small delay (e.g. a few minutes!). The problem of course is to define very and small in the previous sentence. And even if you decide that the course is failed for students who are more than (say) 24 hours late, it seems a bit unfair to have such a big difference between being 23 hours 59 minutes late and 24 hours 1 minute…

This time, I am using a continuous decreasing function as penalty: I told the students that

“Your report should be sent by e-mail by 2 April 2013 at noon. In case of a late submission, your grade will be multiplied by exp(-0.001 ⋅ t³ /86400³ ) where t is the number of seconds between the deadline and your submission.”

This basically means that they can be 2 days late for free. If a student has a legitimate reason for being late, they will usually manage to fit in that window, so they do not need to explain their lateness. Being 4 days late knocks off 6% of the grade; being a week late knocks off 30%. A student more than 8.8 days late loses 50% of their grade, meaning the course is necessarily failed, even if the report is perfect.

In this case, the last student to hand in his work was about 3.5 days late, meaning the mark will be multiplied by 0.953.

I quite like this system, although I think it was a bit too generous with this exact implementation. I cannot claim to have invented it; I read about it somewhere else, with a different penalty function, but cannot find where.

Edit: Thanks to Mahendra for the original post: here, on the Messy Matters blog.

Bayesian model choice for the Poisson model

Following Arthur Charpentier‘s example, I am going to try to post occasionally on material covered during my courses, in the hope that it might be useful to my students, but also to others.

In the second practical of the Bayesian Case Studies course, we looked at Bayesian model choice and basic Monte Carlo methods, looking at data about the number of oral questions asked by French deputies (Members of Parliament). We wanted to know whether male and female deputies behave differently in this matter. If we model the number of questions Y_i asked by a deputy by a Poisson random variable and denote Z_i their gender (1 for men, 2 for women), we thus need to choose between the following two models. (See the course website for the question sheet and other material; the slides are below.)

Model 1: Y_i\sim Poisson(\lambda); \quad \lambda \sim \Gamma(a,b)

Model 2: Y_i|Z_i=k\sim Poisson(\lambda_k); \quad \lambda_1\sim \Gamma(a,b); \quad \lambda_2\sim \Gamma(a,b)

First download the data and set some variables:

 > deputes=read.csv2('')
 > attach(deputes)#Q1
 > summary(questions_orales)
 > hist(questions_orales,breaks=(0:6-.5))
 > (n=length(questions_orales))
 > (nh=sum(sexe=="H"))
 > (nf=sum(sexe=="F"))
 > (qtot=sum(questions_orales))
 > (qh=sum(questions_orales[sexe=="H"]))
 > (qf=sum(questions_orales[sexe=="F"]))

Since the Gamma prior is a conjugate prior for the Poisson model, we get a Gamma distribution as our posterior. Using a \Gamma(2,2) as prior, here is a comparison of the prior and the posteriors for the two models (in the bottom plot, \lambda_1 is in black and \lambda_2 in red.


Two observations here: First, the posterior is much more peaked than the prior, which is expected since the number of data points is large. Second, in model 2, it seems that there may be a difference between men and women, with women asking slightly less questions in Parliament.To verify this, we compute a Bayes factor to compare the two models. In this case, the Bayes factor is available in closed form, but an initial computation does not work because n is too large:

> BF_analytical=function(a,b,y1,n1,y2,n2){
+   m1=b^a/gamma(a)*gamma(a+y1+y2)/(b+n1+n2)^(a+y1+y2)
+   m2=b^(2*a)/gamma(a)^2*gamma(a+y1)/(b+n1)^(a+y1)*gamma(a+y2)/(b+n2)^(a+y2)
+   return(m2/m1)
+ }
> BF_analytical(2,2,qh,nh,qf,nf)
[1] NaN
Warning messages:
1: In BF_analytical(2, 2, qh, nh, qf, nf) :
  value out of range in 'gammafn'
2: In BF_analytical(2, 2, qh, nh, qf, nf) :
  value out of range in 'gammafn'

However, we are able to compute the Bayes factor analytically by using the log-scale instead.

> BF_analytical2=function(a,b,y1,n1,y2,n2){
+   m1=a*log(b)-lgamma(a)+lgamma(a+y1+y2)-(a+y1+y2)*log(b+n1+n2)
+   m2=2*a*log(b)-2*lgamma(a)+lgamma(a+y1)-(a+y1)*log(b+n1)+lgamma(a+y2)-(a+y2)*log(b+n2)
+   return(exp(m2-m1))
+ }
> BF_analytical2(2,2,qh,nh,qf,nf)
[1] 0.1875569
> log10(BF_analytical2(2,2,qh,nh,qf,nf))
[1] -0.726867

Using Jeffrey’s scale of evidence, this corresponds to “substantial” evidence in favour of model 1 (i.e. no difference between men and women).

Even if we had not been able to calculate the Bayes factor exactly, we could have approximated it using Monte Carlo. We first create two functions to calculate the likelihood under model 1 and model 2, and start with vanilla Monte Carlo.

> lkd.model1=function(y,n,lambda){  return(exp(-n*lambda+y*log(lambda))) }
> lkd.model2=function(y1,n1,y2,n2,lambda1,lambda2){  return(exp(-n1*lambda1+y1*log(lambda1)-n2*lambda2+y2*log(lambda2))) }

> BF_MC=function(a,b,y1,n1,y2,n2,M){
 +  lambda1=rgamma(M,a,b)
 +  m1=cumsum(lkd.model1(y1+y2,n1+n2,lambda1))/(1:M)
 + lambda2.1=rgamma(M,a,b)
 + lambda2.2=rgamma(M,a,b)
 +  m2=cumsum(lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2))/(1:M)
 +  return(m2/m1)
 + }

> M=100000
> BF_MC(2,2,qh,nh,qf,nf,M)[M]
[1] 0.1776931
> plot(100:M,BF_MC(2,2,qh,nh,qf,nf,M)[100:M], type="l")

The last line creates a plot of how our Monte Carlo estimate of the Bayes factor changes as the number of iterations M increases. It is useful to check convergence.vanilla

Here, it is clear that the method converges after a few tens of thousands iterations. Our estimate for M=10^5 is close to the true value (two decimal places precision). Next, we try the harmonic mean method:

> BF_HM=function(a,b,y1,n1,y2,n2,M){
+   lambda1=rgamma(M,a+y1+y2,b+n1+n2)
+   m1=(1:M)/(cumsum(1/lkd.model1(y1+y2,n1+n2,lambda1)))
 +  lambda2.1=rgamma(M,a+y1,b+n1)
 +  lambda2.2=rgamma(M,a+y2,b+n2)
 +  m2=(1:M)/cumsum(1/lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2))
+   return(m2/m1)  
+ }

> BF_HM(2,2,qh,nh,qf,nf,M)[M]
[1] 0.2451013
> plot(100:M,BF_HM(2,2,qh,nh,qf,nf,M)[100:M], type="l")

We observe that this does not appear to converge. It is known that in general, the harmonic mean method is unreliable.

We finally move on to importance sampling. In importance sampling, we want to estimate E[h(X)] where X\sim f by using simulations from an alternate density g. This will be most efficient when g is close to $h\times f$. In this case, we therefore need g to be similar to the posterior distribution, so we take a normal density with mean and variance the same as the posterior distribution:

> BF_IS=function(a,b,y1,n1,y2,n2,M){
+  mean1=(a+y1+y2)/(b+n1+n2)
+  mean2.1=(a+y1)/(b+n1)
+  mean2.2=(a+y2)/(b+n2)

+  sigma1=sqrt((a+y1+y2)/(b+n1+n2)^2)
+  sigma2.1=sqrt((a+y1)/(b+n1)^2)
+  sigma2.2=sqrt((a+y2)/(b+n2)^2)

+  lambda1=rnorm(M,mean1,sigma1)
+  m1=cumsum( lkd.model1(y1+y2,n1+n2,lambda1) * dgamma(lambda1,a, b) / dnorm(lambda1,mean1,sigma1))/(1:M)

+  lambda2.1=rnorm(M,mean2.1, sigma2.1)
+  lambda2.2=rnorm(M,mean2.2, sigma2.2)
+  m2=cumsum( lkd.model2(y1,n1,y2,n2,lambda2.1,lambda2.2) * dgamma(lambda2.1,a, b) * dgamma(lambda2.2,a,b) / 
+    (dnorm(lambda2.1,mean2.1,sigma2.1)*dnorm(lambda2.2,mean2.2,sigma2.2))) / (1:M)

+  return(m2/m1)
+ }

> BF_IS(2,2,qh,nh,qf,nf,M)[M]
[1] 0.1875875
> plot(100:M,BF_IS(2,2,qh,nh,qf,nf,M)[100:M], type="l")

ISThis method converges. Note that the y-scale is much tighter than for vanilla Monte Carlo: convergence is in fact much faster, and we get a more precise estimate than by using vanilla Monte Carlo. This is more obvious if we look at both methods on the same graph.

> plot(100:M,BF_MC(2,2,qh,nh,qf,nf,M)[100:M], type="l") 
> lines(100:M,BF_IS(2,2,qh,nh,qf,nf,M)[100:M], col="red")

It is much more visible now that importance sampling is a clear winner.

With this computation of the Bayes factor, we choose model 1: a single parameter can be used to model the behaviour of men and women.