Reproducing the kidney cancer example from BDA

This is an attempt at reproducing the analysis of Section 2.7 of Bayesian Data Analysis, 3rd edition (Gelman et al.), on kidney cancer rates in the USA in the 1980s. I have done my best to clean the data from the original. Andrew wrote a blog post to “disillusion [us] about the reproducibility of textbook analysis”, in which he refers to this example. This might then be an attempt at reillusionment…

The cleaner data are on GitHub, as is the RMarkDown of this analysis.

d = read.csv("KidneyCancerClean.csv", skip=4)

In the data, the columns dc and dc.2 correspond (I think) to the death counts due to kidney cancer in each county of the USA, respectively in 1908-84 and 1985-89. The columns pop and pop.2 are some measure of the population in the counties. It is not clear to me what the other columns represent.

Simple model

Let n_j be the population on county j, and K_j the number of kidney cancer deaths in that county between 1980 and 1989. A simple model is K_j\sim Poisson(\theta_j n_j) where \theta_j is the unknown parameter of interest, representing the incidence of kidney cancer in that county. The maximum likelihood estimator is \hat\theta_j=\frac{K_j}{n_j}.

d$dct = d$dc + d$dc.2
d$popm = (d$pop + d$pop.2) / 2
d$thetahat = d$dct / d$popm

In particular, the original question is to understand these two maps, which show the counties in the first and last decile for kidney cancer deaths.

q = quantile(d$thetahat, c(.1, .9))
d$cancerlow = d$thetahat <= q[1] d$cancerhigh = d$thetahat >= q[2]
plot_usmap("counties", data=d, values="cancerhigh") +
  scale_fill_discrete(h.start = 200, 
                      name = "Large rate of kidney cancer deaths") 

plot of chunk unnamed-chunk-4

plot_usmap("counties", data=d, values="cancerlow") +
  scale_fill_discrete(h.start = 200, 
                      name = "Low rate of kidney cancer deaths") 

plot of chunk unnamed-chunk-4

These maps are suprising, because the counties with the highest kidney cancer death rate, and those with the lowest, are somewhat similar: mostly counties in the middle of the map.

(Also, note that the data for Alaska are missing. You can hide Alaska on the maps by adding the parameter include = statepop$full[-2] to calls to plot_usmap.)

The reason for this pattern (as explained in BDA3) is that these are counties with a low population. Indeed, a typical value for \hat\theta_j is around 0.0001. Take a county with a population of 1000. It is likely to have no kidney cancer deaths, giving \hat\theta_j=0 and putting it in the first decile. But if it happens to have a single death, the estimated rate jumps to \hat\theta_j=0.001 (10 times the average rate), putting it in the last decile.

This is hinted at in this histogram of the (\theta_j):

ggplot(data=d, aes(d$thetahat)) + 
  geom_histogram(bins=30, fill="lightblue") + 
  labs(x="Estimated kidney cancer death rate (maximum likelihood)", 
       y="Number of counties") +
  xlim(c(-1e-5, 5e-4))

plot of chunk unnamed-chunk-5

Bayesian approach

If you have ever followed a Bayesian modelling course, you are probably screaming that this calls for a hierarchical model. I agree (and I’m pretty sure the authors of BDA do as well), but here is a more basic Bayesian approach. Take a common \Gamma(\alpha, \beta) distribution for all the (\theta_j); I’ll go for \alpha=15 and \beta = 200\ 000, which is slightly vaguer than the prior used in BDA. Obviously, you should try various values of the prior parameters to check their influence.

The prior is conjugate, so the posterior is \theta_j|K_j \sim \Gamma(\alpha + K_j, \beta + n_j). For small counties, the posterior will be extremely close to the prior; for larger counties, the likelihood will take over.

It is usually a shame to use only point estimates, but here it will be sufficient: let us compute the posterior mean of \theta_j. Because the prior has a strong impact on counties with low population, the histogram looks very different:

alpha = 15
beta = 2e5
d$thetabayes = (alpha + d$dct) / (beta + d$pop)
ggplot(data=d, aes(d$thetabayes)) + 
  geom_histogram(bins=30, fill="lightblue") + 
  labs(x="Estimated kidney cancer death rate (posterior mean)", 
       y="Number of counties") +
  xlim(c(-1e-5, 5e-4))

plot of chunk unnamed-chunk-6

And the maps of counties in the first and last decile are now much easier to distinguish; for instance, Florida and New England are heavily represented in the last decile. The counties represented here are mostly populated counties: these are counties for which we have reason to believe that they are on the lower or higher end for kidney cancer death rates.

qb = quantile(d$thetabayes, c(.1, .9))
d$bayeslow = d$thetabayes <= qb[1] d$bayeshigh = d$thetabayes >= qb[2]
plot_usmap("counties", data=d, values="bayeslow") +
    h.start = 200, 
    name = "Low kidney cancer death rate (Bayesian inference)")  

plot of chunk unnamed-chunk-7

plot_usmap("counties", data=d, values="bayeshigh") +
    h.start = 200, 
    name = "High kidney cancer death rate (Bayesian inference)")  

plot of chunk unnamed-chunk-7

An important caveat: I am not an expert on cancer rates (and I expect some of the vocabulary I used is ill-chosen), nor do I claim that the data here are correct (from what I understand, many adjustments need to be made, but they are not detailed in BDA, which explains why the maps are slightly different). I am merely posting this as a reproducible example where the naïve frequentist and Bayesian estimators differ appreciably, because they handle sample size in different ways. I have found this example to be useful in introductory Bayesian courses, as the difference is easy to grasp for students who are new to Bayesian inference.

Visiting UCLA

I shall be spending next week (4th-9th March) at the UCLA Department of Linguistics, where I shall give five hours of talks on Statistics for Historical Linguistics: a two-hour review of the field (Thursday morning), a two-hour presentation of my own DPhil work (pdf) (Friday afternoon), and a one-hour tutorial on TraitLab (Friday afternoon?). Abstracts:

1. Statistical methods in Historical Linguistics (Thursday morning)
Recent advances in our understanding of language change, in statistical methodology, and in computational power, along with an increasing wealth of available data, have allowed significant progress in statistical modelling of language change, and quantitative methods are gaining traction in Historical Linguistics. Models have been developed for the change through time of vocabulary, morpho-syntactic and phonetic traits. I shall present a review of these models (from a statistician’s point of view), starting with Morris Swadesh’s failed attempts at glottochronology, then looking at some models developed in the last decade. In parallel, I shall provide brief insights into statistical tools such as Bayesian statistics and Markov Chain Monte Carlo, in order to show how to use these effectively for linguistic applications.
2. A phylogenetic model of language diversification (Friday afternoon)
Language diversification is a random process similar in many ways to biological evolution. We model the diversification of so-called “core” lexical data by a stochastic process on a phylogenetic tree. We initially focus on the Indo-European language family. The age of the most recent common ancestor of these languages is of particular interest and issues of dating ancient languages have been subject to controversy. We use Markov Chain Monte Carlo to estimate the tree topology, internal node ages and model parameters. Our model includes several aspects specific to language diversification, such as rate heterogeneity and the data registration process, and we show that lexical borrowing does not bias our estimates. We show the robustness of our model through extensive validation and analyse two independent data sets to estimates the age of Proto-Indo-European. We then analyse a data set of Semitic languages, and show an extension of our model to explore whether languages evolve in “punctuational bursts”. Finally, we revisit an analysis of several small data sets by Bergsland & Vogt (1962).
Joint work with Geoff Nicholls

3. Tutorial and practical: TraitLab, a package for phylogenies of linguistic and cultural traits
In this tutorial, I shall present how to use the TraitLab package, which was initially developed specifically for the modelling of core vocabulary change through time, and guide interested attendants through an analysis of a simple data set. TraitLab is a software package for simulating, fitting and analysing tree-like binary data under a stochastic Dollo model of evolution. It handles “catastrophic” rate heterogeneity and missing data. The core of the package is a Markov chain Monte Carlo (MCMC) sampling algorithm that enables the user to sample from the Bayesian joint posterior distributions for tree topologies, clade and root ages, and the trait loss and catastrophe rates for a given data set. Data can be simulated according to the fitted Dollo model or according to a number of generalized models that allow for borrowing (horizontal transfer) of traits, heterogeneity in the trait loss rate and biases in the data collection  process. Both the raw data and the output of MCMC runs can be inspected using a number of useful graphical and analytical tools  provided within the package. TraitLab is freely available and runs within the Matlab computing environment.

Attendants who wish to use TraitLab during the practical should have a computer with Matlab installed.

TraitLab was developed jointly with Geoff Nicholls and David Welch.

Contigency exigency

The SIGecom Exchanges, an engineering and economics journal on e-commerce, run a puzzle in each issue, written by Daniel Reeves (more on Dan’s awesomeness soon). The December 2011 puzzle asked to find a scheme to pay a lawyer when there is uncertainty in the fruits of their labour, while still taking into account the amount of labour (see here for the complete puzzle).

After 6 months without a solution, Dan offered a bounty for this puzzle, where the award would be a realization of a uniform draw between 0 and 500 dollars (realization given by geohashing). After some extra nerdery to choose the winner of the bounty between Arthur Breitman and myself, I ended up winning $65.14, and my solution appears in the latest issue of the SIGecom Exchanges.

This was great fun; other journals should pick up the idea!

A Million Random Digits with 100,000 Normal Deviates

Random digitsI had heard before of the 1955 book A Million Random Digits with 100,000 Normal Deviates (which is exactly what you think it is), but until Greg Ross pointed them out, I had never looked at the comments on Amazon.  They are hilarious. A short selection:

  • “Such a terrific reference work! But with so many terrific random digits, it’s a shame they didn’t sort them, to make it easier to find the one you’re looking for.”
  • “If you like this book, I highly recommend that you read it in the original binary.”
  • “While the printed version is good, I would have expected the publisher to have an audiobook version as well.”
  • “Once you get about halfway in, the rest of the story is pretty predictable.”
  • “What other author could give us such a masterpiece? Infinite monkeys typing on infinite typewriters might be able to produce Shakespeare, but they could never produce something like this!”
  • “My only regret is that there isn’t a sequel, because the author left it at a cliffhanger. ”
  • “It starts with an innocent 10097, rapidly succeded by 32533. The reader has no idea if 10097 will ever appear again, and that’s the thrilling part. I won’t spoil the story, so if you want to know whether 10097 is repeated, buy the book.”

(David Pogue has a list of other Amazon products which attract funny “reviews”.)

A Million Random Digits with 100,000 Normal DeviatesA

Ngrams: when Statistics overtook Mathematics

The Internet is abuzz (1 2 3)with Google’s Ngrams Viewer which allows to track frequency of usage of words and short phrases in “lots of books” from 1500 to 2008, in six languages.

It would be easy to spend hours playing at these data. Here is a quick look at when Statistics (rightfully) overtook Mathematics, in English and in French books:

Statistics overtook Mathematics half-a-century earlier in English than it did in French. I wonder what caused the bump in French around 1960.

Data on shared bicycles in Lyon

Pablo Jensen et al. recently posted on arXiv a short analysis of a fantastic data set: 11 million trips made with the shared bicycles Vélo’v in Lyon. For every trip, the start station, final station, and trip time, duration and distance are available.

Among other things, they show that the average Vélo’v is faster than the average car at peak time, that cyclists are faster on Wednesdays and slower during the week-ends, and that winter speeds are higher (presumably because casual – hence slower – cyclists only cycle during the summer). My guess would be that cyclists who ride their own bikes, rather than use Vélo’v, are even faster, since they are probably more used to cycling and definitely have better and lighter bikes.

Given the start and final point of a trip, they can also calculate the length of the shortest path which obeys all one-way streets, and they show that a whopping 61% of cyclists take a shortcut, which must involve cycling the wrong way or on pavements. (It goes without saying that Parisian cyclists would never do such a thing.)

This is a fantastic data set, and I cannot wait to see more analyses, or some visualizations à la Pedro M. Cruz.