You need and to be negatively correlated for this. I wrote the problem in the lab coffee room, leading to nice discussions (see also Xian’s blog post). Here are two solutions to the problem:
1. Let and . Then:
2. A second solution, found by my colleague Amic Frouvelle, is to sample uniformly from the black area:
I quite like that the first solution is 1d but the second is 2d.
]]>
We congratulate the authors for this excellent paper.
In “traditional” MCMC, it is standard to check that stationarity has been attained by running a small number of parallel chains, initiated at different starting points, to verify that the final distribution is independent of the initialization — even though the single versus multiple chain(s) debate erupted from the start with Gelman and Rubin (1992) versus Geyer (1992).
As noted by the authors, a bad choice of the initial distribution can lead to poor properties. In essence, this occurs and remains undetected for the current proposal because the coupling of the chains occurs long before the chain reaches stationarity. We would like to make two suggestions to alleviate this issue, and hence add a stationarity check as a byproduct of the run.
The ideal algorithm is one which gives a correct answer when it has converged, and a warning or error when it hasn’t. MCMC chains which have not yet reached stationarity (for example because they have not found all modes of a multimodal distribution) can be hard to detect. Here, this issue is more likely to be detected, since it would lead to the coupling not occurring: the coupling time is large. This is a feature, since it warns the practitioner that their kernel is ill-fitted to the target density.
]]>The cleaner data are on GitHub, as is the R Markdown source of this analysis.
library(usmap)
library(ggplot2)
d = read.csv("KidneyCancerClean.csv", skip=4)
In the data, the columns dc and dc.2 correspond (I think) to the death counts due to kidney cancer in each county of the USA, respectively in 1980–84 and 1985–89. The columns pop and pop.2 are some measure of the population in the counties. It is not clear to me what the other columns represent.
Let $n_j$ be the population of county $j$, and $y_j$ the number of kidney cancer deaths in that county between 1980 and 1989. A simple model is $y_j \sim \mathrm{Poisson}(n_j \theta_j)$, where $\theta_j$ is the unknown parameter of interest, representing the incidence of kidney cancer in that county. The maximum likelihood estimator is $\hat\theta_j = y_j / n_j$.
d$dct = d$dc + d$dc.2
d$popm = (d$pop + d$pop.2) / 2
d$thetahat = d$dct / d$popm
In particular, the original question is to understand these two maps, which show the counties in the first and last decile for kidney cancer deaths.
q = quantile(d$thetahat, c(.1, .9))
d$cancerlow = d$thetahat <= q[1]
d$cancerhigh = d$thetahat >= q[2]
plot_usmap("counties", data=d, values="cancerhigh") +
scale_fill_discrete(h.start = 200,
name = "Large rate of kidney cancer deaths")
plot_usmap("counties", data=d, values="cancerlow") +
scale_fill_discrete(h.start = 200,
name = "Low rate of kidney cancer deaths")
These maps are surprising, because the counties with the highest kidney cancer death rate, and those with the lowest, are somewhat similar: mostly counties in the middle of the map.
(Also, note that the data for Alaska are missing. You can hide Alaska on the maps by adding the parameter include = statepop$full[-2] to calls to plot_usmap.)
The reason for this pattern (as explained in BDA3) is that these are counties with a low population. Indeed, a typical value for $\theta_j$ is around $10^{-4}$. Take a county with a population of 1000. It is likely to have no kidney cancer deaths, giving $\hat\theta_j = 0$ and putting it in the first decile. But if it happens to have a single death, the estimated rate jumps to $\hat\theta_j = 10^{-3}$ (10 times the average rate), putting it in the last decile.
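The arithmetic above can be checked in a couple of lines of R; the population of 1,000 is the illustrative value from the paragraph, not a real county:

```r
# In a hypothetical county of 1,000 inhabitants, the MLE for the death
# rate jumps from 0 to 1e-3 (ten times a typical rate of 1e-4) with a
# single observed death.
small_pop <- 1000
mle_no_death  <- 0 / small_pop  # 0: lands in the first decile
mle_one_death <- 1 / small_pop  # 1e-3: lands in the last decile
```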
This is hinted at in this histogram of the $\hat\theta_j$:
ggplot(data=d, aes(x=thetahat)) +
geom_histogram(bins=30, fill="lightblue") +
labs(x="Estimated kidney cancer death rate (maximum likelihood)",
y="Number of counties") +
xlim(c(-1e-5, 5e-4))
If you have ever followed a Bayesian modelling course, you are probably screaming that this calls for a hierarchical model. I agree (and I’m pretty sure the authors of BDA do as well), but here is a more basic Bayesian approach. Take a common prior distribution for all the $\theta_j$; I’ll go for $\theta_j \sim \Gamma(\alpha, \beta)$ with $\alpha = 15$ and $\beta = 2\cdot 10^5$, which is slightly vaguer than the prior used in BDA. Obviously, you should try various values of the prior parameters to check their influence.
The prior is conjugate, so the posterior is $\theta_j \mid y_j \sim \Gamma(\alpha + y_j, \beta + n_j)$. For small counties, the posterior will be extremely close to the prior; for larger counties, the likelihood will take over.
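This shrinkage is easy to see numerically. Here is a quick sketch using the prior parameters from the code below (alpha = 15, beta = 2e5) and two made-up counties, one small and one large:

```r
alpha <- 15
beta  <- 2e5
prior_mean <- alpha / beta                  # 7.5e-5

# Hypothetical small county: 1,000 inhabitants, 1 death (MLE = 1e-3).
# The posterior mean is pulled almost all the way back to the prior mean.
post_small <- (alpha + 1) / (beta + 1e3)    # ~7.96e-5

# Hypothetical large county: 1,000,000 inhabitants, 110 deaths (MLE = 1.1e-4).
# The likelihood dominates and the posterior mean stays close to the MLE.
post_large <- (alpha + 110) / (beta + 1e6)  # ~1.04e-4
```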
It is usually a shame to use only point estimates, but here it will be sufficient: let us compute the posterior mean of $\theta_j$, which is $(\alpha + y_j)/(\beta + n_j)$. Because the prior has a strong impact on counties with low population, the histogram looks very different:
alpha = 15
beta = 2e5
d$thetabayes = (alpha + d$dct) / (beta + d$popm)
ggplot(data=d, aes(x=thetabayes)) +
geom_histogram(bins=30, fill="lightblue") +
labs(x="Estimated kidney cancer death rate (posterior mean)",
y="Number of counties") +
xlim(c(-1e-5, 5e-4))
And the maps of counties in the first and last decile are now much easier to distinguish; for instance, Florida and New England are heavily represented in the last decile. The counties represented here are mostly populated counties: these are counties for which we have reason to believe that they are on the lower or higher end for kidney cancer death rates.
qb = quantile(d$thetabayes, c(.1, .9))
d$bayeslow = d$thetabayes <= qb[1]
d$bayeshigh = d$thetabayes >= qb[2]
plot_usmap("counties", data=d, values="bayeslow") +
scale_fill_discrete(
h.start = 200,
name = "Low kidney cancer death rate (Bayesian inference)")
plot_usmap("counties", data=d, values="bayeshigh") +
scale_fill_discrete(
h.start = 200,
name = "High kidney cancer death rate (Bayesian inference)")
An important caveat: I am not an expert on cancer rates (and I expect some of the vocabulary I used is ill-chosen), nor do I claim that the data here are correct (from what I understand, many adjustments need to be made, but they are not detailed in BDA, which explains why the maps are slightly different). I am merely posting this as a reproducible example where the naïve frequentist and Bayesian estimators differ appreciably, because they handle sample size in different ways. I have found this example to be useful in introductory Bayesian courses, as the difference is easy to grasp for students who are new to Bayesian inference.
]]>I used to have a solution with Stylish, but it broke in Firefox 57 (Firefox Quantum). Here is a solution which works now, for anyone else with the same issue.
/* AGENT_SHEET */
@namespace xul url(http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul);
#btTooltip, #un-toolbar-tooltip, #tooltip, .tooltip, #aHTMLTooltip,
#urlTooltip, tooltip, #brief-tooltip, #btTooltipTextBox {
  color: #FFFFFF !important;
}
I am not an expert at these things; if this does not work for you, I won’t be able to help you any better than Google.
I used the following sites to find this solution:
]]>See the detailed announcement.
Deadline for application is 23 August.
]]>However, candidates must first go through the national “qualification”. This process should not be problematic, but is held much earlier in the year: you need to sign up by 25 October (next week!), then send some documents by December. Unfortunately, the committee cannot consider applications from candidates who do not hold the “qualification”.
If you need help with the process, feel free to contact me.
]]>My first foray into Statistics was an analysis of Cox models I did for my undergraduate thesis at ENS in 2005. I had no idea back then that David Cox was still alive and active; in my mind, he was a historical figure, on par with other great mathematicians who gave their names to objects of study — Euler, Galois, Lebesgue…
When I arrived at Oxford a few months later, I was amazed to meet him, and to see that he was still very active, both as a researcher and as the organizer of events for doctoral students.
David Cox is the perfect choice as the first recipient of this prize. I hope that its inauguration will help show the public that Statistics requires complex and innovative methods, which have been tackled by some exceptional minds, and that it should not be seen as a “sub-science” compared to other, more “noble”, sciences.
]]>I am organizing a session on Wednesday morning on Advances in Monte Carlo motivated by applications; I’m looking forward to hearing the talks of Alexis Muir-Watt, Simon Barthelmé, Lawrence Murray and Rémi Bardenet during that session, as well as the rest of the very strong programme.
I’ll also be part of the jury for the best poster prize; there are many promising abstracts.
]]>Nils Lid Hjort gave a lecture on his “confidence distributions”, a way to represent uncertainty in the non-Bayesian framework. Although he gave examples where his representation seems to work best, I wondered how this could extend to cases where the parameter is not unidimensional.
Chris Yau received the 2010 Corcoran prize and gave a short talk on applications of HMMs to genetic data; he was unlucky to have his 15-minute talk interrupted by a fire alarm (but that allowed me to wonder at how calmly efficient the British are at evacuating in such situations). Luckily, my own talk suffered no such interruption.
Peter Donnelly demonstrated once again his amazing lecturing skills, with a highly informative talk on statistical inference of the history of the UK using genetic data.
All in all, a very enjoyable afternoon, which was followed by a lovely dinner at Somerville College, with several speeches on the past, present and future of Statistics at Oxford.
Thanks again to the Corcoran committee, especially Steffen Lauritzen, for selecting me as the prize winner!
]]>In particular, I’ll keep posting in English for anything related to my research topics.
]]>