Right after writing our comments on nested sampling and sending them to the author, we received this email from John Skilling. While the overall bullying tone is fairly annoying, it can be ignored; a few points, however, deserved attention. In particular, I am still uncertain about the overall potential of the method, i.e., whether sampling within a more and more limited region of higher and higher likelihood can bring an improvement over regular sampling techniques. (In the paper, there is no sign of a measure of this improvement.) There is also the fact that we won't be granted the luxury of a reply to the reply. X
Hi John, I know that this praise is late in coming, but your nested sampling algorithm is great!!! I wrote up a Matlab version based on your C code and it works so nicely. I was wondering how this can best be parallelized....
Makes me feel really sorry to be neither bright nor young...! Sounds too much like a commercial, though!!!
I did try the method and wrote a small R program on a toy example to this effect. As N goes to infinity, it indeed converges to the true value, for the same reason Riemann sums converge to integrals. So, overall, I am indeed less confident that it is a valuable method for approximating marginals, because it suffers from the same drawbacks as numerical integration, including the curse of dimensionality.
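The R program itself is not reproduced here, but the toy experiment can be sketched in Python as follows. The prior/likelihood pair (Uniform(0,1) prior, L(θ) = exp(−θ)) is an illustrative choice, not the paper's own example, picked because the constrained prior can then be simulated exactly:

```python
import math
import random

def nested_sampling(n_live=100, n_iter=1000, seed=0):
    """Toy nested sampling for Z = ∫₀¹ exp(-θ) dθ = 1 - 1/e.
    Prior: Uniform(0,1); likelihood L(θ) = exp(-θ) is monotone in θ,
    so the prior restricted to {L(θ) > L_min} is directly simulable."""
    rng = random.Random(seed)
    theta = [rng.random() for _ in range(n_live)]
    Z, X_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        # worst live point = largest theta (smallest likelihood)
        j = max(range(n_live), key=lambda k: theta[k])
        L_min = math.exp(-theta[j])
        X_i = math.exp(-i / n_live)       # deterministic shrinkage approximation
        Z += (X_prev - X_i) * L_min       # step-function (rectangle) weight
        X_prev = X_i
        # replace the worst point by a draw from the prior restricted
        # to L(θ) > L_min, i.e. θ < theta[j]
        theta[j] = rng.random() * theta[j]
    # remaining live points fill in the final sliver of prior mass
    Z += X_prev * sum(math.exp(-t) for t in theta) / n_live
    return Z

print(nested_sampling())  # ≈ 1 - 1/e ≈ 0.632
```

As in the discussion, the estimate does converge to the true value, here 1 − 1/e, up to Monte Carlo noise.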
Overlooking the condescending tone of the sentence, I still do not see how one could do anything simpler than using a trapezoidal or step-function approximation to an integral, as given at the top of page 2 of our discussion. See also Figure 3 in the paper. The nested sampling method sums up as the use of a step function with points generated from the prior restricted to higher and higher regions of the likelihood.
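For reference, the step-function approximation in question can be written as follows, with X denoting prior mass within the likelihood contour and L_i the i-th recorded likelihood value (the exact indexing conventions are a guess from context, not copied from the paper):

$$
Z \;=\; \int_0^1 L(X)\,\mathrm{d}X \;\approx\; \sum_{i=1}^{m} \left(X_{i-1}-X_i\right) L_i,
\qquad X_0 = 1,
$$

the trapezoidal variant simply replacing $L_i$ by $(L_{i-1}+L_i)/2$.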
Difficult to argue against the fact that using the same notation for two similar but different objects in the same paragraph is not the same thing as using twice the same symbol in a computer program. Even more difficult to see that as trendy...
Infinite marginals are obviously of no interest from a Bayesian point of view, but, for an infinite likelihood with a finite marginal, I wonder about the convergence of the algorithm. This is not obvious for the trapezoidal approximation.
Either one needs to simulate directly from the prior under the likelihood constraint, and this implies likelihood-weighting, or one uses a Metropolis step, and then it is exactly Metropolis-Hastings.
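The Metropolis alternative, used when the constrained prior cannot be simulated directly, can be sketched as below. All names, the 1-D toy target (Uniform(0,1) prior, L(θ) = exp(−θ)) and the step size are illustrative choices, not the paper's:

```python
import math
import random

def constrained_metropolis(theta0, L_min, n_steps=20, step=0.1, seed=1):
    """Replace a live point via random-walk Metropolis moves under the hard
    likelihood constraint L(θ) > L_min. The prior is flat on (0,1), so the
    Metropolis acceptance ratio equals 1 whenever the proposal satisfies
    both the support and the likelihood constraint."""
    rng = random.Random(seed)
    L = lambda t: math.exp(-t)
    theta = theta0
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step)
        # accept iff the proposal stays in the prior support AND above L_min;
        # otherwise the chain stays put (a rejection)
        if 0.0 <= prop <= 1.0 and L(prop) > L_min:
            theta = prop
    return theta
```

Starting from any point already satisfying the constraint, the chain never leaves the constrained region, which is exactly the hard-wall Metropolis move described above.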
The central motivation for the method in the paper is the approximation of the marginal (evidence), which in turn is mostly useful for model comparison. So it would make sense to compare nested sampling with tools that have been around for quite a while for computing Bayes factors. Overall, I am a bit wary of giving a specific meaning to the evidence per se: it is only ratios of marginals that matter. (The revision of the paper stresses this very point in its introduction.) Reversible jump is designed for that purpose, visiting each model in proportion to its evidence, so I do not understand why this is not relevant... The dismissal of its efficiency when models are not sufficiently similar (meaning?) is rather cavalier, given ten years of evidence to the contrary in the literature.
I definitely do not wish to fight ownership with Newton and Leibniz. Simply, using a mix of simulation and Riemann integration has been around for quite a while in the literature, as for instance in Yakowitz et al. (1978) or Philippe (1997). The main interest of these mixed methods is that the error variance decreases faster than with usual Monte Carlo methods; the smoother the integrand, the faster the convergence. (Hence the reference to our book, Section 4.3.) Another point worth mentioning is that convergence issues are rather speedily evacuated from the paper. Again, when running the above R program, the approximation does converge to the true value, at the usual O(1/sqrt(N)) Monte Carlo rate, as shown in the discussion of the paper by Raftery et al.
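The variance advantage of such mixed simulation/Riemann schemes can be illustrated as follows. This is only a sketch in the spirit of the Riemann estimators cited above, not a reproduction of Yakowitz et al.'s or Philippe's exact constructions; the integrand and sample sizes are arbitrary choices:

```python
import math
import random

def mc_estimate(f, n, rng):
    """Plain Monte Carlo: average of f at n uniform draws on (0,1)."""
    return sum(f(rng.random()) for _ in range(n)) / n

def riemann_mc_estimate(f, n, rng):
    """Mixed Riemann/Monte Carlo: sort the n uniform draws and weight each
    f-value by the spacing to the next order statistic (a random Riemann sum
    over the partition 0 < u_(1) < ... < u_(n) < 1)."""
    u = [0.0] + sorted(rng.random() for _ in range(n)) + [1.0]
    return sum((u[i + 1] - u[i]) * f(u[i]) for i in range(n + 1))

f = lambda x: math.exp(-x)        # smooth integrand; true integral is 1 - 1/e
exact = 1.0 - math.exp(-1.0)
rng = random.Random(42)
n, reps = 200, 200
mse_mc = sum((mc_estimate(f, n, rng) - exact) ** 2 for _ in range(reps)) / reps
mse_rm = sum((riemann_mc_estimate(f, n, rng) - exact) ** 2 for _ in range(reps)) / reps
print(mse_rm < mse_mc)
```

For a smooth integrand like this one, the Riemann version's error decreases at a faster rate than the O(1/sqrt(n)) of plain Monte Carlo, which is the point made in the paragraph above.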
This is still a central issue in simulation for Bayesian statistics: simulating from the prior is not efficient when the likelihood is informative, because this induces a waste of simulations in low-likelihood regions. And yes, there exist optimality results within the Monte Carlo literature (as for instance the choice of importance functions that minimise the variance or some entropy measure).
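The waste incurred by prior sampling under an informative likelihood is easy to exhibit numerically. In this sketch the Gaussian prior/likelihood pair and the proposal are illustrative choices (the proposal is matched by eye to the likelihood bulk, not variance-optimal):

```python
import math
import random

def norm_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# Toy evidence Z = ∫ L(θ) π(θ) dθ with prior π = N(0,1) and a sharp
# likelihood L(θ) = N(θ; 3, 0.1): prior draws almost never hit the bulk of L.
prior = lambda t: norm_pdf(t, 0.0, 1.0)
lik = lambda t: norm_pdf(t, 3.0, 0.1)
Z_true = norm_pdf(3.0, 0.0, math.sqrt(1.0 + 0.01))  # Gaussian convolution, exact

rng = random.Random(7)
n = 2000
# (a) sampling from the prior: Z estimated as the average likelihood
prior_draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
Z_prior = sum(lik(t) for t in prior_draws) / n
# (b) importance sampling with proposal q = N(3, 0.2) covering the
# likelihood bulk; weights are (prior x likelihood) / proposal
q = lambda t: norm_pdf(t, 3.0, 0.2)
q_draws = [rng.gauss(3.0, 0.2) for _ in range(n)]
Z_is = sum(lik(t) * prior(t) / q(t) for t in q_draws) / n

print(Z_true, Z_prior, Z_is)
```

With the same budget of 2000 draws, the importance-sampling estimate is within a few percent of the exact value, while the prior-based estimate rests on the handful of prior draws that fall near θ = 3.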
This is a usual confusion about the superiority of proper vague priors over improper priors. Vague priors are no help at all in testing and model-choice setups, as shown by the Lindley-Jeffreys paradox. Having a vague idea of the order of magnitude is not enough to provide a well-defined Bayes factor, unfortunately. See Berger (2000, vignette in JASA) for more discussion of this issue.
It is most surprising to advance ignorance as an argument! Slice sampling is related to nested sampling, and this paper of Mira, Moeller and Roberts is of course relevant because it exhibits the proper order associated with slice sampling, which is exactly the likelihood ordering. The authors thus build a very generic slice sampler with monotonicity properties, but simulating from the slices is not always feasible.
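For readers unfamiliar with the connection, here is a minimal slice sampler on a toy target where the slice is available in closed form (the Exp(1) target and all tuning choices are illustrative; real applications need a stepping-out or shrinkage scheme when the slice cannot be inverted):

```python
import math
import random

def slice_sampler(n, seed=3):
    """Minimal slice sampler for the Exp(1) density f(x) = exp(-x), x > 0.
    Auxiliary step: y | x ~ Uniform(0, f(x)); slice step: x | y uniform on
    the slice {x > 0 : f(x) > y} = (0, -log y), available in closed form."""
    rng = random.Random(seed)
    x, out = 1.0, []
    for _ in range(n):
        y = rng.random() * math.exp(-x)      # height under the density
        x = rng.random() * (-math.log(y))    # uniform draw on the slice
        out.append(x)
    return out

xs = slice_sampler(20000)
print(sum(xs) / len(xs))  # ≈ 1, the Exp(1) mean
```

The slices here are exactly the level sets {x : f(x) > y}, i.e. the likelihood ordering mentioned above.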
Actually, using a single step of MCMC is just as valid as using T=1000 or T=1,000,000 iterations in this setting.
It really depends on how one defines efficiency: if all of the N points sit at rather low likelihood values (which may clearly happen, given that one is simulating from the prior distribution), then using these N points does not quickly improve the approximation of the marginal. I agree, though, that it is not clear why simulating N new points would improve convergence, if those are still simulated from the truncated prior distribution.
I think that our integral representation of Z shows that we understand that the Xi's are attached to the Li's rather than to a value of theta, and that sampling theta's rather than directly the L's is a matter of convenience. I also agree that simulating the Xi's uniformly under the constraint is similar to simulating the theta's from pi under the corresponding constraint. An interesting point: simulating a new Xi conditional on Li being larger than the previous L produces a value with the same distribution as the remaining Xj's.
This again is a surprising argument, given that algorithms are based on frequentist principles like the Law of Large Numbers! The name of Rao-Blackwellisation may be poorly chosen, since it relates to a frequentist domination result, but the worth of algorithms is assessed in terms of speed of convergence and of variance reduction. Integrating out part of the randomness reduces the variance without damaging the algorithm in any objective respect. In addition, those high Bayesian principles are soon forgotten in the derivation of the uncertainty on log(Z) in Section 2.4, which sounds rather frequentist, to say the least.
We completely agree that the expectation of Xi is not exp(-i/N). The approximation is still based on an expectation, as in formula (18). Note that the expectation of Xi is known, since the ti's are [conditionally] Beta(N,1) distributed, as given in formula (17) of the paper; therefore, the [marginal] expectation of Xi is (N/(N+1))^i and this could be used as well. Note that formula (26) seems to be wrong, since exp(-i/N) is an approximation of Xi, not of wi.
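This expectation is easy to check by simulation, writing Xi as a product of independent Beta(N,1) shrinkage factors (a sketch; N, i and the number of replications are arbitrary choices):

```python
import math
import random

def shrinkage_check(N=20, i=10, reps=20000, seed=5):
    """Monte Carlo check that E[X_i] = (N/(N+1))^i when X_i = t_1 ... t_i
    with independent t_j ~ Beta(N,1), versus the plug-in value exp(-i/N).
    A Beta(N,1) variate is generated by inverse cdf as U^(1/N)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = 1.0
        for _ in range(i):
            x *= rng.random() ** (1.0 / N)
        total += x
    return total / reps, (N / (N + 1)) ** i, math.exp(-i / N)

print(shrinkage_check())
```

With N=20 and i=10, the exact expectation (20/21)^10 ≈ 0.614 differs visibly from the plug-in exp(-1/2) ≈ 0.607, and the Monte Carlo average sides with the former.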
As mentioned above, Yakowitz et al. (1978) established error rates of order at least O(1/N), and faster for smoother functions L.