Chapter 12: Ignorance priors and transformation groups
- p. 375, equation (12.7): Insert a minus sign in front of the integral.
- p. 378, line 4: ``Harr'' should be ``Haar.''
- p. 378, equation (12.19): ``
'' should be
- p. 381, second half of page: ``lies in the equations ,
'' should be ``lies in the equations ,
- p. 382, equation (12.37): The right-hand-side of the equation is wrong; it
should be ``
- p. 384, equation (12.48): the denominator of the left-hand side should
- p. 385, equation (12.51): ``'' should be ``''.
- p. 386, fourth full paragraph: ``Kendell'' should be ``Kendall.''
- p. 394, third full paragraph, second line: ``James Clark Maxwell''
should be ``James Clerk Maxwell.''
Chapters 11 and 12 I found quite exciting and useful, as construction of
reasonable priors is a subject that seems to get short shrift in most books on
Bayesian methods, and the notion of an objective prior, that encodes exactly
the information one has at hand and nothing more, is quite appealing.
However, I disagree with Jaynes's construction in 12.4.3 of an
ignorance prior for an ``unknown probability for success'' $\theta$, which he
concludes should be an improper prior proportional to
$\theta^{-1}(1-\theta)^{-1}$
over the interval $(0, 1)$. (This appears to have been first suggested as an
ignorance prior by J. Haldane in 1932.) I will argue that Jaynes's rules
point to the uniform distribution over $[0, 1]$ as the appropriate ignorance
prior. I'll begin by critiquing specific passages in 12.4.3.
- p. 383, second full paragraph:
For example, in a chemical laboratory we
find a jar containing an unknown and unlabeled compound. We are at first
completely ignorant as to whether a small sample of this compound will
dissolve in water or not. But, having observed that one small sample does
dissolve, we infer immediately that all samples of this compound are water
soluble, and although this conclusion does not carry quite the force of
deductive proof, we feel strongly that the inference was justified. Yet the
Bayes-Laplace rule [uniform prior] leads to a negligibly small probability for
this being true, and yields only a probability of $2/3$ that the next sample
tested will dissolve.
Critique: This example is irrelevant for evaluating proposed ignorance
priors over $\theta$, as this is a situation where we have quite substantial
prior information. We know that the relevant information in determining
whether a sample of some solid compound will dissolve in water is
- the chemical identity of the sample,
- the quantity of sample,
- the quantity of water, and
- the temperature.
All of these are factors we can easily control, and so if we repeat the
experiment with the same unknown compound, keeping the other factors the same,
we strongly expect to get the same result. That is, this prior information
tells us that $\theta$ should be (nearly?) 0 or (nearly?) 1, given any particular
values for the above four factors.
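To make the numbers in Jaynes's example concrete, here is a minimal sketch of the predictive probability under the two competing priors. It assumes the standard conjugate result that a Beta(a, b) prior gives predictive probability (r + a)/(n + a + b) after r successes in n trials. The function name is hypothetical, and the Haldane case is treated as the limit a, b -> 0 of proper Beta priors, since the improper prior itself does not yield a normalizable posterior after a single success.

```python
from fractions import Fraction

def predictive_success(r, n, a, b):
    """P(next trial succeeds | r successes in n trials) under a Beta(a, b) prior.

    a = b = 1 is the uniform (Bayes-Laplace) prior.
    a = b = 0 stands in for the improper Haldane prior, read as the
    limit of proper Beta(a, b) priors as a and b go to zero.
    """
    return Fraction(r + a, n + a + b)

# One sample tested, and it dissolved: r = 1, n = 1.
print(predictive_success(1, 1, 1, 1))  # uniform prior: 2/3
print(predictive_success(1, 1, 0, 0))  # Haldane limit: 1
```

The uniform prior reproduces the 2/3 that Jaynes objects to, while the Haldane limit jumps to certainty after a single observation. Which behavior is appropriate depends on the prior information discussed in the critique above.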
- p. 383, third full paragraph and onward:
[...] There is a conceptual
difficulty here, since $f(\theta)\,d\theta$ is a `probability for a
probability'. However, it can be removed by carrying the notion of a split
personality to extremes; instead of supposing that $f(\theta)$ describes
the state of knowledge of any one person, imagine that we have a large
population of individuals who hold varying beliefs about the probability for
success, and that $f(\theta)$ describes the distribution of their beliefs.
Critique: This artifice is unnecessary. Following Jaynes's advice to
start with the finite and take the infinite only as a well-defined limit, we
can begin by considering a case of $N$ trials, and define
$\theta$ to be the fraction of those $N$ trials that are successes. Our
distribution for $\theta$ is then a probability of a
frequency, not a probability of a probability, and there is no conceptual
difficulty. We then take the limit as $N \to \infty$.
Is it possible that, although each individual holds a definite
opinion, the population as a whole is completely ignorant of $\theta$? What
distribution $f(\theta)$ describes a population in a state of total confusion
on the issue? [...]
Now suppose that, before the experiment is performed, one more definite piece
of evidence E is given simultaneously to all of them. Each individual will
change his state of belief according to Bayes' theorem; Mr. $X$, who had
previously held the probability for success to be
$$ \theta = p(S \mid X) $$
will change it to
$$ \theta' = p(S \mid E X) = \frac{p(E \mid S X)\, p(S \mid X)}{p(E \mid S X)\, p(S \mid X) + p(E \mid F X)\, p(F \mid X)} \qquad (12.43) $$
[...] This new evidence thus generates a mapping of the parameter space
$0 \le \theta \le 1$ onto itself, given from (12.43) by
$$ \theta' = \frac{a\theta}{1 - \theta + a\theta} \qquad (12.44) $$
[...] If the population as a whole can learn nothing from this new evidence,
then it would seem reasonable to say that the population has been reduced, by
conflicting propaganda, to a state of total confusion on the issue. We
therefore define the state of `total confusion' or `complete ignorance' by the
condition that, after the transformation (12.44), the number of individuals
who hold beliefs in any given range $\theta_1 < \theta < \theta_2$
is the same as before.
Critique: I find this characterization of complete ignorance to be
quite puzzling. I see no reason why invariance of the population's beliefs
under such an update should correspond to any
notion of complete ignorance. Furthermore, there are certain possible new
pieces of evidence $E$ that must
change the overall distribution of beliefs: for example, $E$ might be
frequency data for the first $n$ trials, or even a definite statement about the
value of $\theta$ itself. There is also some ambiguity here. Inference about $\theta$
only makes sense in the context of repeated trials; so, does $S$ above
really mean $S_i$ (success at $i$-th trial) for some arbitrary $i$? If so, we
must also assume that is carefully chosen so that
has no dependence on (unobserved values of) , so that
remains independent of .
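As a numerical illustration of what Jaynes's invariance condition demands, the sketch below assumes the update map has the form $\theta' = a\theta/(1 - \theta + a\theta)$, which is what Bayes' theorem gives when the evidence has likelihood ratio $a$ and, as Jaynes's construction supposes, the same $a$ applies to everyone. It compares the prior mass of an interval with the mass of its image under the map, for the $1/(\theta(1-\theta))$ density and for the uniform density. The function names and the particular numbers are illustrative choices, not taken from the book.

```python
import math

def update(theta, a):
    """Bayes update of a belief theta by evidence with likelihood ratio a."""
    return a * theta / (1 - theta + a * theta)

def haldane_mass(lo, hi):
    """Integral of 1/(theta * (1 - theta)) over [lo, hi], a difference of logits."""
    logit = lambda t: math.log(t / (1 - t))
    return logit(hi) - logit(lo)

def uniform_mass(lo, hi):
    """Integral of the uniform density over [lo, hi]."""
    return hi - lo

lo, hi, a = 0.2, 0.5, 3.0                      # arbitrary interval and likelihood ratio
new_lo, new_hi = update(lo, a), update(hi, a)  # image of the interval under the map

print(haldane_mass(lo, hi), haldane_mass(new_lo, new_hi))  # equal up to rounding
print(uniform_mass(lo, hi), uniform_mass(new_lo, new_hi))  # differ
```

The map adds $\log a$ to the log-odds of every belief, so a density that is flat in log-odds is carried into itself. This shows only that the $1/(\theta(1-\theta))$ form solves the stated functional equation. It does not address whether that equation is a sensible definition of ignorance.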
- p. 384, sentence following equation (12.43):
This new evidence thus
generates a mapping of the parameter space $0 \le \theta \le 1$ onto itself,
given from (12.43) by
$$ \theta' = \frac{a\theta}{1 - \theta + a\theta} $$
Critique: It seems to me that Jaynes is here committing an error that he
warns against elsewhere: erroneously identifying distinct states of
information as the same. In particular, the coefficient $a$ in (12.44), which is
the likelihood ratio $p(E \mid S X)/p(E \mid F X)$, is a function of the particular
individual $X$, since we are conditioning on a different state of information
for each individual. In my view, this destroys the entire construction, as we no
longer have a single transformation (12.44) that applies to the whole population.
Here is my alternate proposal for an ignorance prior, following
Jaynes's own advice. We begin with section 12.3, ``Continuous distributions,''
wherein Jaynes writes,
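In the discrete entropy expression
$$ H_I^d = -\sum_{i=1}^{n} p_i \log p_i $$
we suppose that the discrete points $x_i$,
$i = 1, 2, \ldots, n$, become more and
more numerous, in such a way that, in the limit $n \to \infty$,
$$ \lim_{n \to \infty} \frac{1}{n}\,(\text{number of points in } a < x < b) = \int_a^b m(x)\,dx. $$
If this passage to the limit is sufficiently well-behaved, [...] [t]he
discrete probability distribution $p_i$ will go over into a continuous
probability [...] The `invariant measure' function, $m(x)$, is
proportional to the limiting density of discrete points.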
Then at the beginning of p. 377, Jaynes writes,
Except for a constant factor, the measure $m(x)$ is also the prior
distribution describing `complete ignorance' of $x$.
On p. 376, last complete paragraph, Jaynes motivates the introduction of
invariance transformations by writing,
If the parameter space is not the result of any obvious limiting process,
what determines the proper measure $m(x)$?
thus strongly implying that if there is an obvious limiting process, this is
the preferred method for constructing $m(x)$.
But in this problem there is, in fact, an obvious limiting process: the one
mentioned at the beginning of this commentary. That is, we start by
considering a finite case of $N$ trials, define
$\theta = \frac{1}{N}\sum_{i=1}^{N} x_i$, and define
$P(x_1, \ldots, x_n \mid \theta)$ for $n \le N$
as in section 3.1 (sampling without replacement). ($x_i$ is 1 if the $i$-th
trial is a success, and 0 otherwise.) Since $\theta$ has a finite set of
possible values, and ``ignorance'' means we are placing no constraints on the
distribution over $\theta$, Chapter 11 tells us that we should use the
maximum-entropy distribution for $\theta$, i.e., the uniform distribution over
$\{0, 1/N, 2/N, \ldots, 1\}$. In the limit as $N \to \infty$
while $n$ remains fixed we get
$$ P(x_1, \ldots, x_n \mid \theta) = \theta^{r} (1 - \theta)^{n - r} $$
where $r = \sum_{i=1}^{n} x_i$, and the prior over $\theta$ turns into a uniform pdf
over $[0, 1]$.
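The finite construction above can be checked directly with exact arithmetic. The following is a minimal sketch, assuming $N$ trials of which exactly $R$ are successes (so $\theta = R/N$), a uniform prior over $R \in \{0, \ldots, N\}$, and sampling without replacement for the observed outcomes. The function names are hypothetical.

```python
from fractions import Fraction

def seq_prob(xs, R, N):
    """P(xs | theta = R/N): draw the 0/1 outcomes xs in order, without
    replacement, from N trials of which exactly R are successes.
    Requires len(xs) <= N."""
    p = Fraction(1)
    succ, total = R, N
    for x in xs:
        fail = total - succ
        p *= Fraction(succ if x else fail, total)
        if p == 0:          # sequence impossible for this R
            return p
        succ -= x
        total -= 1
    return p

def predictive(xs, N):
    """P(next trial succeeds | xs) under a uniform prior over R = 0..N.
    The common prior weight 1/(N+1) cancels in the ratio.
    Requires len(xs) + 1 <= N."""
    num = sum(seq_prob(xs + [1], R, N) for R in range(N + 1))
    den = sum(seq_prob(xs, R, N) for R in range(N + 1))
    return num / den

xs = [1, 0, 1]  # r = 2 successes in n = 3 observed trials
for N in (4, 10, 200):
    print(N, predictive(xs, N))  # 3/5 for every N
```

With this prior the predictive probability equals Laplace's $(r+1)/(n+2)$ for every finite $N > n$, not just in the limit, so the passage to $N \to \infty$ is as well-behaved as one could ask.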
As a final note, I have some misgivings about even this solution. The problem
is that we are not, in fact, completely ignorant about $\theta$. We know of
some additional structure to the problem -- that is, we know that $\theta$
(in the finite case) is derived from the results of the $N$ trials via
$\theta = \frac{1}{N}\sum_{i=1}^{N} x_i$. One
could argue that we should therefore derive the prior over $\theta$ from the
ignorance prior over
$x_1, \ldots, x_N$. As Jaynes discusses in Chapter 3 (?),
in the limit of $N \to \infty$
this amounts to a prior that gives
probability 1 to $\theta = 1/2$, and we find that we are incapable of learning--
Thus it seems that any nondegenerate prior for $\theta$ is, in some sense,
informative. At the very least, it tells us that the various trials are
subject to some common logical influence.
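The degenerate behavior described in this final note can be seen by brute force for small $N$. The sketch below assumes the ``ignorance prior over the trials'' means equal probability for each of the $2^N$ possible outcome sequences. The function names are hypothetical, and the enumeration is only practical for small $N$.

```python
from fractions import Fraction
from itertools import product

def predictive_uniform_sequences(observed, N):
    """P(next trial succeeds | observed) when all 2**N outcome sequences
    are equally probable a priori. Requires len(observed) < N."""
    n = len(observed)
    consistent = [s for s in product((0, 1), repeat=N)
                  if s[:n] == tuple(observed)]
    hits = sum(s[n] for s in consistent)
    return Fraction(hits, len(consistent))

def theta_variance(N):
    """Prior variance of theta = (number of successes)/N under the same prior."""
    thetas = [Fraction(sum(s), N) for s in product((0, 1), repeat=N)]
    mean = sum(thetas) / len(thetas)
    return sum((t - mean) ** 2 for t in thetas) / len(thetas)

print(predictive_uniform_sequences([1, 1, 1, 1], 8))  # 1/2
print(predictive_uniform_sequences([0, 1, 0], 8))     # 1/2
print(theta_variance(8))                              # 1/32
```

Whatever has been observed, the next trial stays at probability 1/2. The prior variance of $\theta$ is $1/(4N)$, so the implied prior collapses onto $\theta = 1/2$ as $N$ grows.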
Commentary on 12.4.3: Other approaches
Arnold Zellner contributed the following references to other priors that have
been suggested for the binomial parameter (probability of success):
- Theory of Probability (1967), by Sir Harold Jeffreys, pp. 123-125,
contains a discussion of various priors for the binomial parameter. He
believes that the uniform prior is too flat at the end points and that the
goes up too much at the end
points, 0 and 1, placing too much probability mass in the vicinity of 0 and
1. Therefore he lumps some probability up at zero and some at 1 with the rest
spread uniformly between 0 and 1.
- Bayesian Analysis in Econometrics and Statistics, by Arnold
Zellner, pp. 117-118, discusses a ``maximal data information'' prior
$\pi(\theta) \propto \theta^{\theta} (1 - \theta)^{1 - \theta}$. This is a
bowl-shaped density that is proper and whose value at 0 and 1 is twice its
value at 0.5. Elsewhere in the same book he discusses the derivation of
``maximal data information priors'' in more detail.
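The stated properties of this prior are easy to check numerically. The sketch below assumes the unnormalized density has the form $\theta^{\theta}(1-\theta)^{1-\theta}$, which is consistent with the description above (proper, bowl-shaped, endpoint value twice the midpoint value). The helper names and the midpoint-rule integrator are illustrative choices.

```python
def mdip_unnormalized(theta):
    """Assumed unnormalized density theta**theta * (1 - theta)**(1 - theta).
    Python evaluates 0.0 ** 0.0 as 1.0, matching the limit of t**t as t -> 0."""
    return theta ** theta * (1 - theta) ** (1 - theta)

def midpoint_integral(f, lo, hi, steps=10_000):
    """Simple midpoint-rule approximation to the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

Z = midpoint_integral(mdip_unnormalized, 0.0, 1.0)
ratio = mdip_unnormalized(0.0) / mdip_unnormalized(0.5)

print(Z)      # finite normalizing integral, so the prior is proper
print(ratio)  # endpoint value over midpoint value, 2 up to rounding
```

The density is bounded between 1/2 and 1 on the whole interval, which is why it is proper even though it rises toward both endpoints.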
Zellner's maximal data information prior is defined as that prior which
maximizes a quantity defined as the prior average information in the data
pdf, minus the information in the prior pdf. The ``information'' here is
intended to be the negative of the entropy.
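As a worked version of this definition, the following sketch writes out the criterion and its maximizer for a single Bernoulli observation. The symbols $G$, $I(\theta)$, and $\lambda$ are notation chosen here for illustration, and the one-observation data pdf is an assumption.

```latex
% Criterion: prior average information in the data pdf,
% minus the information in the prior pdf.
G[\pi] = \int_0^1 I(\theta)\,\pi(\theta)\,d\theta
       - \int_0^1 \pi(\theta)\,\ln \pi(\theta)\,d\theta,
\qquad
I(\theta) = \sum_{y} p(y \mid \theta)\,\ln p(y \mid \theta).

% Stationarity with a Lagrange multiplier \lambda for \int \pi = 1:
I(\theta) - \ln \pi(\theta) - 1 + \lambda = 0
\quad\Longrightarrow\quad
\pi(\theta) \propto \exp\{ I(\theta) \}.

% Single Bernoulli trial, y \in \{0, 1\}:
I(\theta) = \theta \ln \theta + (1 - \theta)\ln(1 - \theta)
\quad\Longrightarrow\quad
\pi(\theta) \propto \theta^{\theta}\,(1 - \theta)^{1 - \theta}.
```

Under these assumptions the maximizer takes the value 1 at $\theta = 0$ and $\theta = 1$ and the value 1/2 at $\theta = 1/2$ before normalization, which agrees with the bowl shape described above.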
Zellner's approach to ignorance priors and Jaynes's approach in PTLOS appear
to be incompatible. Jaynes argues that the proper definition of entropy for a
continuous distribution involves use of the measure $m(x)$ describing complete
ignorance for the sample space, so you must already have your
ignorance prior in hand before you can even define the entropy/information of
a prior pdf. Zellner agrees on the necessity of choosing an information
measure for defining the entropy of a continuous distribution, but
considers this to be a separate problem--much like that of choosing a
temperature scale (Celsius, Fahrenheit, or Kelvin)--from that of producing a
least informative prior density.
See also ``Some Aspects of the History of
Bayesian Information Processing'' (to appear, Journal of
Econometrics).
One may be confused by the fact that integrating out of
(defined in (12.67)) and doing the appropriate change of variables from to
does not yield (12.68). This is because is not, strictly
speaking, a pdf in the variables and -- it is an area
density. The pdf is actually
. (See the first
paragraph under ``Rotational invariance,'' where Jaynes writes ``What