Chapter 12: Ignorance priors and transformation groups

  • p. 375, equation (12.7): Insert a minus sign in front of the integral.

  • p. 378, line 4: ``Harr'' should be ``Haar.''

  • p. 378, equation (12.19): `` $\psi(x', \nu', \sigma')$'' should be `` $\psi(x', \nu', \sigma') dx'$ ''.

  • p. 381, second half of page: ``lies in the equations $x' = a x + b$, $\nu'
= a x + b$'' should be ``lies in the equations $x' = a x + b$, $\nu'
= a \nu + b$''.

  • p. 382, equation (12.37): The right-hand side of the equation is wrong; it should be ``$\exp(- \lambda t) (\lambda t)^{n} / n!$''.

  • p. 384, equation (12.48): the denominator of the left-hand side should be $1-\theta+a\theta$.

  • p. 385, equation (12.51): ``$(n-1)!$'' should be ``$(n-2)!$''.

  • p. 386, fourth full paragraph: ``Kendell'' should be ``Kendall.''

  • p. 394, third full paragraph, second line: ``James Clark Maxwell'' should be ``James Clerk Maxwell.''

Commentary on 12.4.3: Unknown probability for success

Chapters 11 and 12 I found quite exciting and useful, as the construction of reasonable priors is a subject that seems to get short shrift in most books on Bayesian methods, and the notion of an objective prior, one that encodes exactly the information one has at hand and nothing more, is quite appealing.

However, I disagree with Jaynes's construction in 12.4.3 of an ignorance prior for an "unknown probability for success" $\theta$, which he concludes should be an improper prior proportional to $\theta^{-1} (1 -
\theta)^{-1}$ over the interval $[0,1]$. (This appears to have been first suggested as an ignorance prior by J. B. S. Haldane in 1932.) I will argue that Jaynes's rules point to the uniform distribution over $[0,1]$ as the appropriate ignorance prior. I'll begin by critiquing specific passages in 12.4.3.
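
Written out explicitly, the two candidates are

\begin{displaymath}
p(\theta \mid I) \propto \theta^{-1} (1 - \theta)^{-1} \qquad \mbox{(Haldane)},
\qquad\qquad
p(\theta \mid I) = 1 \mbox{ on } [0,1] \qquad \mbox{(uniform)}.
\end{displaymath}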

  • p. 383, second full paragraph:
    For example, in a chemical laboratory we find a jar containing an unknown and unlabeled compound. We are at first completely ignorant as to whether a small sample of this compound will dissolve in water or not. But, having observed that one small sample does dissolve, we infer immediately that all samples of this compound are water soluble, and although this conclusion does not carry quite the force of deductive proof, we feel strongly that the inference was justified. Yet the Bayes-Laplace rule [uniform prior] leads to a negligibly small probability for this being true, and yields only a probability of $2/3$ that the next sample tested will dissolve.

    Critique: This example is irrelevant for evaluating proposed ignorance priors over $\theta$, as this is a situation where we have quite substantial prior information. We know that the relevant information in determining whether a sample of some solid compound will dissolve in water is

    • the chemical identity of the sample,
    • the quantity of sample,
    • the quantity of water, and
    • the temperature.
    All of these are factors we can easily control, and so if we repeat the experiment with the same unknown compound, keeping the other factors the same, we strongly expect to get the same result. That is, this prior information tells us that $\theta$ should be (nearly?) 0 or (nearly?) 1, given any particular values for the above four factors.

  • p. 383, third full paragraph and onward:
    [...] There is a conceptual difficulty here, since $f(\theta) d\theta$ is a `probability for a probability'. However, it can be removed by carrying the notion of a split personality to extremes; instead of supposing that $f(\theta)$ describes the state of knowledge of any one person, imagine that we have a large population of individuals who hold varying beliefs about the probability for success, and that $f(\theta)$ describes the distribution of their beliefs.

    Critique: This artifice is unnecessary. Following Jaynes's advice to start with the finite and take the infinite only as a well-defined limit, we can begin by considering a case of $n$ trials, and define $\theta = (\mbox{\# successes})/n$. Our distribution for $\theta$ is then a probability of a frequency, not a probability of a probability, and there is no conceptual difficulty. We then take the limit as $n \rightarrow \infty$.

  • Continuing:
    Is it possible that, although each individual holds a definite opinion, the population as a whole is completely ignorant of $\theta$? What distribution $f(\theta)$ describes a population in a state of total confusion on the issue? [...]

    Now suppose that, before the experiment is performed, one more definite piece of evidence E is given simultaneously to all of them. Each individual will change his state of belief according to Bayes' theorem; Mr. $X$, who had previously held the probability for success to be

    \begin{displaymath}
\theta = p(S \mid X)\qquad\mbox{(12.42)}
\end{displaymath}

    will change it to

    \begin{displaymath}
\theta' = p(S \mid E,X) = \mbox{[omitted]}\qquad\mbox{(12.43)}
\end{displaymath}

    [...] This new evidence thus generates a mapping of the parameter space $0 \leq \theta \leq 1$ onto itself, given from (12.43) by

    \begin{displaymath}
\theta' = \frac{a \theta}{1 - \theta + a \theta}\qquad\mbox{(12.44)}
\end{displaymath}

    [...] If the population as a whole can learn nothing from this new evidence, then it would seem reasonable to say that the population has been reduced, by conflicting propaganda, to a state of total confusion on the issue. We therefore define the state of `total confusion' or `complete ignorance' by the condition that, after the transformation (12.44), the number of individuals who hold beliefs in any given range $\theta_1 < \theta < \theta_2$ is the same as before.

    Critique: I find this characterization of complete ignorance to be quite puzzling. I just don't see any reason why this corresponds to any notion of complete ignorance. Furthermore, there are certain possible new pieces of evidence $E$ that must change the overall distribution of beliefs -- for example, $E$ might be frequency data for the first $N$ trials, or even a definite statement about the value of $\theta$ itself. There is also some ambiguity here. Inference about $\theta$ only makes sense in the context of repeated trials; so, does $S$ above really mean $S_i$ (success at $i$-th trial) for some arbitrary $i$? If so, we must also assume that $E$ is carefully chosen so that $p(E \mid S_i, X)$ has no dependence on (unobserved values of) $i$, so that $p(S_i \mid E, X)$ remains independent of $i$.

  • p. 384, sentence following equation (12.43):
    This new evidence thus generates a mapping of the parameter space $0 \leq \theta \leq 1$ onto itself, given from (12.43) by

    \begin{displaymath}
\theta' = \frac{a \theta}{1 - \theta + a \theta}\qquad\mbox{(12.44)}
\end{displaymath}

    where

    \begin{displaymath}
a = \frac{p(E \mid S, X)}{p(E \mid F, X)}. \qquad\mbox{(12.45)}
\end{displaymath}

    Critique: It seems to me that Jaynes is here committing an error that he warns against elsewhere: erroneously identifying distinct states of information as the same. In particular, $a$ is a function of the particular individual $X$, since we are conditioning on different states of information for each individual. In my view, this destroys the entire construction, as we no longer have the transformation (12.44).
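
    To make this objection explicit: writing $\theta_X = p(S \mid X)$ for Mr. $X$'s prior belief, (12.43) actually gives

\begin{displaymath}
\theta'_X = \frac{a_X \theta_X}{1 - \theta_X + a_X \theta_X},
\qquad
a_X = \frac{p(E \mid S, X)}{p(E \mid F, X)},
\end{displaymath}

    so the single mapping (12.44) of the parameter space onto itself exists only if $a_X$ happens to take the same value for every individual $X$, which nothing in the setup guarantees.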

Here is my alternative proposal for an ignorance prior, following Jaynes's own advice. We begin with section 12.3, ``Continuous distributions,'' wherein Jaynes writes,

In the discrete entropy expression

\begin{displaymath}
H_I^d = -\sum_{i=1}^n p_i \log p_i
\end{displaymath}

we suppose that the discrete points $x_i$, $i = 1,2,\ldots,n$, become more and more numerous, in such a way that, in the limit $n \rightarrow \infty$,

\begin{displaymath}
\lim_{n \rightarrow \infty} (\mbox{no. of points in $a < x < b$})/n = \int_a^b dx\, m(x).
\end{displaymath}

If this passage to the limit is sufficiently well-behaved, [...] [t]he discrete probability distribution $p_i$ will go over into a continuous probability $p(x \mid I)$ [...] The `invariant measure' function, $m(x)$ is proportional to the limiting density of discrete points.
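
For reference, the limiting form that results (up to an additive constant that diverges with $n$) is the entropy relative to the measure $m$,

\begin{displaymath}
H_I^c = -\int p(x \mid I) \log \left[ \frac{p(x \mid I)}{m(x)} \right] dx,
\end{displaymath}

which is maximized, in the absence of any constraints beyond normalization, by taking $p(x \mid I)$ proportional to $m(x)$.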

Then at the beginning of p. 377, Jaynes writes,

Except for a constant factor, the measure $m(x)$ is also the prior distribution describing `complete ignorance' of $x$.

On p. 376, last complete paragraph, Jaynes motivates the introduction of invariance transformations by writing,

If the parameter space is not the result of any obvious limiting process, what determines the proper measure $m(x)$?
thus strongly implying that if there is an obvious limiting process, this is the preferred method for constructing $m(x)$.

But in this problem there is, in fact, an obvious limiting process -- the one mentioned at the beginning of this commentary. That is, we start by considering a finite case of $n$ trials, define $\theta = (\mbox{\# successes})/n$, and define

\begin{displaymath}
p(x_1, \ldots, x_n \mid \theta, I)
\end{displaymath}

as in section 3.1 (sampling without replacement). ($x_i$ is 1 if the $i$-th trial is a success, and 0 otherwise.) Since $\theta$ has a finite set of possible values, and ``ignorance'' means we are placing no constraints on the distribution over $\theta$, Chapter 11 tells us that we should use the maximum-entropy distribution for $\theta$, i.e., the uniform distribution over

\begin{displaymath}
0,\; 1/n,\; 2/n,\; \ldots,\; (n-1)/n,\; 1.
\end{displaymath}
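
To spell out the intermediate step: with $R = n\theta$ successes among the $n$ trials, sampling without replacement assigns to any particular sequence of the first $k$ trials containing $s = \sum_{i=1}^k x_i$ successes the probability

\begin{displaymath}
p(x_1, \ldots, x_k \mid \theta, I)
= \frac{R(R-1)\cdots(R-s+1)\; (n-R)(n-R-1)\cdots(n-R-(k-s)+1)}{n(n-1)\cdots(n-k+1)}.
\end{displaymath}

Each falling factorial here has a fixed number of terms as $n \rightarrow \infty$, so the numerator behaves like $R^s (n-R)^{k-s}$ and the denominator like $n^k$.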

In the limit as $n \rightarrow \infty$ while $k$ remains fixed we get

\begin{displaymath}
p(x_1, \ldots, x_k \mid \theta, I) =
\theta^s (1 - \theta)^{k-s},
\end{displaymath}

where $s = \sum_{i=1}^k x_i$, and the prior over $\theta$ turns into a uniform pdf over $[0,1]$.
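
The same answer follows from the section 12.3 passage quoted above: the limiting density of the discrete points $0, 1/n, \ldots, 1$ is

\begin{displaymath}
\lim_{n \rightarrow \infty} (\mbox{no. of points in $a < \theta < b$})/n = b - a = \int_a^b d\theta\, m(\theta),
\qquad m(\theta) = 1,
\end{displaymath}

so the invariant measure is $m(\theta) = 1$, and the prior describing complete ignorance of $\theta$ is (up to a constant factor) the uniform pdf over $[0,1]$.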

As a final note, I have some misgivings about even this solution. The problem is that we are not, in fact, completely ignorant about $\theta$. We know of some additional structure to the problem -- that is, we know that $\theta$ (in the finite case) is derived from the results of the trials $x_i$ via $\theta = \sum_i x_i/n$. One could argue that we should therefore derive the prior over $\theta$ from the ignorance prior over $x_1,\ldots,x_n$. As Jaynes discusses in Chapter 3 (?), in the limit of $n \rightarrow \infty$ this amounts to a prior that gives probability 1 to $\theta=1/2$ (the maximum-entropy distribution over the $2^n$ possible sequences makes the $x_i$ independent with $p(x_i = 1 \mid I) = 1/2$, so the fraction of successes concentrates at $1/2$), and we find that we are incapable of learning--

\begin{displaymath}
p(x_{k+1} \mid x_1,\ldots,x_k, I) = p(x_{k+1} \mid I) = 1/2.
\end{displaymath}

Thus it seems that any nondegenerate prior for $\theta$ is, in some sense, informative. At the very least, it tells us that the various trials are subject to some common logical influence.


Commentary on 12.4.3: Other approaches

Arnold Zellner contributed the following references to other priors that have been suggested for the binomial parameter (probability of success):

  • Theory of Probability (1967), by Sir Harold Jeffreys, pp. 123-125, contains a discussion of various priors for the binomial parameter. He believes that the uniform prior is too flat at the end points and that the improper prior $\theta^{-1} (1 -
\theta)^{-1}$ goes up too much at the end points, 0 and 1, placing too much probability mass in the vicinity of 0 and 1. Therefore he lumps some probability at 0 and some at 1, with the rest spread uniformly between 0 and 1.

  • Bayesian Analysis in Econometrics and Statistics, by Arnold Zellner, pp. 117-118, discusses a ``maximal data information'' prior proportional to $\theta^{\theta} (1-\theta)^{1-\theta}$. This is a bowl-shaped density that is proper and whose value at 0 and 1 is twice its value at 0.5. Elsewhere in the same book he discusses the derivation of ``maximal data information priors'' in more detail.

Zellner's maximal data information prior is defined as that prior which maximizes a quantity $G$ defined as the prior average information in the data pdf, minus the information in the prior pdf. The ``information'' here is the negative of the entropy.
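
As I understand Zellner's construction, for a single Bernoulli observation this means maximizing

\begin{displaymath}
G = \int_0^1 p(\theta)\, I(\theta)\, d\theta - \int_0^1 p(\theta) \log p(\theta)\, d\theta,
\qquad
I(\theta) = \theta \log \theta + (1-\theta) \log (1-\theta),
\end{displaymath}

subject to normalization; the maximizing prior is $p(\theta) \propto \exp\{I(\theta)\} = \theta^{\theta} (1-\theta)^{1-\theta}$, the bowl-shaped density described above.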

Zellner's approach to ignorance priors and Jaynes's approach in PTLOS appear to be incompatible. Jaynes argues that the proper definition of entropy for a continuous distribution involves use of the measure $m(x)$ describing complete ignorance for the sample space, so you must already have your ignorance prior in hand before you can even define the entropy/information of a prior pdf. Zellner agrees on the necessity of choosing an information measure $m(x)$ for defining the entropy of a continuous distribution, but considers this to be a separate problem--much like that of choosing a temperature scale (Celsius, Fahrenheit, or Kelvin)--from that of producing a least informative prior density.

See also ``Some Aspects of the History of Bayesian Information Processing'' (to appear, Journal of Econometrics), which may be found here.

Commentary on 12.4.4: Bertrand's problem

One may be confused by the fact that integrating $\theta$ out of $f(r,\theta)$ (defined in (12.67)) and doing the appropriate change of variables from $r$ to $x$ does not yield (12.68). This is because $f(r,\theta)$ is not, strictly speaking, a pdf in the variables $r$ and $\theta$ -- it is an area density. The $(r,\theta)$ pdf is actually $r f(r,\theta)$. (See the first paragraph under ``Rotational invariance,'' where Jaynes writes ``What probability density $f(r,\theta) dA = f(r,\theta) r dr d\theta$ should we assign...'')
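
In symbols: since the area element is $dA = r\, dr\, d\theta$, the pdf with respect to $dr\, d\theta$ is $r f(r,\theta)$, and the marginal pdf for $r$ is

\begin{displaymath}
p(r \mid I) = \int_0^{2\pi} r\, f(r,\theta)\, d\theta;
\end{displaymath}

applying the change of variables from $r$ to $x$ to this density (with the usual Jacobian factor $|dr/dx|$) is what should reproduce (12.68).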
