Bayes Home
Jaynes Errata
next up previous


Chapter 11: Discrete prior probabilities: the entropy principle

  • p. 359, equation (11.63): insert minus sign before `` $\partial^{2} \log

  • p. 360, line 3: ``$f_{k}(x_{i})$'' should be `` $f_{k}(x_{i}; \alpha)$''.

  • p. 360, equation (11.65): to be consistent, `` $f_{k}(x_{i}, \alpha)$'' should be `` $f_{k}(x_{i}; \alpha)$''.

  • p. 360, equation (11.69): insert minus sign before `` $\partial^{2} \log

  • p. 361, equation (11.72), second line: `` $\sum_{i-1}^{n}$'' should be `` $\sum_{i=1}^{n}$''.

  • p. 362, equation (11.81): Should `` $\langle g_{j} f_{k}\rangle$'' be `` $\langle g_{j} g_{k}\rangle$''?

  • p. 367, equation (11.92): ``$d$'' should be ``$k$''.

  • p. 368, second-to-last paragraph: I can't make any sense of this. Can anyone explain this or provide some examples?

Commentary: Computing parameters of a maxent distribution

Unfortunately, Jaynes doesn't say much about how one finds the specific parameter values $\lambda_i$ that achieve the desired expectations $F_i$. When the functions $f_i(x)$ have bounded, nonnegative values, the generalized iterative scaling and improved iterative scaling algorithms, discussed in the following references, can be used:

  • J. Darroch and D. Ratcliff, ``Generalized iterative scaling for log-linear models,'' Ann. Math. Statist. 43, 1470-1480, 1972.
  • S. Della Pietra, V. Della Pietra, and J. Lafferty, ``Inducing features of random fields,'' IEEE Trans. on Pattern Analysis and Machine Intelligence 19, number 4, pp. 380-393, 1997. (Available here.)
  • A. Berger, ``The improved iterative scaling algorithm: a gentle introduction.'' (Available here.)
These algorithms are most useful when the partition function cannot be efficiently computed.

If the partition function $Z(\lambda)$ can be efficiently computed, then one can find the parameter values that produce the desired expected values $F_i$ by maximizing the function

h(\lambda) \equiv -\log Z(\lambda) -\sum_i \lambda_i F_i.

Note that $h(\lambda)$ is just the $1/N$ times the log of the likelihood function for the maxent form when the data are such that the average of each $f_i(x)$ is $F_i$. The reason this works is that

\frac{\partial h(\lambda)}{\partial\lambda_i} =
E[f_i(x) \mid \lambda] - F_i;

at the maximum, the derivatives are zero, so $E[f_i(x) \mid \lambda] = F_i$.

To better understand why maximizing $h$ is useful, let us consider the discrepancy $\delta(q\mid p)$ (a.k.a. directed divergence or Kullback-Liebler divergence) between an approximation $q$ to a distribution and the distribution $p$ itself. This is defined as $E[\log(p(x)/q(x))]$, where the expectation is taken over $p$. The discrepancy is always nonnegative, and equal to zero only if the two distributions are identical. If base-2 logarithms are used, the discrepancy may be thought of as the number of bits of information lost by using the approximation.

We are interested in a particular parameter vector $\lambda_0$ that gives $E[f_i(x) \mid \lambda_0] = F_i$ for all $i$. Our current best guess $\lambda$ for that parameter vector defines a distribution that may be considered an approximation to the distribution obtained using the unknown $\lambda_0$. The discrepancy between these is

\delta(p_{\lambda} \mid p_{\lambda_0})
&=& E[\log p(x\mid \l...
... F_i \\
&=& -h(\lambda) + \mbox{terms not involving $\lambda$}

So increasing $h(\lambda)$ decreases the discrepancy from the desired distribution.

next up previous