ksvanhorn.com

# Chapter 11: Discrete prior probabilities: the entropy principle

• p. 359, equation (11.63): insert minus sign before  ''.

• p. 360, line 3: '' should be  ''.

• p. 360, equation (11.65): to be consistent,  '' should be  ''.

• p. 360, equation (11.69): insert minus sign before  ''.

• p. 361, equation (11.72), second line:  '' should be  ''.

• p. 362, equation (11.81): Should  '' be  ''?

• p. 367, equation (11.92): '' should be ''.

• p. 368, second-to-last paragraph: I can't make any sense of this. Can anyone explain this or provide some examples?

## Commentary: Computing parameters of a maxent distribution

Unfortunately, Jaynes doesn't say much about how one finds the specific parameter values $\lambda_k$ that achieve the desired expectations $F_k$. When the functions $f_k$ have bounded, nonnegative values, the generalized iterative scaling and improved iterative scaling algorithms, discussed in the following references, can be used:

• J. Darroch and D. Ratcliff, "Generalized iterative scaling for log-linear models," Ann. Math. Statist. 43, pp. 1470-1480, 1972.
• S. Della Pietra, V. Della Pietra, and J. Lafferty, "Inducing features of random fields," IEEE Trans. on Pattern Analysis and Machine Intelligence 19, number 4, pp. 380-393, 1997.
• A. Berger, "The improved iterative scaling algorithm: a gentle introduction."

These algorithms are most useful when the partition function cannot be efficiently computed.
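
As a rough illustration of generalized iterative scaling, here is a minimal Python sketch for a six-sided die constrained to have mean 4.5. The example problem, the function name `gis_die`, and the slack feature $6 - x$ (added so that the nonnegative features sum to a constant, as the algorithm requires) are all assumptions made for this sketch, not something taken from the references. The sketch also uses the log-linear sign convention $p(x) \propto \exp(\sum_k \theta_k f_k(x))$, so $\theta_k = -\lambda_k$ in Jaynes's convention.

```python
import math

def gis_die(target_mean=4.5, iterations=200):
    """Sketch of generalized iterative scaling for a die with a given mean.

    Model: p(x) proportional to exp(theta_1*f_1(x) + theta_2*f_2(x)).
    In Jaynes's sign convention, lambda_k = -theta_k.
    """
    xs = [1, 2, 3, 4, 5, 6]
    C = 6.0
    # f_1(x) = x is the real constraint; f_2(x) = C - x is a slack feature
    # (an assumption of this sketch) so the nonnegative features sum to C.
    feats = [(float(x), C - x) for x in xs]
    targets = (target_mean, C - target_mean)

    def distribution(theta):
        weights = [math.exp(theta[0] * f1 + theta[1] * f2) for f1, f2 in feats]
        Z = sum(weights)
        return [w / Z for w in weights]

    theta = [0.0, 0.0]
    for _ in range(iterations):
        probs = distribution(theta)
        expected = [sum(p * f[k] for p, f in zip(probs, feats)) for k in range(2)]
        # GIS update: theta_k += (1/C) * log(target_k / current expectation_k)
        theta = [t + math.log(F / E) / C
                 for t, F, E in zip(theta, targets, expected)]
    return theta, distribution(theta)

theta, probs = gis_die()
mean = sum(x * p for x, p in zip(range(1, 7), probs))
print(theta, mean)
```

Each iteration needs only the current expectations of the features, and in this tiny example those are computed by direct summation. In the large problems the algorithm is designed for, the expectations would be estimated some other way.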

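If the partition function $Z(\lambda)$ can be efficiently computed, then one can find the parameter values that produce the desired expected values by maximizing the function

$$h(\lambda) = -\sum_k \lambda_k F_k - \log Z(\lambda),$$

where the maxent distribution has Jaynes's form $p(x \mid \lambda) = \exp\left(-\sum_k \lambda_k f_k(x)\right) / Z(\lambda)$. Note that $h(\lambda)$ is just $1/n$ times the log of the likelihood function for the maxent form when the $n$ data points are such that the average of each $f_k$ is $F_k$. The reason this works is that

$$\frac{\partial h}{\partial \lambda_k} = -F_k - \frac{\partial \log Z}{\partial \lambda_k} = E[f_k \mid \lambda] - F_k;$$

at the maximum, the derivatives are zero, so $E[f_k \mid \lambda] = F_k$.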
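
The maximization can be sketched in a few lines of Python for the simplest case of a single constraint. This assumes Jaynes's form $p(x \mid \lambda) \propto \exp(-\lambda f(x))$ and the objective $-\lambda F - \log Z(\lambda)$, whose derivative is $E[f \mid \lambda] - F$ and whose second derivative is $-\mathrm{Var}[f \mid \lambda]$. The function name `solve_lambda`, the use of Newton's method, and the die-with-mean-4.5 example are illustrative choices for this sketch, not part of Jaynes's text.

```python
import math

def maxent_probs(lam, fs):
    """Maxent distribution p_i = exp(-lam * f_i) / Z(lam) over a finite set."""
    weights = [math.exp(-lam * fx) for fx in fs]
    Z = sum(weights)
    return [w / Z for w in weights]

def solve_lambda(values, f, F, iterations=50, tol=1e-12):
    """Maximise -lam*F - log Z(lam) by Newton's method (one constraint only)."""
    fs = [f(x) for x in values]
    lam = 0.0
    for _ in range(iterations):
        probs = maxent_probs(lam, fs)
        mean = sum(p * fx for p, fx in zip(probs, fs))
        var = sum(p * (fx - mean) ** 2 for p, fx in zip(probs, fs))
        grad = mean - F              # derivative of the objective
        if abs(grad) < tol:
            break
        lam += grad / var            # Newton step; second derivative is -var
    return lam, maxent_probs(lam, fs)

lam, probs = solve_lambda([1, 2, 3, 4, 5, 6], float, 4.5)
mean = sum(x * p for x, p in zip(range(1, 7), probs))
print(lam, mean)
```

Because $\log Z$ is convex, the objective is concave and any stationary point is the global maximum. With several constraints the same idea applies, with the gradient and the covariance matrix of the $f_k$ taking the place of the scalar derivative and variance.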

To better understand why maximizing this function is useful, let us consider the discrepancy (a.k.a. directed divergence or Kullback-Leibler divergence) between an approximation $q$ to a distribution $p$ and the distribution $p$ itself. This is defined as $D(p \,\|\, q) = E[\log(p(x)/q(x))]$, where the expectation is taken over $p$. The discrepancy is always nonnegative, and equal to zero only if the two distributions are identical. If base-2 logarithms are used, the discrepancy may be thought of as the number of bits of information lost by using the approximation.
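
For a concrete feel for the discrepancy, here is a small Python sketch that computes it in bits for two discrete distributions given as lists of probabilities. The function name `discrepancy_bits` and the coin example are made up for illustration.

```python
import math

def discrepancy_bits(p, q):
    """Discrepancy of approximation q from true distribution p, in bits.

    Computes the expectation over p of log2(p(x) / q(x)); terms with
    p(x) = 0 contribute nothing.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair = [0.5, 0.5]        # the true distribution
biased = [0.9, 0.1]      # the approximation
d = discrepancy_bits(fair, biased)
print(d)
```

Approximating a fair coin by a 90/10 coin loses roughly 0.74 bits per toss, while `discrepancy_bits(fair, fair)` is exactly zero.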

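We are interested in a particular parameter vector $\lambda^*$ that gives $E[f_k \mid \lambda^*] = F_k$ for all $k$. Our current best guess $\lambda$ for that parameter vector defines a distribution $p(x \mid \lambda)$ that may be considered an approximation to the distribution $p(x \mid \lambda^*)$ obtained using the unknown $\lambda^*$. Using $\log p(x \mid \lambda) = -\sum_k \lambda_k f_k(x) - \log Z(\lambda)$, the discrepancy between these is

$$E\left[\log \frac{p(x \mid \lambda^*)}{p(x \mid \lambda)} \,\Big|\, \lambda^*\right] = \left(-\sum_k \lambda^*_k F_k - \log Z(\lambda^*)\right) - \left(-\sum_k \lambda_k F_k - \log Z(\lambda)\right) = h(\lambda^*) - h(\lambda),$$

where $h(\lambda) = -\sum_k \lambda_k F_k - \log Z(\lambda)$ is the function being maximized. So increasing $h(\lambda)$ decreases the discrepancy from the desired distribution.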
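
The relationship between the maximized function and the discrepancy can be checked numerically. The Python sketch below assumes a single constraint $f(x) = x$ on a six-sided die, Jaynes's form $p(x \mid \lambda) \propto \exp(-\lambda x)$, and the objective $-\lambda F - \log Z(\lambda)$, written `h` in the code. It picks an arbitrary "true" parameter `lam_star` (a made-up value), sets $F$ to the expectation it produces, and compares the discrepancy with the difference in objective values for a few guesses.

```python
import math

XS = [1, 2, 3, 4, 5, 6]

def log_Z(lam):
    """Log of the partition function for p(x | lam) proportional to exp(-lam * x)."""
    return math.log(sum(math.exp(-lam * x) for x in XS))

def probs(lam):
    lz = log_Z(lam)
    return [math.exp(-lam * x - lz) for x in XS]

def h(lam, F):
    """Objective: -lam * F - log Z(lam)."""
    return -lam * F - log_Z(lam)

def discrepancy(p, q):
    """Expectation over p of log(p/q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

lam_star = -0.4                                   # hypothetical "true" parameter
p_star = probs(lam_star)
F = sum(x * p for x, p in zip(XS, p_star))        # expectation it achieves

def check(lam):
    """Return (discrepancy, h(lam_star) - h(lam)) for a guess lam."""
    return discrepancy(p_star, probs(lam)), h(lam_star, F) - h(lam, F)

for guess in (-1.0, -0.5, 0.0, 0.3):
    print(guess, check(guess))
```

The two numbers printed for each guess agree up to rounding, and both shrink as the guess approaches `lam_star`.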