p. 359, equation (11.63): insert a minus sign before [expression].
p. 360, line 3: [expression] should be [expression].
p. 360, equation (11.65): to be consistent, [expression] should be [expression].
p. 360, equation (11.69): insert a minus sign before [expression].
p. 361, equation (11.72), second line: [expression] should be [expression].
p. 362, equation (11.81): should [expression] be [expression]?
p. 367, equation (11.92): [expression] should be [expression].
p. 368, second-to-last paragraph: I can't make any sense of this. Can anyone explain this or provide some examples?
Commentary: Computing parameters of a maxent distribution
Unfortunately, Jaynes doesn't say much about how one finds the specific
parameter values $\lambda_k$ that achieve the desired expectations
$\langle f_k(x) \rangle = F_k$.
When the functions $f_k$ have bounded, nonnegative values, the generalized
iterative scaling and improved iterative scaling algorithms, discussed in the
following references, can be used:
J. Darroch and D. Ratcliff, ``Generalized iterative scaling for
log-linear models,'' Ann. Math. Statist. 43, pp. 1470-1480, 1972.
S. Della Pietra, V. Della Pietra, and J. Lafferty, ``Inducing features
of random fields,'' IEEE Trans. on Pattern Analysis and Machine
Intelligence 19, no. 4, pp. 380-393, 1997.
A. Berger, ``The improved iterative scaling algorithm: a gentle
introduction.''
These algorithms are most useful when the partition function cannot be
efficiently computed.
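As a concrete illustration (my own sketch, not code from the references above), here is a minimal version of generalized iterative scaling on a tiny discrete state space, where everything can be enumerated. It assumes the standard GIS requirement that the features are nonnegative and sum to the same constant $C$ for every state (a slack feature can always be added to enforce this); the toy features and targets below are made up.

```python
import numpy as np

def gis(F, feats, n_iter=500):
    """Generalized iterative scaling (sketch).

    F     : target expectations F_k, shape (m,)
    feats : feature values f_k(x), shape (num_states, m); assumed
            nonnegative with the same row sum C for every state x.
    Returns a parameter vector lambda whose maxent distribution
    has expectations close to F.
    """
    C = feats.sum(axis=1)[0]           # constant row sum (GIS requirement)
    lam = np.zeros(feats.shape[1])
    for _ in range(n_iter):
        # current maxent distribution p(x) proportional to exp(sum_k lam_k f_k(x))
        w = np.exp(feats @ lam)
        p = w / w.sum()
        E = p @ feats                  # current expectations <f_k>
        lam += np.log(F / E) / C       # multiplicative GIS update
    return lam

# toy problem: 3 states, 2 features, constant row sum C = 3
feats = np.array([[1.0, 2.0], [2.0, 1.0], [0.0, 3.0]])
F = np.array([1.2, 1.8])               # desired expectations (must sum to C here)
lam = gis(F, feats)
w = np.exp(feats @ lam)
p = w / w.sum()
# p @ feats is now close to F
```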
If the partition function
$Z(\lambda) = \sum_x \exp\bigl(\sum_k \lambda_k f_k(x)\bigr)$
can be efficiently computed, then one
can find the parameter values that produce the desired expected values
by maximizing the function
$$G(\lambda) = \sum_k \lambda_k F_k - \log Z(\lambda).$$
Note that $G(\lambda)$ is just $1/n$ times the log of the likelihood
function for the maxent form when the data $x_1,\dots,x_n$ are such that the average of each
$f_k(x_i)$ is $F_k$. The reason this works is that
at the maximum, the derivatives
$\partial G / \partial \lambda_k = F_k - \langle f_k \rangle_{\lambda}$
are zero, so $\langle f_k \rangle_{\lambda} = F_k$ for all $k$.
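Since this objective is concave, plain gradient ascent suffices when the state space is small enough to enumerate. A minimal sketch (the state space, feature, target value, and step size below are made-up illustrations):

```python
import numpy as np

def fit_maxent(F, feats, lr=0.5, n_iter=2000):
    """Maximize G(lam) = sum_k lam_k F_k - log Z(lam) by gradient ascent.

    F     : desired expectations, shape (m,)
    feats : feature values f_k(x) for each state x, shape (num_states, m)
    """
    lam = np.zeros(feats.shape[1])
    for _ in range(n_iter):
        w = np.exp(feats @ lam)
        p = w / w.sum()                # maxent distribution for current lam
        grad = F - p @ feats           # dG/dlam_k = F_k - <f_k>_lam
        lam += lr * grad
    return lam

# toy example: 4 states, one feature f(x) = x
feats = np.array([[0.0], [1.0], [2.0], [3.0]])
lam = fit_maxent(np.array([1.0]), feats)
w = np.exp(feats @ lam)
p = w / w.sum()
# the expectation of the feature under p is now approximately 1.0
```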
To better understand why maximizing $G$ is useful, let us consider the
discrepancy (a.k.a. directed divergence or
Kullback-Leibler divergence) between an approximation $q$ to a distribution
$p$ and the distribution itself. This is defined as
$$D(p \,\|\, q) = \left\langle \log \frac{p(x)}{q(x)} \right\rangle,$$
where the expectation is taken over $p$. The discrepancy is always
nonnegative, and equal to zero only if the two distributions are identical.
If base-2 logarithms are used, the discrepancy may be thought of as the number
of bits of information lost by using the approximation.
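For concreteness, the discrepancy between two small discrete distributions can be computed directly; here is a sketch using base-2 logarithms so the answer is in bits (the distributions are made-up examples, assumed to share a support):

```python
import numpy as np

def discrepancy(p, q):
    """Kullback-Leibler divergence D(p||q) = sum_x p(x) log2(p(x)/q(x)).

    Assumes p and q are discrete distributions on the same support,
    with q(x) > 0 wherever p(x) > 0.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2(p / q)))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]        # uniform approximation to p
d = discrepancy(p, q)       # positive: bits lost by using q in place of p
d0 = discrepancy(p, p)      # zero: no information lost
```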
We are interested in a particular parameter vector $\lambda^*$ that gives
$\langle f_k \rangle_{\lambda^*} = F_k$ for all $k$. Our current best guess
$\lambda$ for that parameter vector defines a distribution that may be considered an
approximation to the distribution obtained using the unknown $\lambda^*$. The
discrepancy between these is
$$D\bigl(p(\cdot \mid \lambda^*) \,\|\, p(\cdot \mid \lambda)\bigr)
  = \left\langle \log \frac{p(x \mid \lambda^*)}{p(x \mid \lambda)} \right\rangle_{\lambda^*}
  = G(\lambda^*) - G(\lambda).$$
So increasing $G(\lambda)$ decreases the discrepancy from the desired
distribution.
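This identity is easy to check numerically on a small discrete state space, writing the maxent distribution as $p(x \mid \lambda) \propto \exp\bigl(\sum_k \lambda_k f_k(x)\bigr)$ (the feature values and parameters below are made up for illustration):

```python
import numpy as np

# 4 states, 2 features (arbitrary example values)
feats = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.5]])

def dist(lam):
    """Maxent distribution p(x|lam) over the 4 states."""
    w = np.exp(feats @ lam)
    return w / w.sum()

def G(lam, F):
    """G(lam) = sum_k lam_k F_k - log Z(lam), with natural logs."""
    return float(lam @ F - np.log(np.exp(feats @ lam).sum()))

lam_star = np.array([0.3, -0.2])       # "true" parameters
F = dist(lam_star) @ feats              # their expectations F_k
lam = np.array([0.0, 0.0])              # current guess

p_star, p = dist(lam_star), dist(lam)
kl = float(np.sum(p_star * np.log(p_star / p)))
# kl equals G(lam_star, F) - G(lam, F) up to floating-point error
```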
