Unfortunately, Jaynes doesn't say much about how one finds the specific parameter values that achieve the desired expectations. When the functions $f_i$ have bounded, nonnegative values, the generalized iterative scaling and improved iterative scaling algorithms, discussed in the references below, can be used.
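As a concrete illustration, here is a minimal sketch of generalized iterative scaling on a sample space small enough to enumerate. The function and variable names are my own, the targets $F_i$ are assumed strictly positive and achievable, and the slack feature is the standard device for meeting the algorithm's requirement that the features of every outcome sum to the same constant $C$.

```python
import numpy as np

def generalized_iterative_scaling(feats, targets, n_iters=500, tol=1e-10):
    """feats: (n_states, n_feats) matrix of nonnegative feature values f_i(x).
    targets: (n_feats,) desired expectations F_i (assumed positive, achievable).
    Returns the fitted parameters and the resulting distribution."""
    feats = np.asarray(feats, dtype=float)
    n_feats = feats.shape[1]

    # GIS requires sum_i f_i(x) = C for every x; pad with a slack feature.
    C = feats.sum(axis=1).max()
    feats = np.column_stack([feats, C - feats.sum(axis=1)])
    targets = np.append(targets, C - np.sum(targets))

    lam = np.zeros(n_feats + 1)
    for _ in range(n_iters):
        logp = feats @ lam
        p = np.exp(logp - logp.max())         # subtract max for stability
        p /= p.sum()                          # p_lambda(x) = exp(...) / Z
        expect = p @ feats                    # current E_lambda[f_i]
        if np.max(np.abs(expect - targets)) < tol:
            break
        lam += np.log(targets / expect) / C   # multiplicative GIS update
    return lam[:n_feats], p

# Two binary features over four states; match expectations 0.7 and 0.4.
feats = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
lam, p = generalized_iterative_scaling(feats, np.array([0.7, 0.4]))
print(p @ feats)   # ~ [0.7, 0.4]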
If the partition function $Z(\lambda)$ can be efficiently computed, then one can find the parameter values that produce the desired expected values $F_i$ by maximizing the function
$$\Psi(\lambda) = \sum_i \lambda_i F_i - \log Z(\lambda).$$
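For a sample space small enough to enumerate, $\Psi$ is straightforward to evaluate directly; a minimal sketch (names are illustrative), using a log-sum-exp to compute $\log Z(\lambda)$ stably:

```python
import numpy as np
from scipy.special import logsumexp

def psi(lam, feats, targets):
    """Psi(lambda) = sum_i lam_i * F_i - log Z(lambda), where feats is an
    (n_states, n_feats) matrix of feature values f_i(x)."""
    log_z = logsumexp(feats @ lam)   # log sum_x exp(sum_i lam_i f_i(x))
    return lam @ targets - log_z
```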
Note that $\Psi(\lambda)$ is just $1/N$ times the log of the likelihood function for the maxent form on $N$ data points when the data are such that the average of each $f_i$ is $F_i$. The reason this works is that
$$\frac{\partial \Psi}{\partial \lambda_i} = F_i - E_\lambda[f_i];$$
at the maximum, the derivatives are zero, so $E_\lambda[f_i] = F_i$ as desired.
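Since the gradient of $\Psi$ is just $F_i - E_\lambda[f_i]$, even plain gradient ascent works on small problems. A self-contained sketch under the same enumerable-sample-space assumption (the step size and names are my own choices):

```python
import numpy as np

def fit_by_gradient_ascent(feats, targets, lr=0.5, n_iters=5000, tol=1e-10):
    """Maximize Psi(lam) = lam @ targets - log Z(lam) by gradient ascent.
    At convergence the gradient F_i - E_lambda[f_i] vanishes, so the
    model expectations match the targets."""
    lam = np.zeros(feats.shape[1])
    for _ in range(n_iters):
        logp = feats @ lam
        p = np.exp(logp - logp.max())
        p /= p.sum()
        grad = targets - p @ feats    # dPsi/dlam_i = F_i - E_lambda[f_i]
        if np.max(np.abs(grad)) < tol:
            break
        lam += lr * grad
    return lam
```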
To better understand why maximizing $\Psi$ is useful, let us consider the discrepancy (a.k.a. directed divergence or Kullback-Leibler divergence) between an approximation $q$ to a distribution $p$ and the distribution itself. This is defined as
$$D(p \,\|\, q) = E_p\!\left[\log \frac{p(x)}{q(x)}\right],$$
where the expectation is taken over $p$. The discrepancy is always nonnegative, and equal to zero only if the two distributions are identical. If base-2 logarithms are used, the discrepancy may be thought of as the number of bits of information lost by using the approximation.
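A quick numeric illustration of the definition, with made-up distributions and base-2 logarithms so the result is in bits:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])    # the distribution itself
q = np.array([1/3, 1/3, 1/3])      # an approximation to it
d = np.sum(p * np.log2(p / q))     # D(p || q): expectation taken over p
print(d)                           # ~0.085 bits lost by using q for p
```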
We are interested in a particular parameter vector $\lambda^*$ that gives $E_{\lambda^*}[f_i] = F_i$ for all $i$. Our current best guess $\lambda$ for that parameter vector defines a distribution $p_\lambda$ that may be considered an approximation to the distribution $p_{\lambda^*}$ obtained using the unknown $\lambda^*$. The discrepancy between these is
$$D(p_{\lambda^*} \,\|\, p_\lambda) = E_{\lambda^*}\!\left[\log \frac{p_{\lambda^*}(x)}{p_\lambda(x)}\right] = \sum_i (\lambda^*_i - \lambda_i) F_i - \log Z(\lambda^*) + \log Z(\lambda) = \Psi(\lambda^*) - \Psi(\lambda),$$
where the middle equality uses $E_{\lambda^*}[f_i] = F_i$.
So increasing $\Psi(\lambda)$ decreases the discrepancy from the desired distribution.
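The identity $D(p_{\lambda^*} \,\|\, p_\lambda) = \Psi(\lambda^*) - \Psi(\lambda)$ is easy to check numerically; a small self-contained sanity check with arbitrary made-up features and parameters:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
feats = rng.random((6, 3))           # made-up feature values f_i(x)
lam_star = rng.normal(size=3)        # "true" parameters lambda*
lam = rng.normal(size=3)             # current guess lambda

def log_probs(l):
    logp = feats @ l
    return logp - logsumexp(logp)    # log p_lambda(x)

lp_star, lp = log_probs(lam_star), log_probs(lam)
targets = np.exp(lp_star) @ feats    # F_i = E_{lambda*}[f_i]

def psi(l):
    return l @ targets - logsumexp(feats @ l)

kl = np.sum(np.exp(lp_star) * (lp_star - lp))    # D(p* || p_lambda)
print(np.isclose(kl, psi(lam_star) - psi(lam)))  # True
```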