Dynamic programming, the basic idea
A while ago, AnonEcon used dynamic programming to solve a problem I posted; until that point, I’d only seen dynamic programming mentioned as a heuristic method for approaching some machine learning or computational geometry type questions, so I hadn’t had too much reason to find out more about it. But he demonstrated that it was more than just an algorithm, so I added it to my list of things it might be beneficial to know.
I picked up a copy of Bertsekas’ Dynamic Programming and Optimal Control book yesterday; so far it’s an enthralling book. The first introduction to dynamic programming I’ve seen that makes it clear what dynamic programming is. And it does look powerfully useful. I’m about half way through Chapter 1, which I hope to finish up later today and then review. I’m also going to attempt some of the problems– they look like fun too. (All the results and notations in this post are taken almost verbatim from Chapter 1 of Bertsekas’ book).
Like all other forms of mathematical programming, dynamic programming deals with optimization. Its goal is to balance the long-term cost of the aggregate decisions you made with the short-term cost of each individual decisions– you’d like to make decisions that make the optimum balance between short term and long term costs. This is particularly interesting when you just have a finite number of decisions to make, because in this case it seems like there would be some point where you would switch strategies. For example, if you were near the start of your process you’d be more worried about long term than short term costs, but at the last stage, you’d be worried only about short term costs. How do you know when to switch strategies?
Specifically, dynamic programming deals with the optimization of the expected cost of discrete-time dynamic systems with additive cost functions. This dynamic system determines the next state
of a system based upon the current state of the system
, a control
selected by the operator at each stage, and a random parameter
:
In the finite horizon case,
varies between
and in the infinite horizon case
, the number of decision-making stages, is inifinite. There is a cost function which determines the cost of selecting a particular control at each step:
, and in the finite horizon case, a terminal cost function
which determines the cost of ending in a given state.
A policy, or control law,
is a series of strategies
for selecting
based on
. The goal is to choose
such that it minimizes the expected value of the total cost
:
where the expectation is taken w.r.t the joint distribution of the r.v.s
.
This is of course a daunting problem in general, but the creator of dynamic programming, Richard Bellman (incidentally you might think his DP book would be a good one to learn from, but it is definitely not), formulated the following rather obvious principle of optimality
Let
be an optimal policy for the problem, and assume that when using
, a given state
occurs at time
with positive probability. Consider the subproblem whereby we are at
at time
and wish to minimize the “cost-to-go” from time
to time
![]()
Then the truncated policy
is optimal for this subproblem.
which suggests the following proposition/algorithm:
For every initial state
, the optimal cost
of the basic problem is equal to
given by the last step of the following algorithm that proceeds backward in time from period
to period
:
![]()
where the expectation is taken with respect to the probability distribution of
, which depends on
and
. Furthermore, if
minimizes the right hand side of the second equation above for each
and
, the policy
is optimal.
That’s the gist of it. If you can’t follow the above procedure analytically, then you can do it numerically. Pretty neat. At some point I’ll put up examples of it in use.
Possibly relevant posts:
- Complexification of the Rademacher comparison theorem? (8/18/2008)
- Convex Optimization and the Barrier Method (4/1/2008)
- Stochastic Lyapunov functions (5/6/2007)
be an optimal policy for the problem, and assume that when using
, a given state
occurs at time
with positive probability. Consider the subproblem whereby we are at 
is optimal for this subproblem.
, the optimal cost
of the basic problem is equal to
given by the last step of the following algorithm that proceeds backward in time from period
to period
:
minimizes the right hand side of the second equation above for each
is optimal.