somewhere near the beginning.

Dynamic programming, the basic idea

Filed under: Mathematics — Alex @ 3:58 pm 3/19/2007

A while ago, AnonEcon used dynamic programming to solve a problem I posted; until that point, I’d only seen dynamic programming mentioned as a heuristic method for approaching some machine learning or computational geometry type questions, so I hadn’t had too much reason to find out more about it. But he demonstrated that it was more than just an algorithm, so I added it to my list of things it might be beneficial to know.

I picked up a copy of Bertsekas’ Dynamic Programming and Optimal Control book yesterday; so far it’s an enthralling book. The first introduction to dynamic programming I’ve seen that makes it clear what dynamic programming is. And it does look powerfully useful. I’m about half way through Chapter 1, which I hope to finish up later today and then review. I’m also going to attempt some of the problems– they look like fun too. (All the results and notations in this post are taken almost verbatim from Chapter 1 of Bertsekas’ book).

Like all other forms of mathematical programming, dynamic programming deals with optimization. Its goal is to balance the long-term cost of the aggregate decisions you made with the short-term cost of each individual decisions– you’d like to make decisions that make the optimum balance between short term and long term costs. This is particularly interesting when you just have a finite number of decisions to make, because in this case it seems like there would be some point where you would switch strategies. For example, if you were near the start of your process you’d be more worried about long term than short term costs, but at the last stage, you’d be worried only about short term costs. How do you know when to switch strategies?

Specifically, dynamic programming deals with the optimization of the expected cost of discrete-time dynamic systems with additive cost functions. This dynamic system determines the next state x_{k+1} of a system based upon the current state of the system x_k, a control u_k selected by the operator at each stage, and a random parameter w_k:

 x_{k+1} = f_k(x_k, u_k, w_k).

In the finite horizon case, k varies between 0,1,\ldots,N-1 and in the infinite horizon case k, the number of decision-making stages, is inifinite. There is a cost function which determines the cost of selecting a particular control at each step: g_k(x_k, u_k, w_k), and in the finite horizon case, a terminal cost function g_N(x_N) which determines the cost of ending in a given state.

A policy, or control law, \pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\} is a series of strategies \mu_k for selecting u_k based on x_k. The goal is to choose \pi such that it minimizes the expected value of the total cost J_{\pi}(x_0):

 \pi^* = \argmin_{\pi}J_{\pi}(x_0), \quad J_{\pi}(x_0) := \mathbb{E}\left[g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)\right]

where the expectation is taken w.r.t the joint distribution of the r.v.s x_k,u_k,w_k.

This is of course a daunting problem in general, but the creator of dynamic programming, Richard Bellman (incidentally you might think his DP book would be a good one to learn from, but it is definitely not), formulated the following rather obvious principle of optimality

Let \pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} be an optimal policy for the problem, and assume that when using \pi^*, a given state x_i occurs at time i with positive probability. Consider the subproblem whereby we are at x_i at time i and wish to minimize the “cost-to-go” from time i to time N

\mathbb{E}\left[g_N(X_N) + \sum_{k=i}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right].

Then the truncated policy \{\mu_i^*, \mu_{i+1}^*, \ldots, \mu_{N-1}^*\} is optimal for this subproblem.

which suggests the following proposition/algorithm:

For every initial state x_0, the optimal cost J^*(x_0) of the basic problem is equal to J_0(x_0) given by the last step of the following algorithm that proceeds backward in time from period N-1 to period 0:

 J_N(x_N) = G_N(x_N)

 J_k(x_k) = \min_{u_k} \mathbb{E}_{w_k}\left[g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k))\right], \quad k=0,1,\ldots,N-1,

where the expectation is taken with respect to the probability distribution of w_k, which depends on x_k and u_k. Furthermore, if u_k^* = \mu_k^*(x_k) minimizes the right hand side of the second equation above for each x_k and k, the policy \pi^* = \{\mu_0^*, \ldots, \mu_{N-1}^*\} is optimal.

That’s the gist of it. If you can’t follow the above procedure analytically, then you can do it numerically. Pretty neat. At some point I’ll put up examples of it in use.

Possibly relevant posts:

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment