Differential Dynamic Programming (DDP) is a popular optimal control algorithm, introduced by Jacobson all the way back in the 1970s. It has recently made a resurgence following the work of Tassa and Todorov1.

The discrete-time optimal control problem obtained by classical transcription of a continuous-time problem is the following nonlinear program (NLP): $$ \newcommand{\bfx}{\bfm{x}} \newcommand{\bfu}{\bfm{u}} \begin{equation} \begin{aligned} &\min_{\bfx,\bfu}\ J(\bfx, \bfu) = \sum_{t=0}^{T-1} \ell(x_t, u_t) + h(x_T) \\ &\suchthat\ x_{t+1} = f(x_t, u_t), \end{aligned} \tag{P} \label{eq:OCP} \end{equation} $$ where \( \ell \) and \( h \) are the running and terminal costs, and \( f \) the system dynamics.
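To make this concrete, here is a minimal sketch of such a transcription for a hypothetical double integrator; all names (`f`, `ell`, `h`, the time step, the cost weights) are illustrative assumptions, not part of the algorithm itself.

```python
# A minimal, hypothetical instance of problem (P): a double integrator.
import numpy as np

DT = 0.05   # discretization time step (assumed)
T = 50      # horizon length (assumed)

def f(x, u):
    """Discrete dynamics x_{t+1} = f(x_t, u_t), here an explicit Euler step."""
    pos, vel = x
    return np.array([pos + DT * vel, vel + DT * u[0]])

def ell(x, u):
    """Running cost: quadratic penalties on state and control."""
    return 0.5 * (x @ x + 1e-2 * float(u @ u))

def h(x):
    """Terminal cost: a stronger quadratic penalty on the final state."""
    return 5.0 * x @ x
```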

Several methods exist for solving this problem. The obvious one is to use a general-purpose NLP solver such as Ipopt, which yields a numerical solution once the problem's objective, constraints, and their derivatives are plugged in. Providing the NLP solver with the sparsity pattern of the constraints significantly improves execution time. DDP, however, is a solver specific to OCPs of the form \eqref{eq:OCP} and naturally exploits the problem's sparsity.

If you haven't yet, I recommend you read the first post on LQR to familiarize yourself with the notations used.

Algorithm

Backward pass

The OCP admits the following dynamic programming equation, a nonlinear counterpart of the LQR problem's dynamic programming principle: $$ \begin{equation} V_t(x) = \min_u\ \ell(x, u) + V_{t+1}(f(x,u)). \end{equation} $$ To solve this equation, DDP computes a Newton step in the control variable \( u \) for a given value \( x + \delta x \) of the previous stage's state \( x_t \); this allows us to approximate the Newton step \( \delta u \) as a function of \( \delta x \) using zeroth- and first-order sensitivities.

The Hamiltonian \( Q(x, u) \) has the following partial derivatives: $$ \begin{equation} \begin{aligned} q &= \ell(x,u) + V'(f(x,u)) \\ Q_x &= \ell_x + f_x^\top V'_x \\ Q_u &= \ell_u + f_u^\top V'_x \\ Q_{xx} &= \ell_{xx} + f_x^\top V'_{xx} f_x + V'_x \cdot f_{xx} \\ Q_{xu} &= \ell_{xu} + f_x^\top V'_{xx} f_u + V'_x \cdot f_{xu} \\ Q_{uu} &= \ell_{uu} + f_u^\top V'_{xx} f_u + V'_x \cdot f_{uu} \end{aligned} \end{equation} $$ where \( V' \) denotes the next-stage value function \( V_{t+1} \) and its derivatives are evaluated at \( f(x, u) \).
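Continuing the sketch above, these derivatives can be assembled as follows; the function name and argument layout are assumptions. Note that dropping the second-order dynamics terms \( V'_x \cdot f_{xx} \), \( V'_x \cdot f_{xu} \), \( V'_x \cdot f_{uu} \) gives the well-known Gauss-Newton variant, iLQR.

```python
def q_derivatives(lx, lu, lxx, lxu, luu, fx, fu, Vx, Vxx,
                  fxx=None, fxu=None, fuu=None):
    """Assemble the partial derivatives of Q from those of ell, f, and V'.

    fxx, fxu, fuu are rank-3 tensors stacking the Hessians of each
    component of f; leaving them out drops the V'_x terms (iLQR).
    """
    Qx = lx + fx.T @ Vx
    Qu = lu + fu.T @ Vx
    Qxx = lxx + fx.T @ Vxx @ fx
    Qxu = lxu + fx.T @ Vxx @ fu
    Quu = luu + fu.T @ Vxx @ fu
    if fxx is not None:
        # Contract V'_x against the dynamics Hessians (full DDP).
        Qxx = Qxx + np.tensordot(Vx, fxx, axes=1)
        Qxu = Qxu + np.tensordot(Vx, fxu, axes=1)
        Quu = Quu + np.tensordot(Vx, fuu, axes=1)
    return Qx, Qu, Qxx, Qxu, Quu
```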

The Newton step for the dynamic programming equation is obtained by $$ \begin{equation} \delta u = -Q_{uu}^{-1}Q_u - Q_{uu}^{-1}Q_{ux}\delta x = k + \bfm{K}\delta x \end{equation} $$ where the vector \( k \) and matrix \( \bfm{K} \in \RR^{n_u\times n_x} \) are called the feedforward and feedback gains, respectively. They are precisely the sensitivities of the Newton step \( \delta u \) with respect to a perturbation of the state \( x \).
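In code, both gains can be obtained from a single factorization of \( Q_{uu} \). This is a sketch; the small regularization term is an assumption to keep the factorization well-posed when \( Q_{uu} \) is nearly singular.

```python
import scipy.linalg

def gains(Qu, Qxu, Quu, reg=1e-9):
    """Feedforward gain k and feedback gain K from one factorization of Quu."""
    # The regularization is an assumed safeguard, keeping Quu positive definite.
    chol = scipy.linalg.cho_factor(Quu + reg * np.eye(Quu.shape[0]))
    k = -scipy.linalg.cho_solve(chol, Qu)
    K = -scipy.linalg.cho_solve(chol, Qxu.T)   # note Q_ux = Q_xu^T
    return k, K
```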

The derivatives of the value function around \( x \) along the new trajectory are given by $$ \begin{equation} \begin{aligned} v &= q - \frac{1}{2}k^\top Q_{uu}k \\ V_x &= Q_x + Q_{xu}k \\ V_{xx} &= Q_{xx} + Q_{xu}\bfm{K}. \end{aligned} \end{equation} $$ This gives a local quadratic model of \( V_t(x + \delta x) \).
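Putting these pieces together, a full backward pass might look like the following sketch; the derivative callables `ell_derivs`, `f_derivs`, and `h_derivs` are assumed to exist and return the arrays used above.

```python
def backward_pass(xs, us, ell_derivs, f_derivs, h_derivs):
    """One DDP backward pass over a trajectory (xs has T+1 points, us has T)."""
    Vx, Vxx = h_derivs(xs[-1])      # quadratic model of the terminal cost
    ks, Ks, Qus, Quus = [], [], [], []
    for t in reversed(range(len(us))):
        lx, lu, lxx, lxu, luu = ell_derivs(xs[t], us[t])
        fx, fu, fxx, fxu, fuu = f_derivs(xs[t], us[t])
        Qx, Qu, Qxx, Qxu, Quu = q_derivatives(
            lx, lu, lxx, lxu, luu, fx, fu, Vx, Vxx, fxx, fxu, fuu)
        k, K = gains(Qu, Qxu, Quu)
        Vx = Qx + Qxu @ k           # local quadratic model of V_t
        Vxx = Qxx + Qxu @ K
        ks.append(k); Ks.append(K); Qus.append(Qu); Quus.append(Quu)
    # Reverse the lists so that index t matches stage t.
    return ks[::-1], Ks[::-1], Qus[::-1], Quus[::-1]
```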

Forward pass

The forward pass is obtained by a so-called nonlinear rollout: let \( x_0^+ = x_0 \) and $$ \begin{equation} \begin{aligned} u_t^+ &= u_t + \alpha k_t + \bfm{K}_t(x_t^+ - x_t) \\ x_{t+1}^+ &= f(x_t^+, u_t^+), \end{aligned} \end{equation} $$ where \( \alpha \in (0, 1] \) is a line-search parameter. This nonlinear rollout is where the DDP algorithm diverges completely from how an SQP method would handle the OCP.
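Continuing the sketch, the rollout is a straightforward loop, reusing the hypothetical dynamics `f` from the setup above:

```python
def rollout(xs, us, ks, Ks, alpha):
    """Nonlinear rollout of the new trajectory under the computed gains."""
    xs_new, us_new = [xs[0]], []
    for t in range(len(us)):
        u = us[t] + alpha * ks[t] + Ks[t] @ (xs_new[t] - xs[t])
        us_new.append(u)
        xs_new.append(f(xs_new[t], u))   # hypothetical dynamics from above
    return xs_new, us_new
```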

The algorithm effectively treats the OCP as an unconstrained problem \( \min_{\bfu} \mathcal{J}(\bfu) \), in which the state variables \( \bfx \) are eliminated by unrolling the dynamics recurrence.

To find the right \( \alpha \), a backtracking line search is used: denoting by \( \phi(\alpha) = \mathcal{J}(\bfu^+(\alpha)) \) the value of the objective for parameter \( \alpha \), we look for the largest \( \alpha \) satisfying $$ \begin{equation} \phi(\alpha) - \phi(0) \leq \Delta_1 \alpha + \frac{1}{2}\Delta_2\alpha^2. \end{equation} $$ The coefficients \( \Delta_1 \) and \( \Delta_2 \) are the first- and second-order directional derivatives of \( \mathcal{J}(\bfu) \) in the search direction obtained from the backward pass: $$ \begin{equation} \Delta_1 = \sum_{t=0}^{T-1} Q_{u,t}^\top k_t,\quad \Delta_2 = \sum_{t=0}^{T-1} k_t^\top Q_{uu,t} k_t. \end{equation} $$
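A sketch of this procedure, continuing the example above (the shrink factor and minimum step are assumed values):

```python
def total_cost(xs, us):
    """Objective J evaluated along a rollout."""
    return sum(ell(x, u) for x, u in zip(xs, us)) + h(xs[-1])

def line_search(xs, us, ks, Ks, Qus, Quus, shrink=0.5, alpha_min=1e-6):
    """Backtracking search for the largest accepted step length alpha."""
    d1 = sum(float(Qu @ k) for Qu, k in zip(Qus, ks))        # Delta_1
    d2 = sum(float(k @ Quu @ k) for k, Quu in zip(ks, Quus)) # Delta_2
    phi0 = total_cost(xs, us)
    alpha = 1.0
    while alpha > alpha_min:
        xs_new, us_new = rollout(xs, us, ks, Ks, alpha)
        if total_cost(xs_new, us_new) - phi0 <= alpha * d1 + 0.5 * alpha**2 * d2:
            return xs_new, us_new, alpha
        alpha *= shrink
    return xs, us, 0.0   # no acceptable step; keep the current trajectory
```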

Caveats

The algorithm has a few drawbacks. The transcription from the continuous-time problem requires explicit-form discrete dynamics, and several explicit time-stepping integrators for ODEs and DAEs are known to perform poorly or to handle constrained dynamics badly. Also, every iterate of the algorithm produces a dynamically feasible trajectory, which is not always desirable. This latter aspect has been tackled in the recent literature, for instance2.

Also, standard DDP cannot enforce operational constraints on the controlled trajectory, whether equality or inequality constraints; for instance, requiring that the terminal state \( x_T \) be equal to some target value \( \bar{x} \).


1

Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2012, pp. 4906–4913. doi: 10.1109/IROS.2012.6386025.

2

C. Mastalli et al., “Crocoddyl: An Efficient and Versatile Framework for Multi-Contact Optimal Control,” Mar. 2020. arXiv:1909.04947 [cs, math]. Available: http://arxiv.org/abs/1909.04947