kernel_control_bwd

Backward in time stochastic optimal control.

The backward in time (dynamic programming) stochastic optimal control algorithm computes the control actions working backward in time from the terminal time step to the current time step. It computes a sequence of “value” functions, and then as the system evolves forward in time, it chooses a control action that optimizes the value function, rather than the actual cost.

The policy is specified as a sequence of stochastic kernels \(\pi = \lbrace \pi_{0}, \pi_{1}, \ldots, \pi_{N-1} \rbrace\). Typically, the cost is formulated as an additive cost structure, where at each time step the system incurs a stage cost \(g_{t}(x_{t}, u_{t})\), and at the final time step, it incurs a terminal cost \(g_{N}(x_{N})\).

\[\min_{\pi} \quad \mathbb{E} \biggl[ g_{N}(x_{N}) + \sum_{t=0}^{N-1} g_{t}(x_{t}, u_{t}) \biggr]\]
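To make the additive cost concrete, the following is a minimal sketch that estimates the expectation above by Monte Carlo for one fixed policy. The scalar dynamics, the quadratic costs, the feedback policy, and all function names are hypothetical choices made for illustration only; none of them come from gym_socks.

```python
import numpy as np


def rollout_cost(policy, dynamics, stage_cost, terminal_cost, x0, N, rng):
    """Accumulate g_N(x_N) + sum_{t=0}^{N-1} g_t(x_t, u_t) along one sampled trajectory."""
    x, total = x0, 0.0
    for t in range(N):
        u = policy(t, x)              # control action chosen at time t
        total += stage_cost(t, x, u)  # stage cost g_t(x_t, u_t)
        x = dynamics(x, u, rng)       # sample the next state
    return total + terminal_cost(x)   # add terminal cost g_N(x_N)


# Hypothetical toy system: a scalar integrator with Gaussian noise.
def dynamics(x, u, rng):
    return x + u + 0.1 * rng.standard_normal()


def stage_cost(t, x, u):
    return x**2 + 0.1 * u**2


def terminal_cost(x):
    return x**2


def policy(t, x):
    return -0.5 * x  # simple proportional feedback, assumed for illustration


rng = np.random.default_rng(0)
costs = [
    rollout_cost(policy, dynamics, stage_cost, terminal_cost, 1.0, 10, rng)
    for _ in range(2000)
]
expected_cost = float(np.mean(costs))  # Monte Carlo estimate of the objective
```

The optimal control problem searches over policies for the one that minimizes this expected value; the sketch only evaluates a single candidate.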

In dynamic programming, we solve the problem recursively, one time step at a time. We do this by defining a sequence of value functions \(V_{0}, \ldots, V_{N}\) that describe the cost-to-go from each time step, starting with the terminal cost \(V_{N}(x) = g_{N}(x)\), and then substituting each subsequent value function into the preceding one. The policy is then chosen to minimize the value functions (or to maximize them, under the reward convention common in RL).

\[V_{t}(x) = \min_{\pi_{t}} \quad \int_{\mathcal{U}} \int_{\mathcal{X}} \bigl[ g_{t}(x, u) + V_{t+1}(y) \bigr] Q(\mathrm{d} y \mid x, u) \pi_{t}(\mathrm{d} u \mid x)\]
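For intuition, the recursion can be written out exactly when the state and action spaces are finite. The sketch below uses the cost-minimizing convention of the problem statement and assumes the transition probabilities are known, which is not the setting of this module (it works from samples). The array layout and the name `backward_dp` are illustrative assumptions. In the finite case the minimum over stochastic kernels \(\pi_{t}\) is attained by a deterministic choice of action, so the sketch takes an argmin over actions.

```python
import numpy as np


def backward_dp(P, g, g_N):
    """Tabular backward recursion (illustrative, not the gym_socks implementation).

    P[u, x, y] : probability of moving from state x to state y under action u.
    g[t, x, u] : stage cost at time t.
    g_N[x]     : terminal cost.
    Returns the value functions V (shape (N + 1, n_states)) and a
    deterministic policy (shape (N, n_states)) of action indices.
    """
    N, n_states = g.shape[0], g_N.shape[0]
    V = np.empty((N + 1, n_states))
    policy = np.empty((N, n_states), dtype=int)
    V[N] = g_N  # V_N(x) = g_N(x)
    for t in range(N - 1, -1, -1):
        # cost[x, u] = g_t(x, u) + sum_y P(y | x, u) * V_{t+1}(y)
        cost = g[t] + np.einsum("uxy,y->xu", P, V[t + 1])
        policy[t] = np.argmin(cost, axis=1)
        V[t] = cost.min(axis=1)
    return V, policy


# Two states, two actions: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
# Being in state 1 costs 1 per step; switching costs an extra 0.1.
g = np.tile(np.array([[0.0, 0.1], [1.0, 1.1]]), (2, 1, 1))  # horizon N = 2
g_N = np.array([0.0, 1.0])

V, policy = backward_dp(P, g, g_N)
```

Here the recursion finds that switching out of state 1 immediately is cheaper than staying, even though switching has the higher stage cost.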

Note

See examples.benchmark_tracking_problem for a complete example.

class gym_socks.algorithms.control.kernel_control_bwd.KernelControlBwd(time_horizon=None, cost_fn=None, constraint_fn=None, heuristic=False, regularization_param=None, kernel_fn=None, batch_size=None, verbose=True, *args, **kwargs)

Stochastic optimal control policy backward in time.

Computes the optimal policy using dynamic programming. The solution computes an approximation of the value functions, starting at the terminal time and working backward. Then, when the policy is evaluated, it moves forward in time and, at each step, chooses the action that optimizes the approximate value function for that step.

Parameters
  • S – Sample taken iid from the system evolution. Supplied via train().

  • A – Collection of admissible control actions. Supplied via train().

  • cost_fn – The cost function. Should return a real value.

  • constraint_fn – The constraint function. Should return a real value.

  • heuristic (bool) – Whether to use the heuristic solution instead of solving the LP.

  • regularization_param (float) – Regularization parameter for the regularized least-squares problem used to construct the approximation.

  • kernel_fn – The kernel function used by the algorithm.

  • verbose (bool) – Whether the algorithm should print verbose output.

  • time_horizon (int) – The number of time steps \(N\) in the control problem.

  • batch_size (int) –
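The regularization_param and kernel_fn parameters refer to a regularized least-squares problem over a kernel. As a rough illustration of that idea, the sketch below estimates the conditional expectation of a next-step value, \(\mathbb{E}[V_{t+1}(y) \mid x, u]\), from sampled transitions using a standard kernel ridge estimate. The Gaussian kernel, its bandwidth, the scaling of the regularizer by the sample size, and the helper names are all assumptions made for this example; they are not confirmed to match what gym_socks does internally.

```python
import numpy as np


def rbf_kernel(X, Z, sigma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Z."""
    sq_dist = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def conditional_expectation(XU, f_Y, query, regularization_param, kernel_fn):
    """Kernel ridge estimate of E[f(y) | x, u] at each row of `query`.

    XU    : (M, d) array of sampled state-action pairs.
    f_Y   : (M,) array of f evaluated at the sampled next states.
    query : (Q, d) array of state-action pairs to evaluate at.
    """
    M = XU.shape[0]
    G = kernel_fn(XU, XU)     # Gram matrix of the sample
    K = kernel_fn(XU, query)  # kernel between sample and query points
    # Solve (G + lambda * M * I) beta = K instead of forming an explicit inverse.
    beta = np.linalg.solve(G + regularization_param * M * np.eye(M), K)
    return f_Y @ beta


# Hypothetical sample: next state y = 0.5 * x + u + noise.
rng = np.random.default_rng(0)
M = 300
XU = rng.uniform(-1.0, 1.0, size=(M, 2))
Y = 0.5 * XU[:, 0] + XU[:, 1] + 0.05 * rng.standard_normal(M)

V_next = Y**2                   # stand-in for V_{t+1} evaluated at the sampled y
query = np.array([[0.4, 0.2]])  # state x = 0.4, action u = 0.2
estimate = conditional_expectation(XU, V_next, query, 1e-4, rbf_kernel)[0]
```

For this toy system the true conditional expectation at the query point is \((0.5 \cdot 0.4 + 0.2)^{2} + 0.05^{2} = 0.1625\), which gives a reference value to compare the estimate against. A larger regularization value smooths the estimate at the cost of more bias.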

__call__(time=0, state=None, *args, **kwargs)

Evaluate the policy.

Returns

An action in the action space.

train(S, A)

Train the algorithm.

Parameters
  • S (numpy.ndarray) – Sample taken iid from the system evolution.

  • A (numpy.ndarray) – Collection of admissible control actions.

Returns

An instance of the KernelControlBwd algorithm class.

gym_socks.algorithms.control.kernel_control_bwd.kernel_control_bwd(S, A, time_horizon=None, cost_fn=None, constraint_fn=None, heuristic=False, regularization_param=None, kernel_fn=None, batch_size=None, verbose=True)

Stochastic optimal control policy backward in time.

Computes the optimal policy using dynamic programming. The solution computes an approximation of the value functions, starting at the terminal time and working backward. Then, when the policy is evaluated, it moves forward in time and, at each step, chooses the action that optimizes the approximate value function for that step.

Parameters
  • S (numpy.ndarray) – Sample taken iid from the system evolution.

  • A (numpy.ndarray) – Collection of admissible control actions.

  • cost_fn – The cost function. Should return a real value.

  • constraint_fn – The constraint function. Should return a real value.

  • heuristic (bool) – Whether to use the heuristic solution instead of solving the LP.

  • regularization_param (Optional[float]) – Regularization parameter for the regularized least-squares problem used to construct the approximation.

  • kernel_fn – The kernel function used by the algorithm.

  • verbose (bool) – Whether the algorithm should print verbose output.

  • time_horizon (Optional[int]) – The number of time steps \(N\) in the control problem.

  • batch_size (Optional[int]) –

TO21

Adam J. Thorpe and Meeko M. K. Oishi. Stochastic optimal control via Hilbert space embeddings of distributions. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 904–911. 2021. doi:10.1109/CDC45484.2021.9682801.

@inproceedings{thorpe2021stochastic,
    author    = {Thorpe, Adam J. and Oishi, Meeko M. K.},
    booktitle = {2021 60th IEEE Conference on Decision and Control (CDC)},
    title     = {Stochastic Optimal Control via Hilbert Space Embeddings of Distributions},
    year      = {2021},
    pages     = {904--911},
    doi       = {10.1109/CDC45484.2021.9682801}
}