Tanay Biradar

🌳 Understanding the Optimal Value Function in LQR MDPs

I'm currently taking an introduction to robot learning class, and I got a little confused by this derivation. These are some notes I made while attempting to understand it better. Note that this won't be like some of my explainer blog posts.

The reference material comes from the Winter 2025 CS291I lecture notes by Prof. James Preiss at UC Santa Barbara. None of these insights or formulas are my own; I am just rehashing them in a way that I understand well.

Background

We will first refresh ourselves on the standard formulation of Markov Decision Processes: given an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, c, P, \mu_0)$, we want to find the optimal policy $\pi^\star$ that minimizes cost (or, equivalently, maximizes reward).

We will focus on minimizing cost in this example.

First, remember that the value function $V^\pi$ maps states to expected costs ($V^\pi: \mathcal{S} \to \mathbb{R}$). The value function gives the expected cost (accumulated over the rest of the trajectory/episode) if you always act according to your policy $\pi$ after starting at a state $s$.

The exact formula for the value function depends on the setting. As an example, in the infinite-horizon discounted setting:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot | s)} \left[ c(s,a) + \gamma \mathbb{E}_{s' \sim P(\cdot | s, a)} \left[V^\pi(s')\right] \right] \tag{1}$$

If we break the above equation down, we:

  1. Sample an action according to our policy ($a \sim \pi(\cdot | s)$)
  2. Transition to the next state $s'$ according to the transition probability ($s' \sim P(\cdot | s, a)$)
  3. Then we recursively get the expected value from the next state ($V^\pi(s')$)
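
To make those three steps concrete, here is a minimal Monte Carlo sketch of estimating $V^\pi(s)$ in the infinite-horizon discounted setting, written in Python. The `policy`, `transition`, and `cost` callables are hypothetical stand-ins for an MDP interface (nothing in this post defines them), and the infinite sum is truncated at a finite horizon, which is a reasonable approximation when $\gamma < 1$.

```python
import numpy as np

def estimate_value(s0, policy, transition, cost, gamma=0.99,
                   horizon=200, num_rollouts=100):
    """Monte Carlo estimate of V^pi(s0): average discounted cost over rollouts."""
    totals = []
    for _ in range(num_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)               # 1. sample a ~ pi(. | s)
            s_next = transition(s, a)   # 2. sample s' ~ P(. | s, a)
            total += discount * cost(s, a)
            discount *= gamma
            s = s_next                  # 3. continue the recursion from s'
        totals.append(total)
    return np.mean(totals)
```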

Next, we have the optimal value function $V^\star$. For any state $s$, $V^\star(s)$ gives the best possible value (lowest expected cost) that any policy could achieve starting from $s$. Formally,

$$V^\star(s) = \min_\pi V^{\pi}(s) \tag{2}$$

We can find the optimal policy $\pi^\star$ (the one that minimizes expected future cost) by first computing $V^\star$ and then "acting greedily" according to $V^\star$.

$$\pi^\star(s) = \arg\min_a \left[ c(s,a) + \gamma\mathbb{E}_{s' \sim P(\cdot | s, a)} \left[V^\star(s')\right] \right] \tag{3}$$

Problem Statement

If it's easy to find the optimal policy $\pi^\star$ once we have $V^\star$, then the question becomes: how do we find the optimal value function $V^\star$ for the MDP?

For MDPs with discrete state spaces, we can use algorithms like value iteration to find $V^\star$.
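
As a refresher, here is a minimal tabular value-iteration sketch (a cost-minimizing variant, to match this post's convention). The array shapes are made up for illustration: `costs[s, a]` holds $c(s,a)$ and `P[s, a, s']` holds the transition probabilities. At the end it also extracts the greedy policy from equation $(3)$.

```python
import numpy as np

def value_iteration(costs, P, gamma=0.95, tol=1e-8):
    """costs: (S, A) array of c(s, a); P: (S, A, S) transition probabilities."""
    S, A = costs.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Qsa[s, a] = c(s, a) + gamma * E[V(s')]
        Qsa = costs + gamma * (P @ V)    # shape (S, A)
        V_new = Qsa.min(axis=1)          # minimize over actions
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Qsa.argmin(axis=1)          # act greedily with respect to V*
    return V, policy
```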

But what if we have a continuous state space (like in a robotics application)? In that case, we can't hope to tabulate $V^\star(s)$ when there are uncountably many possibilities for $s \in \mathcal{S}$.

In one special case, however, the Linear Quadratic Regulator (LQR) MDP, we do have a nice way to compute the optimal value function. We'll work through the method for computing $V^\star(s)$ in the finite-horizon, discrete-time case.

Setup: $V^\star$ in the LQR MDP

First, let's define the deterministic state transition dynamics and the cost function:

$$ s_{t+1} = As_t + Ba_t \quad c(s, a) = s^TQs + a^TRa \tag{4} $$

We are also given that $Q$ and $R$ are positive semi-definite (PSD).
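
To make the setup concrete, here is a tiny sketch of these dynamics and this cost in NumPy. The specific $A$, $B$, $Q$, $R$ below are made up for illustration (a 1-D double integrator with a 0.1 s timestep); the derivation doesn't depend on any particular system.

```python
import numpy as np

# Made-up example system: state = (position, velocity), action = acceleration.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)          # state cost matrix, PSD
R = 0.1 * np.eye(1)    # action cost matrix, PSD

def step(s, a):
    """Deterministic dynamics: s_{t+1} = A s_t + B a_t."""
    return A @ s + B @ a

def cost(s, a):
    """Quadratic cost: c(s, a) = s^T Q s + a^T R a."""
    return s @ Q @ s + a @ R @ a
```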

Note: Time-Dependent Value Functions

Unlike the infinite-horizon setting, the optimal value function in the finite-horizon setting depends on time. So we can't use the same infinite-horizon value function in equation $(1)$.

Why? Intuitively, the best action depends on how many timesteps remain. For example, with many steps left we might keep transitioning to some low-cost state $s_1$, but at the very last time step it can be better to jump to a different state $s_2$ instead.

LQR Value Function

We can formulate our value function for this LQR by modifying $(1)$. Here the policy is deterministic, and so are the state transitions. Then, with $a = \pi(s)$ and $s' = P(s, a)$,

$$V_{t+1}^\pi(s) = c(s,a) + V_t^\pi(s')\tag{5}$$ $$V_{t+1}^\star(s) = \min_a \left[ c(s,a) + V_t^\star(s') \right] \tag{6}$$

Since we're in a finite-horizon setting, we also drop the discount factor $\gamma$. Remember that the value function now depends on time, so we write $V_t^\pi$ for the value function at time $t$.

Dynamic Programming: Method for Computing $V^\star$ in the LQR MDP

We can now compute $V^\star$ in an inductive, dynamic-programming manner.

This is our plan:

  1. Base Case: Compute $V_H^\star$ (the optimal value function at the final timestep $H$)
  2. Inductive Step: Given $V_{t+1}^\star$, derive $V_t^\star$. In other words, work backwards to get $V^\star$ at earlier timesteps.

Base Case

In the final timestep $t=H$, we know that $c(s, a) = s^TQs + a^TRa$. But since we're in the final timestep, there is no action to be taken, so $a=0$. Thus,

$$ V_H^\star(s) = s^TQs $$

(Since $Q$ is PSD, this value function is non-negative—which passes the sanity check.)

Inductive Step

Now, we want to derive $V_t^\star$ (for the "previous" timestep) given $V^\star_{t+1}$. Since we're working backwards in time, the time subscripts are flipped relative to $(6)$.

$$V_{t}^\star(s) = \min_a \left[ c(s,a) + V_{t+1}^\star(s_{t+1}) \right] \tag{7}$$

Let's make the inductive hypothesis that $V^\star_{t+1}(s_{t+1}) = s_{t+1}^T P_{t+1} s_{t+1}$ for some PSD matrix $P_{t+1}$.

What's the motivation for choosing this inductive hypothesis? We assume by induction that $V^\star$ for "later timesteps" has already been computed. This means that $V^\star$ at those timesteps has already been minimized over all possible actions, so it can be written as a function of the state only. We choose a quadratic function parameterized by a PSD matrix $P_{t+1}$ because the base case already has this form (with $P_H = Q$), and, as we'll see, it makes the minimization over actions tractable.

Plugging in $c$ and $V^\star_{t+1}$, then, we get

$$V_{t}^\star(s) = \min_a \left[ s^TQs + a^TRa + s_{t+1}^T P_{t+1} s_{t+1} \right], \qquad s_{t+1} = As + Ba \tag{8}$$

Since $Q$, $R$, and $P_{t+1}$ are all positive semi-definite, the expression inside the $\min$ is a convex quadratic function of $a$. Thus, taking its gradient with respect to $a$ and setting it to $0$ yields an optimal $a$ (the minimizer is unique whenever $R + B^T P_{t+1} B$ is invertible, e.g. when $R$ is positive definite).
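
Carrying out that minimization completes the induction and recovers the standard finite-horizon LQR result: setting the gradient to zero gives a linear optimal action $a^\star = -K_t s$ with $K_t = (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A$, and plugging $a^\star$ back in shows that $V_t^\star(s) = s^T P_t s$ is again quadratic, where

$$P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B \left(R + B^T P_{t+1} B\right)^{-1} B^T P_{t+1} A$$

Below is a minimal NumPy sketch of the resulting backward dynamic program (intended for matrices like the made-up $A$, $B$, $Q$, $R$ from the earlier snippet; `np.linalg.solve` assumes $R + B^T P_{t+1} B$ is invertible).

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, H):
    """Backward DP for finite-horizon LQR.

    Returns P_t (so V_t*(s) = s^T P_t s) for t = 0..H and gains K_t
    (so the optimal action at time t is a_t* = -K_t s) for t = 0..H-1.
    """
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = Q.copy()                      # base case: V_H*(s) = s^T Q s
    for t in range(H - 1, -1, -1):       # work backwards from t = H-1 to 0
        P_next = P[t + 1]
        # K_t = (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
        K[t] = np.linalg.solve(R + B.T @ P_next @ B, B.T @ P_next @ A)
        # Riccati recursion: P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B K_t
        P_t = Q + A.T @ P_next @ A - A.T @ P_next @ B @ K[t]
        P[t] = 0.5 * (P_t + P_t.T)       # symmetrize for numerical hygiene
    return P, K

# Example usage (with the toy A, B, Q, R defined earlier):
# P, K = lqr_backward_pass(A, B, Q, R, H=50)
# pi_t*(s) = -K[t] @ s, and s0 @ P[0] @ s0 is the optimal cost-to-go from s0.
```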

This is a nice example of a closed-form solution to a special type of MDP with an (uncountably) infinite state space!

2025-01-16