Mini-Course on

Machine Learning Methods in Finance @ KAUST

From Signatures to Reinforcement Learning

Christian Bayer, Christoph Belak, Blanka Horvath, Paul Hager$\ast$

$\ast$ Humboldt University of Berlin

  May 1, 2022

Outline of the Lecture

This lecture will be a practical guide to:

  1. calculating truncated log-signatures of piecewise linear paths,
  1. implementing deep signature stopping rules,
  1. solving the optimal stopping problem for fractional Brownian motion.

1. Calculation of Log-Signatures

Notation: $\{\color{blue}{e_1}, \dots, \color{blue}{e_d}\}$ will denote a basis of $\mathbb{R}^d$.

1.1 Recap of Signatures of Smooth Paths

Recall that the signature of a smooth path $x: [0,T] \to \mathbb{R}^d$ is given by the sequence of iterated integrals, collected as a formal tensor series

$$ \mathbb{x}^{<\infty}_{s,t} = 1 + \int_s^t dx_{t_1} + \int_s^t \int_s^{t_2} dx_{t_1} \otimes dx_{t_2} + \dots \quad \in T((\mathbb{R}^d)),\quad (0 \le s \le t \le T).$$

In other words, the component of the signature corresponding to the multiindex or word $\color{blue}{i_1} \cdots \color{blue}{i_n} \in \{\color{blue}{1}, ..., \color{blue}{d}\}^n$ is given by

$$\langle \color{blue}{i_1} \cdots \color{blue}{i_n}, \mathbb{x}^{<\infty}_{s,t}\rangle = \idotsint_{s \le t_1 \le \dots \le t_n \le t} dx^{\color{blue}{i_1}}_{t_1} \dots dx^{\color{blue}{i_n}}_{t_n} $$

Furthermore, we have also seen that the signature satisfies the Chen relation

$$ \mathbb{x}^{<\infty}_{s,t} = \mathbb{x}^{<\infty}_{s,u} \otimes \mathbb{x}^{<\infty}_{u,t},\qquad (0 \le s \le u \le t \le T)$$

and the shuffle identity

$$ \langle w, \mathbb{x}^{<\infty}_{s,t}\rangle \langle v, \mathbb{x}^{<\infty}_{s,t}\rangle = \langle w \sqcup\mathchoice{\mkern-7mu}{\mkern-7mu}{\mkern-3.2mu}{\mkern-3.8mu}\sqcup v, \mathbb{x}^{<\infty}_{s,t}\rangle, \qquad w, v \in \mathcal{W}_d.$$
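
For a single linear segment the iterated integrals can be written down explicitly (see Section 1.4 below), which gives a quick way to check the shuffle identity numerically. The following toy snippet (our own illustration, not part of the original notebook) verifies $\langle \color{blue}{1}\rangle\langle \color{blue}{2}\rangle = \langle \color{blue}{1}\color{blue}{2}\rangle + \langle \color{blue}{2}\color{blue}{1}\rangle$:

```python
from math import factorial

# Signature of a single linear segment x_t = t * delta on [0, 1]:
# the iterated integral <i_1 ... i_n> equals delta_{i_1} * ... * delta_{i_n} / n!.
def coef(word, delta):
    """Coefficient <word> of the signature of a linear segment with increment delta."""
    p = 1.0
    for letter in word:
        p *= delta[letter - 1]
    return p / factorial(len(word))

delta = (0.7, -1.3)

# shuffle identity: <1><2> = <1 shuffle 2> = <12> + <21>
assert abs(coef((1,), delta) * coef((2,), delta)
           - (coef((1, 2), delta) + coef((2, 1), delta))) < 1e-12
```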

The elements of the tensor algebra satisfying the shuffle relation form a group, the free Lie group $G(\mathbb{R}^d)$, and its tensor-logarithm $\mathfrak{g}(\mathbb{R}^d) = \log(G(\mathbb{R}^d))$ is the free Lie algebra.

1.2 The Free Lie algebra

The free Lie algebra can be characterized in another way.

Define the bilinear commutator bracket

$$[\cdot, \cdot]: T((\mathbb{R}^d)) \times T((\mathbb{R}^d)) \to T((\mathbb{R}^d)), \qquad [a, b] = a\otimes b - b \otimes a.$$

and note that it is antisymmetric, $[a, b] = -[b, a]$ (in particular $[a, a] = 0$), and satisfies the Jacobi identity $[a, [b, c]] + [b, [c, a]] + [c, [a, b]] = 0$.

The free Lie algebra $\mathfrak{g}(\mathbb{R}^d)$ is the smallest sub-algebra of $T((\mathbb{R}^d))$ containing $\mathbb{R}^d$ that is closed under the commutator bracket $[\cdot, \cdot]$.

In other words, it is the algebra generated by $\{\color{blue}{e_1}, \cdots \color{blue}{e_d}\}$ and $[\cdot, \cdot]$.

What does that mean for the signature?

There exists a sequence of coefficients of bracketing expressions such that

$$\mathbb{x}^{<\infty}_{0,t} = \exp( a_1 \color{blue}{e_1} + \dots + a_d \color{blue}{e_d} + a_{12}[\color{blue}{e_1}, \color{blue}{e_2}] + \dots + a_{(d-1)d}[\color{blue}{e_{d-1}}, \color{blue}{e_d}] \\+ a_{112}[\color{blue}{e_1},[\color{blue}{e_1}, \color{blue}{e_2}]] + \dots).$$

This raises two further questions: which bracketing expressions form a basis of the free Lie algebra, and how can the coefficients be computed in practice?

1.3 A Basis of the Free Lie Algebra

For simplicity let us assume that $d=2$ (the one-dimensional case yields no insight) and identify $\color{blue}{e_i} \leftrightarrow \color{blue}{i}$.

Let us look at the first bracketing terms:

  1. On the 1st level we have no brackets
    • $\color{blue}{1}$, $\color{blue}{2}$.

  1. On the 2nd level we obtain

    • $[\color{blue}{1}, \color{blue}{1}]$, $[\color{blue}{1}, \color{blue}{2}]$, $[\color{blue}{2}, \color{blue}{1}]$, $[\color{blue}{2}, \color{blue}{2}]$

      By definition we have $[\color{blue}{1}, \color{blue}{1}] = [\color{blue}{2}, \color{blue}{2}] = 0$ and $[\color{blue}{1}, \color{blue}{2}] = - [\color{blue}{2}, \color{blue}{1}]$.

      Hence we can single out one basis element $[\color{blue}{1}, \color{blue}{2}]$ for this level.

  1. On the 3rd level we obtain

    • $[\color{blue}{1}, [\color{blue}{1}, \color{blue}{2}]],\quad$  $[\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]],$

      where we have already dismissed the difference between left- and right bracketing. No further simplifications are possible.

  1. On the 4th level we obtain

    • $[\color{blue}{1}, [\color{blue}{1}, [\color{blue}{1}, \color{blue}{2}]]],\quad$ $[\color{blue}{1}, [\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]]]$
    • $[\color{blue}{2}, [\color{blue}{1}, [\color{blue}{1}, \color{blue}{2}]]],\quad$ $[\color{blue}{2}, [\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]]]$
    • $[[\color{blue}{1}, \color{blue}{2}], [\color{blue}{1}, \color{blue}{2}]]$

      The last term is clearly zero. Moreover, by the Jacobi identity we have $$0 = [[\color{blue}{1}, \color{blue}{2}], [\color{blue}{1}, \color{blue}{2}]] = - [\color{blue}{1}, [\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]]] - [\color{blue}{2}, [[\color{blue}{1}, \color{blue}{2}], \color{blue}{1}]] \\ \qquad\qquad\qquad\hspace{.4em} = - [\color{blue}{1}, [\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]]] + [\color{blue}{2}, [\color{blue}{1}, [\color{blue}{1}, \color{blue}{2}]]],$$ hence we only need one of the two terms on the right-hand side.

Hence, using the properties of the Lie bracket, we arrive at the following basis for $\mathfrak{g}^{4}(\mathbb{R}^2)$:

$$\color{blue}{1},\; \color{blue}{2}, \\ [\color{blue}{1}, \color{blue}{2}],\\ [\color{blue}{1}, [\color{blue}{1}, \color{blue}{2}]],\; [\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]],\\ [\color{blue}{1}, [\color{blue}{1}, [\color{blue}{1}, \color{blue}{2}]]],\; [\color{blue}{1}, [\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]]],\; [\color{blue}{2}, [\color{blue}{2}, [\color{blue}{1}, \color{blue}{2}]]]$$

yielding $\dim(\mathfrak{g}^{4}(\mathbb{R}^2)) = 8$.

This already hints at why it is much more efficient to store the log-signature instead of the signature itself.

Indeed, the level 4 truncated signature has dimension

$$\dim(T^4(\mathbb{R}^2))-1 = 2 + 2^2 + 2^3 + 2^4 = 30.$$

How can we systematize the above considerations?

One way to construct a basis of the free Lie algebra is through so-called Lyndon words.

Unfortunately, a detailed treatment of Lyndon words would go beyond the scope of this lecture.

However, we can look again at the $d=2$ case more systematically. The following are the first Lyndon words up to level 4:

$$\color{blue}{1},\; \color{blue}{2},\; \color{blue}{1}\color{blue}{2},\; \color{blue}{1}\color{blue}{1}\color{blue}{2},\; \color{blue}{1}\color{blue}{2}\color{blue}{2},\; \color{blue}{1}\color{blue}{1} \color{blue}{1}\color{blue}{2},\; \color{blue}{1}\color{blue}{1}\color{blue}{2}\color{blue}{2},\; \color{blue}{1}\color{blue}{2}\color{blue}{2}\color{blue}{2}.$$
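
A generator for these words can be written in a few lines using Duval's algorithm; the snippet below (function name ours, not from the original notebook) reproduces the eight words above:

```python
def lyndon_words(d, N):
    """Generate all Lyndon words over the alphabet {1, ..., d} of length <= N
    (Duval's algorithm), in lexicographic order."""
    words, w = [], [1]
    while w:
        words.append(tuple(w))
        # periodically extend w to length N ...
        w = [w[i % len(w)] for i in range(N)]
        # ... then drop trailing maximal letters and increment the last one
        while w and w[-1] == d:
            w.pop()
        if w:
            w[-1] += 1
    return words

print(["".join(map(str, w)) for w in lyndon_words(2, 4)])
# -> ['1', '1112', '112', '1122', '12', '122', '1222', '2']
```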

Why? Because each word comes before all of its proper suffixes in lexicographic order.

For $\color{blue}{1}$ and $\color{blue}{2}$ there are no proper suffixes.

Every Lyndon word $w$ of length greater than one has at least one proper suffix that is also a Lyndon word. Denote by $Ly(w)$ the longest such suffix.

Furthermore, one can show that for a Lyndon word $w$ with $w = v\, Ly(w)$, the prefix $v$ is also a Lyndon word.

One can define a map $\sigma$ from Lyndon words to the free Lie algebra by setting $\sigma(\color{blue}{i}) = \color{blue}{e_i}$ for single letters and $\sigma(w) = [\sigma(v), \sigma(Ly(w))]$ for $w = v\, Ly(w)$.

E.g. $\sigma(\color{blue}{1}\color{blue}{1}\color{blue}{2}\color{blue}{2}) = [\color{blue}{1}, \sigma(\color{blue}{1}\color{blue}{2}\color{blue}{2})] = [\color{blue}{1}, [\sigma(\color{blue}{1}\color{blue}{2}), \color{blue}{2}]] = [\color{blue}{1}, [[\color{blue}{1},\color{blue}{2}], \color{blue}{2}]]$.

One can show that the image of $\sigma$ forms a basis of the free Lie algebra.
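
This recursion is easy to implement directly; the sketch below (names ours, not from the original notebook) represents brackets as nested tuples and reproduces the example above:

```python
def is_lyndon(w):
    """A word is Lyndon iff it is strictly smaller than each of its proper suffixes."""
    return all(w < w[i:] for i in range(1, len(w)))

def sigma(w):
    """Standard bracketing of a Lyndon word w (a string of digits):
    sigma(letter) = letter, and sigma(w) = [sigma(v), sigma(Ly(w))] with
    w = v Ly(w), Ly(w) the longest proper suffix of w that is itself Lyndon."""
    assert is_lyndon(w)
    if len(w) == 1:
        return w
    # longest proper Lyndon suffix (smallest starting index)
    suffix = next(w[i:] for i in range(1, len(w)) if is_lyndon(w[i:]))
    return (sigma(w[:len(w) - len(suffix)]), sigma(suffix))

print(sigma("1122"))  # -> ('1', (('1', '2'), '2')), matching the example above
```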

In order to determine $\dim(\mathfrak{g}^{N}(\mathbb{R}^d))$ one can count the number of Lyndon words up to level $N$. Algebraically this is related to counting necklaces, and one obtains

$$\dim(\mathfrak{g}^{N}(\mathbb{R}^d)) = \sum_{n=1}^N \frac{1}{n} \sum_{k | n}\mu(n/k)\, d^k,$$

where the inner sum is taken over all divisors $k$ of $n$ and $\mu$ is the Möbius function.

Here is a comparison of $\eta_{d, N} := \dim(\mathfrak{g}^{N}(\mathbb{R}^d))$ with the dimension of the truncated signature, $\sigma_{d,N} := \dim(T^{N}(\mathbb{R}^d))$:
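
The comparison table from the original slides is not reproduced here, but both quantities are easy to compute; the following sketch (function names ours) evaluates the necklace formula with a trial-division Möbius function:

```python
def mobius(n):
    """Moebius function via trial-division factorization."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result

def dim_free_lie(d, N):
    """dim g^N(R^d) via the necklace-counting formula."""
    return sum(sum(mobius(n // k) * d**k for k in range(1, n + 1) if n % k == 0) // n
               for n in range(1, N + 1))

def dim_sig(d, N):
    """dim T^N(R^d) - 1 = d + d^2 + ... + d^N."""
    return sum(d**n for n in range(1, N + 1))

for d in (2, 3):
    for N in (2, 4, 6):
        print(f"d={d} N={N}:  log-sig {dim_free_lie(d, N):5d}   sig {dim_sig(d, N):5d}")
```

In particular this recovers $\dim(\mathfrak{g}^{4}(\mathbb{R}^2)) = 8$ versus $\dim(T^4(\mathbb{R}^2)) - 1 = 30$ from above.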

1.4 The Signature of Piecewise Linear Paths

Suppose that we are given a discrete time series $\{x_0, x_1, \dots, x_J\}\subset\mathbb{R}^d$. Denote by $x:[0,J]\to \mathbb{R}^d$ its piecewise linear interpolation and define $\Delta x_j = x_j - x_{j-1}$.

First note that, due to linearity, on each interval $[j-1,j]$ we have

$$\idotsint_{j-1 \le t_1 \le \dots \le t_n \le j} dx_{t_1} \otimes \dots \otimes dx_{t_n} = (\Delta x_j)^{\otimes n} \int_{\Delta_n} (dt)^n = \frac{1}{n!} (\Delta x_j)^{\otimes n}.$$

Hence the signature is given by

$$\mathbb{x}^{<\infty}_{j-1, j} = 1 + \sum_{n=1}^\infty \frac{1}{n!} (\Delta x_j)^{\otimes n} = \exp(\Delta x_j).$$

Then using Chen's identity, we have

$$\mathbb{x}^{<\infty}_{0, j} = \exp(\Delta x_1) \otimes \cdots \otimes \exp(\Delta x_j).$$

Hence in order to calculate the log-signature of a piecewise linear path we need to understand how to multiply two non-commutative exponentials.

Theorem (Baker-Campbell-Hausdorff + Dynkin formula)

Let $\mathbf{a}, \mathbf{b} \in \mathfrak{g}(\mathbb{R}^d)$ then it holds

$$\log(\exp(\mathbf{a})\otimes\exp(\mathbf{b})) = \sum_{n=1}^{\infty} \sum_{k=1}^{n} \sum_{\substack{s_i + r_i > 0\\s_1 + r_1 + \cdots + s_k + r_k = n}} \frac{(-1)^{k-1} [\mathbf{a}^{s_1} \mathbf{b}^{r_1} \cdots \mathbf{a}^{s_k} \mathbf{b}^{r_k}] }{k \cdot n \cdot s_1!\,r_1!\cdots s_k!\,r_k!}$$

where $[\mathbf{a}^{r}\mathbf{c}] = [\mathbf{a}, [\mathbf{a}, \cdots (r \text{ times}) \cdots [\mathbf{a}, \mathbf{c}]\dots]]$.

Up to the first three levels the BCH-formula in a Lie basis reads

$$\log(\exp(\mathbf{a})\otimes\exp(\mathbf{b})) = \mathbf{a} + \mathbf{b} +\frac{1}{2}[\mathbf{a}, \mathbf{b}] + \frac{1}{12}([\mathbf{a}, [\mathbf{a}, \mathbf{b}]] + [\mathbf{b}, [\mathbf{b}, \mathbf{a}]]) \, + \dots$$

Returning to the log-signature of piecewise linear paths...

Again we consider the case $d=2$ for simplicity and we truncate the log-signature at level 3.

Let us define coefficients in terms of the Lie basis

$$\log\mathbb{x}^{\le 3}_{0,t} =: \ell^{\le3}_t =: \ell^1_t\color{blue}{1} + \ell^2_t\color{blue}{2} + \ell^{12}_t[\color{blue}{1},\color{blue}{2}] + \ell^{112}_t[\color{blue}{1},[\color{blue}{1},\color{blue}{2}]] + \ell^{122}_t[[\color{blue}{1},\color{blue}{2}], \color{blue}{2}]$$

By previous considerations we can calculate the log signature iteratively using the BCH-formula. Indeed

$$ \ell^{\le3}_j = \log(\exp(\ell^{\le3}_{j-1})\otimes \exp(\Delta x_j)) \hspace{19em} \\= \ell^{\le3}_{j-1} + \Delta x_j + \frac{1}{2}[\ell^{\le3}_{j-1}, \Delta x_j] + \frac{1}{12}([\ell^{\le3}_{j-1}, [\ell^{\le3}_{j-1}, \Delta x_j]] + [\Delta x_j, [\Delta x_j, \ell^{\le3}_{j-1}]]) \, + \dots \\ = \ell^{\le3}_{j-1} + \Delta x_j + \frac{1}{2}[\ell^{\le2}_{j-1}, \Delta x_j] + \frac{1}{12}([\ell^{\le1}_{j-1}, [\ell^{\le1}_{j-1}, \Delta x_j]] + [\Delta x_j, [\Delta x_j, \ell^{\le1}_{j-1}]]) \, + \dots$$

Comparing coefficients we obtain
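
The resulting coordinate-wise update (our own write-up of the omitted display, obtained by expanding the level-3 BCH terms $\tfrac12[\mathbf{a},\mathbf{b}] + \tfrac1{12}([\mathbf{a},[\mathbf{a},\mathbf{b}]] + [\mathbf{b},[\mathbf{b},\mathbf{a}]])$ in the Lyndon basis, abbreviating $\ell^w = \ell^w_{j-1}$ and $\Delta^i = \Delta x^i_j$ on the right-hand side) reads

$$\ell^1_j = \ell^1 + \Delta^1, \qquad \ell^2_j = \ell^2 + \Delta^2, \\ \ell^{12}_j = \ell^{12} + \tfrac12\big(\ell^1\Delta^2 - \ell^2\Delta^1\big), \\ \ell^{112}_j = \ell^{112} - \tfrac12\, \ell^{12}\Delta^1 + \tfrac1{12}\big(\ell^1 - \Delta^1\big)\big(\ell^1\Delta^2 - \ell^2\Delta^1\big), \\ \ell^{122}_j = \ell^{122} + \tfrac12\, \ell^{12}\Delta^2 - \tfrac1{12}\big(\ell^2 - \Delta^2\big)\big(\ell^1\Delta^2 - \ell^2\Delta^1\big).$$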

Here is an implementation of this computation:
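
The original notebook cell is not included here; the following is a minimal pure-Python sketch of the level-3 update for $d=2$ (the function name `logsig_level3` and the Lyndon-basis ordering are our choices):

```python
from math import isclose

def logsig_level3(increments):
    """Level-3 log-signature of the piecewise linear path with the given list of
    2-d increments, in the Lyndon basis (1, 2, [1,2], [1,[1,2]], [[1,2],2]).
    Iterative BCH update; a minimal sketch, not the original lecture code."""
    l1 = l2 = l12 = l112 = l122 = 0.0
    for d1, d2 in increments:
        area = l1 * d2 - l2 * d1                 # coefficient produced by [l, dx]
        # level-3 coefficients first (they use the old l1, l2, l12)
        l112 += -0.5 * l12 * d1 + (l1 - d1) * area / 12.0
        l122 += 0.5 * l12 * d2 - (l2 - d2) * area / 12.0
        # then level 2, then level 1
        l12 += 0.5 * area
        l1 += d1
        l2 += d2
    return (l1, l2, l12, l112, l122)

# sanity check: log(exp(e1) x exp(e2)) = e1 + e2 + 1/2 [1,2]
#                                        + 1/12 [1,[1,2]] + 1/12 [[1,2],2]
sig = logsig_level3([(1.0, 0.0), (0.0, 1.0)])
expected = (1.0, 1.0, 0.5, 1 / 12, 1 / 12)
assert all(isclose(a, b, abs_tol=1e-12) for a, b in zip(sig, expected))
```

As a second check, a path with collinear increments has vanishing bracket terms, since $\exp(\mathbf{a})\otimes\exp(\mathbf{b}) = \exp(\mathbf{a}+\mathbf{b})$ whenever $[\mathbf{a},\mathbf{b}]=0$.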

Let us do a sanity check:

For the general case, i.e., higher dimension and truncation level, one can pre-compute the BCH formula in the Lyndon basis. This is implemented in the iisignature package. Let us wrap our function from above:

2. Implementing Deep Signature Stopping Rules

Recall that we have considered randomized stopping rules of the form

$$\tau_\theta = \inf\left\{ t \in [0,T] \;\bigg\vert\; \int_0^t \theta(\mathbb{X}\vert_{[0,s]})^2 ds \;\ge\; Z\right\},$$

where $Z$ is an independent non-negative random variable (below we take $Z\sim\mathrm{Exp}(1)$).


We have studied the case where $\theta(\mathbb{X}\vert_{[0,t]}) = \langle \ell, \mathbb{X}_{0,t} \rangle$ is a linear functional of the signature.

In the following we will, however, focus on the case where $\theta$ is a deep neural network.

More specifically we define a deep signature stopping rule by

$$\theta(\mathbb{X}\vert_{[0,t]}) = (A_I \circ \varphi \circ A_{I-1} \circ \varphi \cdots \circ A_0)(\log \mathbb{X}^{\le N}_{0,t}),$$

where the $A_i$ are affine maps (with hidden width $q$) and $\varphi$ is a componentwise activation function.


Denote by $\mathcal{T}_{\log}$ the space of deep signature stopping rules and recall that we also had the following approximation result.

Theorem (Bayer-H-Riedel-Schoenmakers)

Let $Y$ be adapted to the filtration generated by $\mathbb{X}$ with $\mathbb{E}[\Vert Y \Vert_\infty] < \infty$. Then it holds that

$$\sup_{\tau \in \mathcal{S}}\mathbb{E}[Y_\tau] \;=\; \sup_{\theta \in \mathcal{T}_{\log}}\mathbb{E}[Y_{\tau_\theta}],$$

where $\mathcal{S}$ is the set of all stopping times.

One of the important steps in the proof was that, due to the randomization, the expected stopped value is of the form

$$\mathbb{E}[Y_{\tau_\theta}] = Y_0 + \mathbb{E}\left[ \int_0^T \exp\left(-\int_0^t \theta(\mathbb{X}\vert_{[0,s]})^2 ds \right)dY_t\right],$$

in the case where we choose $Z\sim\mathrm{Exp}(1)$.

This expression immediately suggests a numerical approach to the optimization:

  1. Sample $\mathbb{X}$ and $Y$ on a time grid $\{t_0, \cdots, t_n\}$, say, $m$ times.
  1. Calculate the log-signature of each discretized sample path of $\hat{\mathbb{X}}$ for some truncation level $N$.
  1. In the above expression, replace integrals and expectations by discrete sums and optimize over $\theta \in \mathcal{T}_{\log}$ using backpropagation and stochastic gradient descent.
  1. Resimulate $\mathbb{X}$, $Y$ and $\log(\hat{\mathbb{X}}^{\le N})$ and use the result of the optimization, $\theta^{\ast} \in \mathcal{T}_{\log}$, to calculate a lower bound.
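
To make step 3 concrete, here is a minimal, framework-free sketch of the discretized objective for given samples (names and the plain-Python style are ours; in practice $\theta$ is evaluated by applying the network to the truncated log-signature, and the optimization is done with an autodiff library):

```python
from math import exp

def stopped_value_estimate(theta_vals, y_paths, dt):
    """Monte Carlo estimate of E[Y_tau] = E[Y_0 + int_0^T exp(-int_0^t theta^2 ds) dY_t]
    on a uniform grid. theta_vals[i][k] is theta evaluated on path i at time t_k
    (in practice: the network applied to the log-signature up to t_k)."""
    total = 0.0
    for theta, y in zip(theta_vals, y_paths):
        acc = 0.0                  # running integral of theta^2
        value = y[0]
        for k in range(1, len(y)):
            value += exp(-acc) * (y[k] - y[k - 1])
            acc += theta[k - 1] ** 2 * dt      # left-point rule
        total += value
    return total / len(y_paths)
```

Note that for $\theta \equiv 0$ the sum telescopes and the estimate is just the mean terminal value, which gives an easy consistency check.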

One remark before we start with the implementation:

Note that the number of parameters of the neural network depends on $q, I, N$ and $d$.

It does not depend on $n$, the size of the discretization grid!

This is in strong contrast to approaches that lift the path to a Markov process in $\mathbb{R}^{n\times d}$:

$$(X_{t_0}, 0, \dots, 0),\; (X_{t_1}, X_{t_0}, 0, \dots, 0),\; (X_{t_2}, X_{t_1}, X_{t_0}, 0, \dots, 0),\; \dots \in \mathbb{R}^{n\times d}.$$

3. Optimal Stopping of Fractional Brownian Motion

Finally, we turn to a specific example.

Let $X^H$ be a fractional Brownian motion (fBm) with Hurst parameter $H\in(0,1]$.

Note that the time augmented path $\hat{X} = ((t, X^H_t))_{0 \le t \le T}$ is a rough path. (Indeed, no lift is needed since all iterated (cross) integrals are pathwise well-defined as Young integrals.)

We are interested in the optimal stopping of the fBm, hence $Y=X^{H}$, i.e., in approximating

$$\sup_{\tau \in \mathcal{S}}\mathbb{E}[X^H_\tau].$$
First we will need to generate sample paths of fBm.

As explained on the exercise sheet, this can efficiently be done using the circulant embedding method and the fast Fourier transform.
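
The original notebook cell is not reproduced here; the following is a minimal pure-Python sketch of the circulant embedding (Davies-Harte) construction, with a textbook radix-2 FFT standing in for `numpy.fft` (all function names are ours; $n$ is assumed to be a power of two):

```python
import cmath
import random

def fft(a):
    """Radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def fgn_eigenvalues(n, H):
    """Eigenvalues of the circulant embedding (size 2n) of the covariance of
    fractional Gaussian noise on an integer grid; non-negative for fGn."""
    r = [0.5 * (abs(k + 1)**(2 * H) - 2 * abs(k)**(2 * H) + abs(k - 1)**(2 * H))
         for k in range(n + 1)]
    c = r + r[n - 1:0:-1]                      # first row of the circulant, length 2n
    return [z.real for z in fft([complex(x) for x in c])]

def sample_fbm(n, H, T=1.0, rng=random):
    """One fBm path X_{t_0}, ..., X_{t_n} on the uniform grid t_k = k T / n."""
    lam = fgn_eigenvalues(n, H)
    m = 2 * n
    # complex Gaussians in the Fourier domain; the real and imaginary parts of
    # the transform carry two independent fGn samples -- we keep the real part
    xi = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m)]
    y = fft([cmath.sqrt(max(l, 0.0) / m) * z for l, z in zip(lam, xi)])
    incr = [(T / n)**H * y[k].real for k in range(n)]   # scaled fGn increments
    path = [0.0]
    for dx in incr:
        path.append(path[-1] + dx)
    return path
```

For $H = 1/2$ the fGn covariance is a delta, so all embedding eigenvalues equal one, which gives a convenient deterministic check of the construction.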

... collecting all the code

Generating the data and training the network

Calculating Lower Bounds

Visualisation of the trained signature stopping rule

Results for different Hurst parameters