Machine Learning for Trading

Logo

A comprehensive introduction to how ML can add value to the design and execution of algorithmic trading strategies

View the Project on GitHub stefan-jansen/machine-learning-for-trading

Deep Reinforcement Learning: Building a Trading Agent

Reinforcement Learning (RL) is a computational approach to goal-directed learning performed by an agent that interacts with a typically stochastic environment which the agent has incomplete information about. RL aims to automate how the agent makes decisions to achieve a long-term objective by learning the value of states and actions from a reward signal. The ultimate goal is to derive a policy that encodes behavioral rules and maps states to actions.

This chapter shows how to formulate an RL problem and how to apply various solution methods. It covers model-based and model-free methods, introduces the OpenAI Gym environment, and combines deep learning with RL to train an agent that navigates a complex environment. Finally, we’ll show you how to adapt RL to algorithmic trading by modeling an agent that interacts with the financial market while trying to optimize an objective function.

Table of contents

  1. Key elements of a reinforcement learning system
  2. How to solve RL problems
  3. Deep Reinforcement Learning
  4. Code example: deep RL for trading with TensorFlow 2 and OpenAI Gym
  5. Resources

Key elements of a reinforcement learning system

RL problems feature several elements that set them apart from the ML settings we have covered so far. The following two sections outline the key features required for defining and solving an RL problem by learning a policy that automates decisions. We’ll use the notation and generally follow Reinforcement Learning: An Introduction (Sutton and Barto 2018) and David Silver’s UCL Courses on RL that are recommended for further study beyond the brief summary that the scope of this chapter permits.

RL problems aim to optimize an agent’s decisions based on an objective function vis-a-vis an environment.

The policy: translating states into actions

At any point in time, the policy defines the agent’s behavior. It maps any state the agent may encounter to one or several actions. In an environment with a limited number of states and actions, the policy can be a simple lookup table filled in during training.

Rewards: learning from actions

The reward signal is a single value that the environment sends to the agent at each time step. The agent’s objective is typically to maximize the total reward received over time. Rewards can also be a stochastic function of the state and the actions. They are typically discounted to facilitate convergence and reflect the time decay of value.

The value function: optimal decisions for the long run

The reward provides immediate feedback on actions. However, solving an RL problem requires decisions that create value in the long run. This is where the value function comes in: it summarizes the utility of states or of actions in a given state in terms of their long-term reward.

The environment

The environment presents information about its state to the agent, assigns rewards for actions, and transitions the agent to new states subject to probability distributions the agent may or may not know about. It may be fully or partially observable, and may also contain other agents. The design of the environment typically requires significant up-front design effort to facilitate goal-oriented learning by the agent during training.

RL problems differ by the complexity of their state and action spaces that can be either discrete or continuous. The latter requires ML to approximate a functional relationship between states, actions, and their value. They also require us to generalize from the subset of states and actions they are experienced by the agent during training.

Components of an interactive RL system

The components of an RL system typically include:

In addition, the environment emits a reward signal that reflects the new state resulting from the agent’s action. At the core, the agent usually learns a value function that shapes its judgment over actions. The agent has an objective function to process the reward signal and translate the value judgments into an optimal policy.

How to solve RL problems

RL methods aim to learn from experience on how to take actions that achieve a long-term goal. To this end, the agent and the environment interact over a sequence of discrete time steps via the interface of actions, state observations, and rewards that we described in the previous section.

There are numerous approaches to solving RL problems which implies finding rules for the agent’s optimal behavior:

Approaches for continuous state and/or action spaces often leverage ML to approximate a value or policy function. Hence, they integrate supervised learning, and in particular, the deep learning methods we discussed in the last several chapters. However, these methods face distinct challenges in the RL context:

Code example: dynamic programming – value and policy iteration

Finite MDPs are a simple yet fundamental framework. This section introduces the trajectories of rewards that the agent aims to optimize, and define the policy and value functions they are used to formulate the optimization problem and the Bellman equations that form the basis for the solution methods.

The notebook gridworld_dynamic_programming applies Value and Policy Iteration to a toy environment that consists of a 3 x 4 grid.

Code example: Q-Learning

Q-learning was an early RL breakthrough when it was developed by Chris Watkins for his PhD thesis in 1989 . It introduces incremental dynamic programming to control an MDP without knowing or modeling the transition and reward matrices that we used for value and policy iteration in the previous section. A convergence proof followed three years later by Watkins and Dayan.

Q-learning directly optimizes the action-value function, q, to approximate q*. The learning proceeds off-policy, that is, the algorithm does not need to select actions based on the policy that’s implied by the value function alone. However, convergence requires that all state-action pairs continue to be updated throughout the training process. A straightforward way to ensure this is by using an ε-greedy policy.

The Q-learning algorithm keeps improving a state-action value function after random initialization for a given number of episodes. At each time step, it chooses an action based on an ε-greedy policy, and uses a learning rate, α, to update the value function based on the reward and its current estimate of the value function for the next state.

The notebook gridworld_q_learning demonstrates how to build a Q-learning agent using the 3 x 4 grid of states from the previous section.

Deep Reinforcement Learning

This section adapts Q-Learning to continuous states and actions where we cannot use the tabular solution that simply fills an array with state-action values. Instead, we will see how to approximate the optimal state-value function using a neural network to build a deep Q network with various refinements to accelerate convergence. We will then see how we can use the OpenAI Gym to apply the algorithm to the Lunar Lander environment.

Value function approximation with neural networks

As in other fields, deep neural networks have become popular for approximating value functions. However, ML faces distinct challenges in the RL context where the data is generated by the interaction of the model with the environment using a (possibly randomized) policy:

The Deep Q-learning algorithm and extensions

Deep Q learning estimates the value of the available actions for a given state using a deep neural network. It was introduced by Deep Mind’s Playing Atari with Deep Reinforcement Learning (2013), where RL agents learned to play games solely from pixel input.

The deep Q-learning algorithm approximates the action-value function, q, by learning a set of weights, θ, of a multi-layered Deep Q Network (DQN) that maps states to actions.

Several innovations have improved the accuracy and convergence speed of deep Q-Learning, namely:

The Open AI Gym – the Lunar Lander environment

The OpenAI Gym is a RL platform that provides standardized environments to test and benchmark RL algorithms using Python. It is also possible to extend the platform and register custom environments.

The Lunar Lander (LL) environment requires the agent to control its motion in two dimensions, based on a discrete action space and low-dimensional state observations that include position, orientation, and velocity. At each time step, the environment provides an observation of the new state and a positive or negative reward. Each episode consists of up to 1,000 time steps.

Code example: Double Deep Q-Learning using Tensorflow

The lunar_lander_deep_q_learning notebook implements a DDQN agent that uses TensorFlow and Open AI Gym’s Lunar Lander environment.

Code example: deep RL for trading with TensorFlow 2 and OpenAI Gym

To train a trading agent, we need to create a market environment that provides price and other information, offers trading-related actions, and keeps track of the portfolio to reward the agent accordingly.

How to Design an OpenAI trading environment

The OpenAI Gym allows for the design, registration, and utilization of environments that adhere to its architecture, as described in its documentation.

The trading environment consists of three classes that interact to facilitate the agent’s activities:

  1. The DataSource class loads a time series, generates a few features, and provides the latest observation to the agent at each time step.
  2. TradingSimulator tracks the positions, trades and cost, and the performance. It also implements and records the results of a buy-and-hold benchmark strategy.
  3. TradingEnvironment itself orchestrates the process.

How to build a Deep Q-learning agent for the stock market

The notebook q_learning_for_trading demonstrates how to set up a simple game with a limited set of options, a relatively low-dimensional state, and other parameters that can be easily modified and extended to train the Deep Q-Learning agent used in lunar_lander_deep_q_learning.

Resources

RL Algorithms

Investment Applications