Machine Learning for Trading


A comprehensive introduction to how ML can add value to the design and execution of algorithmic trading strategies

View the Project on GitHub stefan-jansen/machine-learning-for-trading

Convolutional Neural Networks: Time Series as Images

In this chapter, we introduce the first specialized Deep Learning architectures that we will cover in part 4. Deep Convolutional Neural Networks, also called ConvNets or CNNs, have enabled superhuman performance in classifying images, video, speech, and audio. Recurrent nets, the subject of the following chapter, have performed exceptionally well on sequential data such as text and speech.

CNNs are named after the linear algebra operation called convolution that replaces the general matrix multiplication typical of feed-forward networks (discussed in the last chapter on Deep Learning) in at least one of their layers. We will discuss how convolutions work and why they are particularly useful for data with a certain regular structure like images or time series.

Research into CNN architectures has proceeded very rapidly and new architectures that improve benchmark performance continue to emerge. We will describe a set of building blocks that consistently appears in successful applications and illustrate their application to image data and financial time series. We will also demonstrate how transfer learning can speed up learning by using pre-trained weights for some of the CNN layers.

Content

  1. How CNNs learn to model grid-like data
  2. CNN for Images: From Satellite Data to Object Detection
  3. CNN for time series data: predicting stock returns

How CNNs learn to model grid-like data

CNNs are conceptually similar to the feedforward NNs we covered in the previous chapter. They consist of units that contain parameters called weights and biases, and the training process adjusts these parameters to optimize the network’s output for a given input. Each unit applies its parameters to a linear operation on the input data or activations received from other units, possibly followed by a non-linear transformation.

CNNs differ because they encode the assumption that the input has a structure most commonly found in image data where pixels form a two-dimensional grid, typically with several channels to represent the components of the color signal, such as the red, green and blue channels of the RGB color model.

The most important element to encode the assumption of a grid-like topology is the convolution operation that gives CNNs their name, combined with pooling. We will see that the specific assumptions about the functional relationship between input and output data imply that CNNs need far fewer parameters and compute more efficiently.
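To make the parameter savings concrete, the following back-of-the-envelope sketch compares a fully-connected layer with a convolutional layer applied to the same input. The sizes (a 32x32 RGB input, 64 units or filters, 3x3 kernels) are assumptions chosen purely for illustration.

```python
def dense_params(height, width, channels, units):
    """Weights plus biases of a dense layer on the flattened input."""
    return height * width * channels * units + units


def conv_params(kernel_size, in_channels, filters):
    """Weights plus biases of a 2D convolutional layer.

    The count does not depend on the image size because every
    filter is shared across all positions of the grid.
    """
    return kernel_size * kernel_size * in_channels * filters + filters


n_dense = dense_params(32, 32, 3, units=64)
n_conv = conv_params(3, in_channels=3, filters=64)
print(n_dense, n_conv)  # 196672 1792
```

Under these assumptions, weight sharing reduces the parameter count by roughly two orders of magnitude.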

Code example: From hand-coding to learning and synthesizing filters from data

For image data, local structure such as edges or corners has traditionally motivated the development of hand-coded filters that extract such patterns for use as features in machine learning models.
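As a minimal sketch of a hand-coded filter, the snippet below applies a Sobel kernel to a toy image using plain NumPy. The helper `cross_correlate2d` and the 6x6 test image are illustrative assumptions, not code from the accompanying notebooks. Note that deep learning libraries typically compute this cross-correlation under the name "convolution".

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without padding ('valid' mode)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 6x6 image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-coded Sobel kernel that responds to horizontal intensity changes.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

edges = cross_correlate2d(image, sobel_x)
print(edges)  # every row is [0. 4. 4. 0.]
```

The filter responds only where its window straddles the edge. A convolutional layer learns kernel values like these from data instead of relying on a fixed design.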

How the key elements of a convolutional layer operate

Fully-connected feedforward NNs make no assumptions about the topology, or local structure, of the input data, so that arbitrarily reordering the features has no impact on the training result.

For many data sources, however, local structure is quite significant. Examples include autocorrelation in time series or the spatial correlation among pixel values due to common patterns like edges or corners. For image data, this local structure has traditionally motivated the development of hand-coded filter methods that extract local patterns for use as features in machine learning models.

Computer Vision Tasks

Image classification is a fundamental computer vision task that requires labeling an image based on certain objects it contains. Many practical applications, including investment and trading strategies, require additional information.

The evolution of CNN architectures: key innovations

Several CNN architectures have pushed performance boundaries over the past two decades by introducing important innovations. Predictive performance growth accelerated dramatically with the arrival of big data in the form of ImageNet (Fei-Fei 2015) with 14 million images assigned to 20,000 classes by humans via Amazon’s Mechanical Turk. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became the focal point of CNN progress around a slightly smaller set of 1.2 million images from 1,000 classes.

CNN for Images: From Satellite Data to Object Detection

This section demonstrates how to solve key computer vision tasks such as image classification and object detection. As mentioned in the introduction and in Chapter 3 on alternative data, image data can inform a trading strategy by providing clues about future trends, changing fundamentals, or specific events relevant for a target asset class or investment universe. Popular examples include exploiting satellite images for clues about the supply of agricultural commodities, consumer and economic activity, or the status of manufacturing or raw material supply chains. Specific tasks might include, for example:

Code example: LeNet5: The first CNN with industrial applications

All libraries we introduced in the last chapter provide support for convolutional layers.

The notebook digit_classification_with_lenet5 illustrates the LeNet5 architecture using the basic MNIST handwritten digit dataset.

Code example: AlexNet - reigniting deep learning research

Fast-forward to 2012, and we move on to the deeper and more modern AlexNet architecture. We will use the CIFAR10 dataset, which contains 60,000 images at 32x32 pixel resolution (compared to the 224x224 inputs used for ImageNet), but still with three color channels. The images belong to only 10 classes, compared to the 1,000 classes of the ImageNet challenge.

See the notebook image_classification_with_alexnet for implementation, including the use of data augmentation.
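The idea behind data augmentation can be sketched in a few lines of NumPy. The following example randomly flips and shifts images. The function `augment`, the shift range, and the random stand-in batch are assumptions for illustration, so the transformations in the notebook may differ.

```python
import numpy as np


def augment(image, rng, max_shift=4):
    """Randomly flip an (H, W, C) image horizontally and shift it by a few pixels."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Pad with zeros, then crop a window of the original size at the shifted offset.
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    h, w = image.shape[:2]
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + h, left:left + w, :]


rng = np.random.default_rng(42)
batch = rng.random((8, 32, 32, 3))  # stand-in for CIFAR10-sized images
augmented = np.stack([augment(img, rng) for img in batch])
print(augmented.shape)  # (8, 32, 32, 3)
```

Because each epoch sees slightly different versions of the same images, augmentation acts as a regularizer without requiring additional labeled data.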

Code example: transfer learning with VGG16 in practice

In practice, we often do not have enough data to train a CNN from scratch with random initialization. Transfer learning is a machine learning technique that repurposes a model trained on one set of data for another task. Naturally, it works if the learning from the first task carries over to the task of interest. If successful, it can lead to better performance and faster training that requires less labeled data than training a neural network from scratch on the target task.

TensorFlow 2, for example, contains pre-trained models for several of the reference architectures discussed previously, namely VGG16 and its larger version VGG19, ResNet50, InceptionV3, and InceptionResNetV2, as well as MobileNet, DenseNet, NASNet, and MobileNetV2.

The transfer learning approach to CNN relies on pre-training on a very large dataset like ImageNet. The goal is that the convolutional filters extract a feature representation that generalizes to new images. In a second step, it leverages the result either to initialize and retrain a new CNN or as inputs to a new network that tackles the task of interest.

CNN architectures typically use a sequence of convolutional layers to detect hierarchical patterns, adding one or more fully-connected layers to map the convolutional activations to the outcome classes or values. The outputs of the last convolutional layer that feed into the fully-connected part are called bottleneck features. We can use the bottleneck features of a pre-trained network as inputs into a new fully-connected network, usually after applying a ReLU activation function.

In other words, we freeze the convolutional layers and replace the dense part of the network. An additional benefit is that we can then use inputs of different sizes because it is the dense layers that constrain the input size.

Alternatively, we can use the bottleneck features as inputs into a different machine learning algorithm. In the AlexNet architecture, e.g., the bottleneck layer computes a vector with 4096 entries for each 224 x 224 input image. We then use this vector as features for a new model.
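The workflow of feeding bottleneck features into a separate model can be sketched without any deep learning library. In the example below, a fixed random projection followed by a ReLU stands in for the frozen, pre-trained convolutional base, and a logistic regression trained by gradient descent plays the role of the new model. All names, sizes, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained base: a fixed random projection plus ReLU.
# In practice this would be the convolutional part of a pre-trained CNN.
W_frozen = rng.normal(size=(20, 64))
W_before = W_frozen.copy()


def bottleneck_features(x):
    return np.maximum(x @ W_frozen, 0.0)


# Synthetic binary classification data for the new task.
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Extract the features once, then standardize them for the new model.
features = bottleneck_features(X)
features = (features - features.mean(axis=0)) / features.std(axis=0)

# New trainable model: logistic regression fitted by gradient descent.
w, b = np.zeros(features.shape[1]), 0.0


def log_loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))


initial_loss = log_loss(w, b)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    w -= 0.1 * features.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)
final_loss = log_loss(w, b)
print(initial_loss, final_loss)
```

Only `w` and `b` are updated, while `W_frozen` never changes. This mirrors how the pre-trained layers remain fixed while the new model learns the target task.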

We can also go a step further and not only replace and retrain the classifier on top of the CNN using new data but also fine-tune the weights of the pre-trained CNN. To achieve this, we continue training, either for all layers or only for later layers while freezing the weights of some earlier layers. The motivation is to preserve presumably more generic patterns learned by lower layers, such as edge or color blob detectors, while allowing later layers of the CNN to adapt to the details of a new task. ImageNet, e.g., contains a wide variety of dog breeds, which may lead to feature representations specifically useful for differentiating between these classes.

How to extract bottleneck features

The notebook bottleneck_features illustrates how to download the pre-trained VGG16 model, either with the final layers to generate predictions or without the final layers to extract the outputs produced by the bottleneck features.

How to fine-tune a pre-trained model

The notebook transfer_learning, adapted from a TensorFlow 2 tutorial, demonstrates how to freeze some or all of the layers of a pre-trained model and continue training using a new fully-connected set of layers and data with a different format.

Code example: identifying land use with satellite images using transfer learning

Satellite images figure prominently among alternative data (see Chapter 3). For instance, commodity traders may rely on satellite images to predict the supply of certain crops or activity at mining sites, oil or tanker traffic.

To illustrate working with this type of data, we load the EuroSat dataset included in the TensorFlow 2 datasets (Helber et al. 2017). The EuroSat dataset includes around 27,000 images in 64x64 format that represent 10 different types of land uses.

The notebook satellite_images downloads the DenseNet201 architecture from tensorflow.keras.applications and replaces its final layers.

We use 10 percent of the training images for validation purposes and achieve the best out-of-sample classification accuracy of 97.96 percent after ten epochs. This exceeds the performance cited in the original paper for the best performing ResNet-50 architecture with 90-10 split.

Code example: object detection in practice with Google Street View House Numbers

Object detection requires the ability to distinguish between several classes of objects and to decide how many and which of these objects are present in an image.

A prominent example is Ian Goodfellow’s identification of house numbers from Google’s street view dataset. It requires to identify

See the data directory for instructions on obtaining the dataset.

Preprocessing the source images

The notebook svhn_preprocessing contains code to produce a simplified, cropped dataset that uses bounding box information to create regularly shaped 32x32 images containing the digits; the original images are of arbitrary shape.
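The cropping step can be sketched as follows. The box format `(top, left, height, width)`, the helper `crop_and_resize`, the nearest-neighbor resizing, and the random stand-in image are assumptions for illustration, so the notebook may organize the bounding box data and the resizing differently.

```python
import numpy as np


def crop_and_resize(image, box, size=32):
    """Crop an (H, W, C) image to box=(top, left, height, width) and
    resize the crop to (size, size) using nearest-neighbor sampling."""
    top, left, height, width = box
    crop = image[top:top + height, left:left + width]
    rows = (np.arange(size) * crop.shape[0] / size).astype(int)
    cols = (np.arange(size) * crop.shape[1] / size).astype(int)
    return crop[rows][:, cols]


rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(75, 180, 3), dtype=np.uint8)
digit_boxes = [(10, 40, 50, 30), (12, 72, 48, 28)]  # hypothetical per-digit boxes

# The union of the digit boxes gives one box around the whole number.
top = min(b[0] for b in digit_boxes)
left = min(b[1] for b in digit_boxes)
bottom = max(b[0] + b[2] for b in digit_boxes)
right = max(b[1] + b[3] for b in digit_boxes)

patch = crop_and_resize(image, (top, left, bottom - top, right - left))
print(patch.shape)  # (32, 32, 3)
```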

Transfer learning with a custom final layer for multiple outputs

The notebook svhn_object_detection goes on to illustrate how to build a deep CNN using Keras’ functional API to generate multiple outputs: one to predict how many digits are present, and five for the value of each in the order they appear.
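A network with one output for the digit count and five outputs for the digit values needs targets in a matching format. The sketch below shows one possible encoding. The extra "blank" class for unused positions and the helper `encode_house_number` are assumptions, not necessarily the scheme used in the notebook.

```python
import numpy as np

MAX_DIGITS = 5
BLANK = 10  # extra class marking an unused digit position (an assumption)


def encode_house_number(digits):
    """Return (length, padded_digits) targets for a sequence of digits."""
    if not 1 <= len(digits) <= MAX_DIGITS:
        raise ValueError("expected between 1 and 5 digits")
    padded = list(digits) + [BLANK] * (MAX_DIGITS - len(digits))
    return len(digits), np.array(padded)


length, positions = encode_house_number([1, 9, 3])
print(length, positions)  # 3 [ 1  9  3 10 10]
```

Each of the six targets can then be paired with its own softmax output and loss term in a multi-output model.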

CNN for time series data: predicting stock returns

CNNs were originally developed to process image data and have achieved superhuman performance on various computer vision tasks. As discussed in the first section, time series data has a grid-like structure similar to that of images, and CNNs have been successfully applied to one-, two- and three-dimensional representations of temporal data.

The application of CNNs to time series will most likely bear fruit if the data meets the model’s key assumption that local patterns or relationships help predict the outcome. In the time-series context, local patterns could be autocorrelation or similar non-linear relationships at relevant intervals. Along the second and third dimension, local patterns imply systematic relationships among different components of a multivariate series or among these series for different tickers. Since locality matters, it is important that the data is organized accordingly, in contrast to feed-forward networks, where shuffling the elements of any dimension does not negatively affect the learning process.

Code example: building an autoregressive CNN with 1D convolutions

We will introduce the time series use case for CNN with a univariate autoregressive asset return model. More specifically, the model receives the most recent 12 months of returns and uses a single layer of one-dimensional convolutions to predict the subsequent month.
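The forward pass of such a model can be sketched in NumPy. A single filter slides over the 12 monthly returns, and a dense output unit combines the resulting activations. The filter width of three months, the fixed weights, and the synthetic returns are illustrative assumptions. In a real model, the weights are learned and several filters are typically used.

```python
import numpy as np


def conv1d_valid(series, kernel, bias=0.0):
    """Cross-correlate a 1D series with a kernel (no padding, stride 1)."""
    windows = np.lib.stride_tricks.sliding_window_view(series, len(kernel))
    return windows @ kernel + bias


rng = np.random.default_rng(7)
returns = rng.normal(0.01, 0.05, size=12)  # synthetic 12 monthly returns

# One filter spanning 3 months; the weights are illustrative, not learned.
kernel = np.array([0.2, 0.3, 0.5])
activations = np.maximum(conv1d_valid(returns, kernel), 0.0)  # ReLU

# A dense output unit maps the 10 activations to a single forecast.
output_weights = np.full(activations.shape, 0.1)
prediction = activations @ output_weights
print(activations.shape, prediction)
```

With 12 inputs and a kernel of width 3, the unpadded convolution yields 12 - 3 + 1 = 10 activations, each summarizing one three-month window.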

The notebook time_series_prediction illustrates the time series use case with the univariate asset price forecast example we introduced in the last chapter. Recall that we create rolling monthly stock returns and use the 24 lagged returns alongside one-hot-encoded month information to predict whether the subsequent monthly return is positive or negative.
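The feature layout described above can be sketched as follows. The snippet uses synthetic returns and assumes that the series starts in January and that the month dummy refers to the month being predicted. Both are assumptions for illustration, not details taken from the notebook.

```python
import numpy as np

N_LAGS = 24

rng = np.random.default_rng(3)
monthly_returns = rng.normal(0.005, 0.04, size=120)  # synthetic 10-year history

# Each row holds 24 consecutive returns; the target is the sign of the next one.
windows = np.lib.stride_tricks.sliding_window_view(monthly_returns, N_LAGS + 1)
X_lags = windows[:, :N_LAGS]
y = (windows[:, N_LAGS] > 0).astype(int)

# One-hot encode the calendar month of the target (month index 0..11).
target_month = (np.arange(len(y)) + N_LAGS) % 12
month_dummies = np.eye(12)[target_month]

X = np.hstack([X_lags, month_dummies])
print(X.shape, y.shape)  # (96, 36) (96,)
```

For a convolutional model, the lagged returns would keep their temporal order along one axis, whereas the month dummies carry no local structure and would typically enter the network separately.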

Code example: CNN-TA - clustering financial time series in 2D image format

To exploit the grid-like structure of time-series data, we can use CNN architectures for univariate and multivariate time series. In the latter case, we consider different time series as channels, similar to the different color signals.

An alternative approach converts a time series of alpha factors into a two-dimensional format to leverage the ability of CNNs to detect local patterns. Sezer and Ozbayoglu (2018) propose CNN-TA that computes 15 technical indicators for different intervals and uses hierarchical clustering (see Chapter 13) to locate indicators that behave similarly close to each other in a 2D grid.

Creating the 2D time series of financial indicators

The notebook engineer_cnn_features creates technical indicators at different intervals.

Select and cluster the most relevant features

The notebook convert_cnn_features_to_image_format selects the 15 most relevant features from the 20 candidates to fill the 15⨉15 input grid and then applies hierarchical clustering.
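The reordering step can be illustrated with SciPy's hierarchical clustering. The sketch below clusters the columns of a synthetic indicator matrix and uses the dendrogram's leaf order to place similar indicators next to each other. The average linkage, the correlation distance, and the data are assumptions, so the notebook may use different settings.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(5)

# Synthetic stand-in: 200 observations of 6 indicators driven by 2 latent factors.
factors = rng.normal(size=(200, 2))
loadings = np.array([0, 1, 0, 1, 0, 1])  # indicators alternate between factors
indicators = factors[:, loadings] + 0.1 * rng.normal(size=(200, 6))

# Cluster the indicators (columns) using correlation distance, then read off
# the dendrogram's leaf order so that similar indicators become neighbors.
Z = linkage(indicators.T, method="average", metric="correlation")
order = leaves_list(Z)
reordered = indicators[:, order]
print(order, loadings[order])
```

After reordering, indicators driven by the same factor occupy adjacent columns, which is the kind of local structure a convolutional filter can exploit.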

Create and train a convolutional neural network

Now we are ready to design, train and evaluate a CNN following the steps outlined in the previous section. The notebook cnn_for_trading contains the relevant code examples.

Backtesting a long-short trading strategy

To get a sense of the signal quality, we compute the spread between equal-weighted portfolios invested in stocks selected according to the signal quintiles using Alphalens (see Chapter 4).
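The underlying calculation can be sketched for a single period with NumPy. This is a simplified stand-in rather than the Alphalens API: the function `quintile_spread` and the synthetic signal and returns are assumptions for illustration.

```python
import numpy as np


def quintile_spread(signal, forward_returns):
    """Mean forward return of the top signal quintile minus the bottom one."""
    edges = np.quantile(signal, [0.2, 0.4, 0.6, 0.8])
    quintile = np.digitize(signal, edges)  # 0 (lowest) .. 4 (highest)
    top = forward_returns[quintile == 4].mean()
    bottom = forward_returns[quintile == 0].mean()
    return top - bottom


rng = np.random.default_rng(11)
signal = rng.normal(size=500)  # synthetic model predictions for 500 stocks
forward_returns = 0.01 * signal + rng.normal(0, 0.02, size=500)

spread = quintile_spread(signal, forward_returns)
print(spread)
```

Taking the plain mean within each quintile corresponds to equal weighting. A positive spread indicates that stocks with higher signal values earned higher subsequent returns in this synthetic sample.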