Briefly, neural networks make predictions via forward propagation, whereby information advances in a single direction from the input nodes, through any hidden layers, to the output nodes. At each node, the dot product of the incoming values and their associated weights is computed, and an activation function is applied to the result. Activation functions include the identity, sigmoid, tanh and ReLU; the non-linear ones allow the model to capture non-linearities. The loss function aggregates the errors in predictions over many data points into a single number that indicates model performance. The classical, and still widely used, training algorithm for neural networks is stochastic gradient descent, which iteratively adjusts the weights to reduce the value of the loss function. The gradients it needs are computed by backpropagation, which requires the network to use activation functions that are differentiable (or differentiable almost everywhere, as with ReLU). Predictions of the neural network are compared with the desired output, the gradient of the error with respect to each weight is computed by propagating the error backwards through the network, and the weights are then updated in the direction that reduces the error.
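As a minimal sketch of the ideas above, the snippet below runs forward propagation and gradient descent for a single sigmoid neuron. The single neuron, the squared-error loss on one sample, the learning rate and the names forward and sgd_step are all illustrative assumptions, not taken from any particular library or from the models discussed later.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b):
    """Forward propagation: dot product of inputs and weights, plus a bias,
    passed through a sigmoid activation."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

def sgd_step(x, y, w, b, lr=0.5):
    """One gradient descent update on the loss 0.5 * (p - y)**2 for one sample."""
    p = forward(x, w, b)
    # Chain rule: dLoss/dz = (p - y) * sigmoid'(z), where sigmoid'(z) = p * (1 - p)
    delta = (p - y) * p * (1.0 - p)
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    new_b = b - lr * delta
    return new_w, new_b

# Made-up sample and starting weights, purely for illustration
x, y = [0.5, -1.0], 1.0
w, b = [0.1, 0.2], 0.0

loss_before = 0.5 * (forward(x, w, b) - y) ** 2
for _ in range(100):
    w, b = sgd_step(x, y, w, b)
loss_after = 0.5 * (forward(x, w, b) - y) ** 2

print(loss_before, loss_after)  # the loss should shrink after the updates
```

The update relies on the sigmoid being differentiable, which is why the choice of activation function matters for training.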

In traditional neural networks, we assume that all inputs (and outputs) are independent of each other. In contrast, recurrent neural networks (RNNs) make use of sequential information. RNNs add weights to the network that create cycles in the network graph, allowing an internal state to be maintained between time steps. This maintenance of state means the network can explicitly learn and exploit context in sequence prediction problems. RNNs are typically difficult to train: because they rely on gradient-based methods (e.g. backpropagation), they suffer from vanishing or exploding gradients, which result in long training times and diminished accuracy. If the sequences are long, the gradients (the values used to tune the network) computed during backpropagation either vanish (through repeated multiplication of values between 0 and 1) or explode (through repeated multiplication of values greater than 1), so the network trains very slowly or unstably.

Long Short-Term Memory (LSTM) is an RNN architecture that addresses the problem of training over long sequences and retaining memory. LSTMs mitigate the gradient problem by introducing gates that control access to the cell state. There are three types of gates within a unit: (1) the forget gate, which conditionally decides what information to discard from the unit; (2) the input gate, which conditionally decides which values from the input are used to update the memory state; and (3) the output gate, which conditionally decides what to output based on the input and the memory of the unit. Each unit is like a mini state machine whose gates have weights that are learned during training.
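The sketch below illustrates both points. It first shows how a repeated product shrinks or blows up over 50 time steps, then runs one unit of the commonly used LSTM formulation with a hidden size of 1. The factors 0.5 and 1.5, the params dictionary, the function name lstm_step and all weight values are hypothetical choices for illustration; real weights are learned during training.

```python
import math

# Repeated multiplication by a factor below 1 shrinks a gradient towards zero,
# while a factor above 1 makes it blow up (50 time steps, illustrative factors).
vanished = 0.5 ** 50
exploded = 1.5 ** 50
print(vanished, exploded)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM unit (hidden size 1).

    params maps each of "forget", "input", "output" and "candidate" to a
    hypothetical (input_weight, recurrent_weight, bias) tuple.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))        # how much old memory to keep
    i = sigmoid(pre_activation("input"))         # how much new information to write
    o = sigmoid(pre_activation("output"))        # how much of the memory to expose
    g = math.tanh(pre_activation("candidate"))   # candidate new memory content
    c = f * c_prev + i * g                       # updated cell (memory) state
    h = o * math.tanh(c)                         # output / hidden state
    return h, c

# Arbitrary example weights, not learned values
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Note that the cell state is updated additively (f * c_prev + i * g), so a forget gate close to 1 lets stored information persist across many time steps.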

Much of the vast literature available on recurrent neural networks (RNNs) focuses on interesting topics such as:

* Different types of RNNs

· RNN

· LSTM

· GRU

* Model architectures

* Deep learning performance

· Improve Performance With Data (Get More Data, Invent More Data, Rescale Your Data, Transform Your Data, Feature Selection)

· Improve Performance With Algorithms (Spot-Check Algorithms, Steal From Literature, Resampling Methods)

· Improve Performance With Algorithm Tuning (Diagnostics, Weight Initialization, Learning Rate, Activation Functions, Network Topology, Batches and Epochs, Regularization, Optimization and Loss, Early Stopping)

· Improve Performance With Ensembles (Combine Models, Combine Views, Stacking)

Whilst all these topics are important, information on the basics is lacking, as is the case for many data science topics. In this case I am referring to data shaping for RNNs. The examples below illustrate the difference in the shape of the input data for a standard neural network and an RNN.

A simple neural network

The simple classification neural network below aims to classify the diabetes status (0/1) of women based on eight features: number of times pregnant; plasma glucose concentration at 2 hours in an oral glucose tolerance test; diastolic blood pressure (mm Hg); triceps skin fold thickness (mm); 2-hour serum insulin (mu U/ml); body mass index (weight in kg/(height in m)^2); diabetes pedigree function; and age (years). As illustrated, the input data is a 2D array, with one row per sample and one column per feature.
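To make the 2D shape concrete, here is a small sketch using NumPy. The values are random placeholders rather than the real diabetes data, and the five rows are an arbitrary choice; only the eight columns correspond to the features listed above.

```python
import numpy as np

n_samples = 5    # arbitrary number of rows, for illustration only
n_features = 8   # one column per feature listed above

rng = np.random.default_rng(0)
X = rng.random((n_samples, n_features))   # synthetic stand-in for the feature table
y = rng.integers(0, 2, size=n_samples)    # synthetic 0/1 diabetes labels

print(X.shape)  # (5, 8): (samples, features)
print(y.shape)  # (5,): one label per sample
```

Each row is treated as an independent observation, so no time dimension is needed.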

A simple LSTM model

The simple LSTM model below aims to predict the number of international airline passengers in units of 1,000. The dataset consists of 144 observations, representing monthly data from January 1949 to December 1960 (12 years). As illustrated, the input data has been reshaped into a 3D array whose dimensions represent samples, time steps and features.
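The reshaping step can be sketched as follows, again with NumPy. The series is a synthetic stand-in of 144 values, not the real passenger counts, and the window length of 3 time steps and the helper name make_windows are assumptions made for this example.

```python
import numpy as np

def make_windows(series, time_steps):
    """Slice a 1D series into overlapping input windows and next-step targets."""
    X, y = [], []
    for start in range(len(series) - time_steps):
        X.append(series[start:start + time_steps])
        y.append(series[start + time_steps])
    return np.array(X), np.array(y)

series = np.arange(144, dtype=float)   # synthetic stand-in for 144 monthly values

X2d, y = make_windows(series, time_steps=3)
print(X2d.shape)  # (141, 3): (samples, time steps)

# Add a trailing axis so the array is (samples, time steps, features);
# the series is univariate, so there is a single feature.
X3d = X2d.reshape((X2d.shape[0], X2d.shape[1], 1))
print(X3d.shape)  # (141, 3, 1)
```

With 144 observations and a window of 3, there are 141 samples, because the last three values have no later value left to serve as a target.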