# The back propagation algorithm

One of the most popular algorithms for training multilayer artificial **neural networks** is the back propagation algorithm, or retro-propagation algorithm. In this article I will try to explain its fundamentals, through a simplified implementation of a **neural network** that allows testing with different configurations of the network.

In this link you can download the source code of the BackPropagationDemo solution, with the code and data examples used in this article. It is written in **csharp** using **Visual Studio** **2017**.

The code is based on the one in this post by Andrew Kirillov in **CodeProject**, although I have simplified it a lot to make it easier to focus on the learning algorithm.

In order to apply this algorithm, the network must be a **feedforward** one, that is, it must have more than one layer of neurons and these must not form cycles; the outputs of one layer only feed the inputs of the next one, as in this figure:

The input data of the network is represented by the light circles on the left; the network has three hidden layers with 5, 4 and 3 **neurons** and an output layer with a single neuron. Each **neuron** has a single output, which is applied as input to all **neurons** in the next layer, and has as many inputs as the number of **neurons** in the previous layer.

Each **neuron** assigns a weight to each of its inputs, that is, it multiplies the value of each entry by a fixed amount depending on the importance assigned to it during the learning process; all these values, the entries multiplied by their weights, are added, and another value, called **threshold**, is added too to the result; this threshold is independent of the input and allows adjusting up or down the value of the sum of the entries times their weights. Each **neuron** implements an internal function that is applied to this sum to obtain the output result. Although each **neuron** can implement a different function, usually all of them use the same one; the function must be derivable in order to implement the backpropagation learning algorithm.

There are many appropriate functions to implement artificial neurons, I have chosen the bipolar sigmoid, whose result is bounded between -1 and 1 and has this form:

`S(x) = (2 / (1 + e`

^{-αx})) – 1

Its derivative is as follows:

`S’(x) = α (1 – x`

^{2}) / 2

The **α** parameter allows us to configure the function by providing a scale factor that allows its slope to be more or less smooth.

## The Neuron class

In the **NeuralNetwork** namespace, you will find the classes used to implement the neural network. The **Neuron** class implements the individual **neurons**, using the following data in its implementation:

`private double[] _weights = null;`

private double[] _weightUpdates = null;

private double _threshold = 0;

private double _thresholdUpdate = 0;

private double _alpha = 0;

private double _output = 0;

**_weights** is an array that contains the weights assigned to the inputs, while **_threshold** contains the threshold value of the **neuron**. **_weightUpdates** and **_thresholdUpdate** are used during the learning process to temporarily store the updates of the weights and the threshold. **_alpha** is the parameter to configure the sigmoid function, and **_output** is the value of the output, and is used to avoid performing redundant calculations.

When the **neuron** is created, the initial weights are randomly selected; the output value is calculated with the following methods:

`public double CalculateOutput(double[] inputs)`

{

double sum = 0;

for (int w = 0; w < _weights.Length; w++)

{

sum += inputs[w] * _weights[w];

}

_output = Sigmoid(sum + _threshold);

return _output;

}

private double Sigmoid(double x)

{

return (2 / (1 + Math.Exp(-_alpha * x))) - 1;

}

To the **CalculateOutput** method is passed an array with the values of the current inputs, and it returns the corresponding output value. The derivative of the function in the output, which will be used in the learning algorithm, is obtained from the **Delta** property:

`public double Delta`

{

get

{

return _alpha * (1 - (_output * _output)) / 2;

}

}

There are also some methods that will be used in the learning process to obtain and modify the weights assigned to the inputs:

`public double GetWeight(int input)`

{

return _weights[input];

}

public double GetWeightUpdate(int input)

{

return _weightUpdates[input];

}

public void SetWeightUpdate(int input, double update)

{

_weightUpdates[input] = update;

}

public void ApplyUpdates()

{

for (int w = 0; w < _weights.Length; w++)

{

_weights[w] += _weightUpdates[w];

}

Threshold += _thresholdUpdate;

}

## The Layer class

**Neurons** are organized in layers. A layer is simply an array of **Neuron** objects, and it is implemented in the **Layer** class. The calculations required for the backpropagation algorithm are also implemented here. These are the data used in the implementation of this class:

`private Neuron[] _neurons = null;`

private Layer _prevLayer = null;

private Layer _nextLayer = null;

private double[] _outputs = null;

private double[] _errors = null;

private double _momentum = 0;

The **_neurons** array contains the neurons of the layer, and with **_prevLayer** and **_nextLayer** we have references to the previous layer and the next layer, which we will use in the process of learning and propagating the calculations when computing the inputs of the network to obtain the output. **_outputs** contains the outputs of the neurons in the layer, to avoid performing redundant calculations. The two remaining variables are used only during the learning process: **_errors** will contain the calculated errors for the output of each of the neurons in the layer, and **_momentum** is a parameter that allows us to configure the updating of the weights, as we will see when reviewing the algorithm itself.

There is a special constructor for the input layer and another one for the rest of layers. The difference is that the input layer does not have a previous layer, and the number of inputs per **neuron** is equal to the number of values that are presented to the network for each input vector, while the rest of the layers all have a previous layer and the number of entries per neuron is the number of outputs of that previous layer:

`public Layer(int inputs, int neurons, double alpha)`

{

SetNeurons(inputs, neurons, alpha);

}

public Layer(int neurons, double alpha, Layer prevlayer)

{

SetNeurons(prevlayer.OutputCount, neurons, alpha);

_prevLayer = prevlayer;

_prevLayer.NextLayer = this;

}

In order to process a data vector, the following method is used, which must be called only for the first layer, with the input data as a parameter:

`public void Process(double [] inputs)`

{

for(int n = 0; n < _neurons.Length; n++)

{

_outputs[n] = _neurons[n].CalculateOutput(inputs);

}

if (_nextLayer != null)

{

_nextLayer.Process(Outputs);

}

}

The results will be obtained by reading the outputs of the last layer.

## The Network class

The **Network** class implements the whole **neural network**, as an array of **Layer** objects. This is the constructor, where everything is initialized:

`public Network(int datainputs, int inputs, int outputs, `

int[] hidden, double alpha)

{

_layers = new Layer[2 + ((hidden != null) ? hidden.Length : 0)];

_layers[0] = new Layer(datainputs, inputs, alpha);

Layer prev = _layers[0];

if ((hidden != null) && (hidden.Length > 0))

{

for (int h = 0; h < hidden.Length; h++)

{

_layers[1 + h] = new Layer(hidden[h], alpha, prev);

prev = _layers[1 + h];

}

}

_layers[_layers.Length - 1] = new Layer(outputs, alpha, prev);

_alpha = alpha;

}

The **datainputs** parameter contains the number of values that make up each input vector, **inputs** is the number of **neurons** in the input layer, **outputs** the number of neurons in the output layer, and the **hidden** array contains as many elements as hidden layers are, each one with the number of neurons corresponding to the layer; this parameter can be null, in which case the network will only have an input layer and an output layer. The **alpha** parameter is used to initialize the **neurons**.

To process an input vector, or an array of vectors, the following methods are used:

`public double[] Compute(double[] inputs)`

{

_layers[0].Process(inputs);

return _layers[_layers.Length - 1].Outputs;

}

public double[][] Compute(double[][] inputs)

{

double[][] result = new double[inputs.Length][];

for (int ix = 0; ix < inputs.Length; ix++)

{

_layers[0].Process(inputs[ix]);

result[ix] = new double[_layers[_layers.Length - 1].Outputs.Length];

_layers[_layers.Length - 1].Outputs.CopyTo(result[ix], 0);

}

return result;

}

Which receive as a parameter the network inputs and return the corresponding outputs.

## The backpropagation algorithm

Let's finally go to the algorithm itself. I will try to explain it without using mathematical formulations, from which you can find a lot by searching on internet, since I try to convey a practical understanding of the procedure.

To train the network, we take a set of data with a number m of samples. Each sample consists of some values corresponding to the inputs and values that correspond to the outputs we expect to obtain; the backpropagation algorithm is of the type of learning we call **supervised**, since we need to know in advance the output we want to obtain. The algorithm proceeds in the following way for each of the samples:

- Calculate the output value of the network from the input values.
- Starting from the output layer, calculate the error committed between the actual output and the one provided by the network.
- This error propagates back to the previous layer, which also calculates an error value for each
**neuron**from the errors of the next layer. In this step we obtain the current error of the network, from which we obtain the average for all the samples. - Once the errors for all the layers have been calculated, the necessary updates for the weights of the neurons of each layer are calculated. In this phase there are two configurable parameters: the
**learning rate**, which allows making these updates in more or less small steps, and the**momentum**, which allows taking into account the value of the previous weight update. By playing with these two values, we can fine-tune the learning process. - Finally, these updates are applied to all
**neurons**in the network.

Once all the samples have been processed, if the average error is lower than a prefixed value, or if we have already processed the whole data set a maximum number of times, the algorithm ends and the network is trained.

In the code, this is the method of the **Network** class that processes each sample:

`private Task<double> TrainSample(double[] inputs, `

double[] finaloutputs, double learningrate)

{

return Task.Run(() =>

{

_layers[0].Process(inputs);

double error = _layers[_layers.Length - 1].ComputeError(finaloutputs);

_layers[0].ComputeUpdates(inputs, learningrate);

_layers[0].ApplyUpdates();

return error;

});

}

As parameters, it receives the input values in the **inputs** parameter, the expected output values in **finaloutputs** and the learning rate in **learningrate**, this rate is usually less than 1, in the order of 0.001 or 0.01. The network error for the current sample is calculated starting with the output layer, calling the **ComputeError** method of the **Layer** class:

`public double ComputeError(double[] finaloutput)`

{

double error = 0;

for (int n = 0; n < _neurons.Length; n++)

{

double e = finaloutput[n] - _outputs[n];

_errors[n] = e * _neurons[n].Delta;

error += e * e;

}

_prevLayer.ComputeError();

return error / 2;

}

For each **neuron** in the layer, the difference between its current output and the expected output is calculated; the error in the output layer for this **neuron** is the product of this difference and the value of the derivative of the **neuron**'s function at the point of its current output. This value is stored in the **_errors** array as the error corresponding to the current **neuron**. The **error** variable, on the other hand, will contain the mean square error of all the **neurons** of the output layer, which is the value that we will consider as network error for the current sample.

From the last layer we will calculate the error of the previous layer, with another version of the **ComputeError** method, since in the rest of the layers the calculation of the error is done in a different way:

`private void ComputeError()`

{

for (int n = 0; n <_neurons.Length; n++)

{

double esum = 0;

for (int e = 0; e < _nextLayer.Errors.Length; e++)

{

esum += _nextLayer.Errors[e] * _nextLayer.GetWeight(e, n);

}

_errors[n] = esum * _neurons[n].Delta;

}

if (_prevLayer != null)

{

_prevLayer.ComputeError();

}

}

In the rest of layers, we don't know what the correct final value should be, so what we calculate is the sum of the product of the error of each **neuron** in the next layer by the weight that this **neuron** has assigned to the output of the **neuron** whose error we are processing. Again, we multiply this error by the derivative of the **neuron** function at the point corresponding to the current output to calculate the final error of the **neuron** in the current layer. From here, the procedure is the same to calculate the errors of the rest of the layers, until arriving at the first, where the process stops.

With all the calculated errors, the next step is to calculate the value of the update for each one of the weights. This process begins in the first layer, the input layer, passing the current sample data in the inputs parameter of the **ComputeUpdates** method of the **Layer** class:

`public void ComputeUpdates(double[] inputs, double learningrate)`

{

for(int n = 0; n < _neurons.Length; n++)

{

for (int w = 0; w < _neurons[n].InputCount; w++)

{

_neurons[n].SetWeightUpdate(w, learningrate *

(_momentum * _neurons[n].GetWeightUpdate(w) +

(1 - _momentum) * _errors[n] * inputs[w]));

}

_neurons[n].ThresholdUpdate = learningrate *

(_momentum * _neurons[n].ThresholdUpdate +

(1 - _momentum) * _errors[n]);

}

if (_nextLayer != null)

{

_nextLayer.ComputeUpdates(Outputs, learningrate);

}

}

We process each **neuron** in the layer, and for each **neuron**, the weight of each input. The parameter **_momentum** can take any value between 0 and 1, so that we distribute the updating amount between two factors: on the one hand, the value of the last update that we have calculated for this weight (*(_momentum * _neurons [n] .GetWeightUpdate (w)*) and, on the other hand, the error of the **neuron** multiplied by the value of the corresponding input (*(1 - _momentum) * _error [n] * inputs[w]*). The value of **_momentum** determines whether we take into account, and to what extent, the past updates

With the **learningrate** parameter we make this update more or less slow, to finely adjust the accuracy and speed of learning. The calculation continues on the next layer, this time with the outputs of the current layer as input values.

Finally, the changes are applied to all the neurons, also starting from the first layer, with the **ApplyUpdates** function:

`public void ApplyUpdates()`

{

for(int n = 0; n < _neurons.Length; n++)

{

_neurons[n].ApplyUpdates();

}

if (_nextLayer != null)

{

_nextLayer.ApplyUpdates();

}

}

## The BackPropagationDemo application

As a demonstration of the use of this algorithm, I have prepared the **BackPropagationDemo** application, with the following user interface:

In the toolbar, from left to right, there are the following command buttons:

**Open file**: allows loading data in csv format or neural networks already trained by the program.**Save**: Allows the current network to be saved in a binary file.**Build Network**: Builds a new appropriate network for current data.**Train**: Train the network with the current parameters and current data.**Stop Training**: stop the learning process before it ends.**Process Data**: it allows processing the loaded data from a file and shows the results.

The first tab, **Network**, allows you to enter the parameters of the network. Always indicate an input layer and an output layer, each one with the number of **neurons** in them, but bearing in mind that we will reserve as many values at the end of each row of data loaded as **neurons** we have selected for the output layer. The rest of the values will be used as inputs. If we want to add hidden layers, we can do it by providing a comma separated list of values.

Training parameters are provided on the **Training** tab. The maximum number of iterations, indicates how many times the network has to process the complete set of data used for training; the parameters **learningrate** and **momentum** have the meaning indicated above; **Maximum error** provides a minimum acceptable error level to stop the training process before finishing all iterations. Finally, we indicate, for the data set that we are going to load in a csv file, the percentage of samples that will be used for the training. This value must be indicated before loading the data.

While training the network, we can continue interacting with the program, seeing a graph with the variation of the network error and the iterations made:

Finally, once the network is trained, we can check the result by processing the data that has not been used for the training:

This graphic is not a very orthodox one. In the upper part are the quantities of samples whose expected value has a difference with the value provided of the network lower than n / 10 parts of the maximum error committed by the network. In the capture, for example, there are 2474 samples with an error less than 1/10 of the maximum error.

The graph of the lower part represents the differences for each sample between the actual and calculated values, with an arrow that starts from the actual value and ends, as an arrow, in the value calculated by the network. These graphs only make sense for networks with a single output **neuron**.

I have added two sets of data that can be used to perform tests, **iris.csv**, obtained from the R application, with 4 data values and a result value for three lily species, and **postures.csv**, with 20 data values and one of result, which represent values of two different positions obtained with the Microsoft **Kinect** sensor, according to the normalization of the posture that I provided in these articles.

All the data loaded in the program is scaled within the interval (-1, 1), regardless of its value.