Batch Normalization

Today, let's step inside a deep neural network to understand what Batch Normalization actually is, what problems we can face if Batch Normalization is not used, and the advantages of using it in our model.


In machine learning we routinely apply feature scaling techniques such as standardization or normalization to our features, so that all features share a uniform range and the model has no built-in bias towards a particular feature or set of features.

Similarly, in neural networks we normalize the input by centring its mean to zero and scaling its variance to unity, which is also known as "whitening".
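As a quick illustration, here is a minimal NumPy sketch of that standardization over a made-up feature matrix X (one column per feature):

import numpy as np

X = np.array([[150.0, 0.2],
              [160.0, 0.5],
              [170.0, 0.9]])                      # made-up data: 3 samples, 2 features

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)     # zero mean, unit variance per feature
print(X_norm.mean(axis=0), X_norm.std(axis=0))    # ~[0, 0] and [1, 1]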

Okay, so as of now we have normalized our inputs to zero mean and unit variance. Sounds good. But what about the values deep inside the network? Will they follow the same distribution as the inputs?

Let's find out..!!!
Here we are dealing with deep neural networks, which have a lot of hidden layers inside them.

Data is fed to the model in mini-batches, which gives an estimate of the gradient of the loss over the whole training set and also makes the computation easier and more efficient. But one thing worth pondering here is that even a slight change in a model parameter in the initial layers may cause very large changes in the later layers.

Allow me to give you an analogy far removed from neural networks. I hope everyone is aware of the 1 percent rule described by James Clear in his book Atomic Habits.
It states that if you can get 1 percent better each day for one year, you'll end up thirty-seven times better by the time you're done.
Extremely sorry for taking you off topic, but what I want to convey with this analogy is that small changes in the initial layers can cause a large change at the later layers; below is the naive proof for it.
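Assuming, naively, that the 1% change simply compounds multiplicatively from one layer to the next:

(1 + 0.01)^9 ≈ 1.094,  whereas  (1 + 0)^9 = 1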

As you can see, even a 1% change in a parameter in the initial layer compounds, so that by the end of the l-th layer, say the 9th layer, the total change is roughly 9.4% of the initial value, which is a large change; in contrast, if there is no change at the initial layer, the value stays the same even at the end of the l-th layer. Now that this is clear, let's move forward.

Let's take a look at how a single node takes its inputs and produces an output.
In the example below, the input is x = <x1, x2, x3> and the corresponding weights are w = <w1, w2, w3>. The output of the node before applying the activation function is
z = w*x + b, and the activation function used here is ReLU, which is defined as ReLU(z) = max(0, z).

By using this ReLU activation function we introduce nonlinearity into the network.
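Here is that single node written out in NumPy, a minimal sketch with made-up numbers for x, w and b:

import numpy as np

def relu(z):
    return np.maximum(0, z)            # ReLU(z) = max(0, z)

x = np.array([1.0, -2.0, 3.0])         # inputs  <x1, x2, x3>
w = np.array([0.5, 0.1, -0.3])         # weights <w1, w2, w3>
b = 0.2                                # bias

z = np.dot(w, x) + b                   # z = w*x + b  ->  -0.6 + 0.2 = -0.4
a = relu(z)                            # node output  ->  max(0, -0.4) = 0.0
print(z, a)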
While moving deeper into the network, a lot of computation takes place: the outputs of one node become the inputs to the next, and this goes on until the end of the network. As a result, the distribution of the network activations at each layer can change drastically during training. This slows down the training process and is known as Internal Covariate Shift, defined as the change in the distribution of network activations due to the change in network parameters during training.

To accelerate the training process, we have to ensure that the distribution of the nonlinearity's inputs remains more stable as the network trains.
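To see this shift in a toy setting, here is a small sketch with an entirely made-up two-layer network: when the first layer's weights change, the distribution of inputs reaching the second layer changes too, even though the second layer's own weights stay fixed:

import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(1000, 10))             # a batch of inputs
W1 = rng.normal(scale=0.5, size=(10, 10))    # first-layer weights
W2 = rng.normal(scale=0.5, size=(10, 10))    # second-layer weights

h1    = np.maximum(0, x @ W1)                # first hidden activations (ReLU)
h2_in = h1 @ W2                              # pre-activations fed to the second layer
print("before update: mean %.3f, std %.3f" % (h2_in.mean(), h2_in.std()))

W1 += rng.normal(scale=0.05, size=W1.shape)  # simulate a parameter update in layer 1 only
h1    = np.maximum(0, x @ W1)
h2_in = h1 @ W2
print("after update:  mean %.3f, std %.3f" % (h2_in.mean(), h2_in.std()))
# The second layer now sees a shifted input distribution: internal covariate shift.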


Now that we have the terminology clear, let's have some easy talk. We know that whitening the inputs of the model accelerates training, but what if we extend the same idea to every layer further down (or to layers at some interval), so that we maintain the same distribution throughout the network?
Yeahhhh...!!! It is indeed a great idea, and this is exactly what Batch Normalization does.






Let's say the input to a Batch Normalization (BN) layer is a mini-batch B with m entries in it, B = {x1, x2, ..., xm}, and its output is y.
The series of events taking place inside this BN layer is: first find the mini-batch mean, then compute the mini-batch variance, then use those values to normalize each entry, and finally scale and shift the normalized value with a scaling factor gamma and a shifting factor beta. During normalization, we add a small epsilon to the denominator to avoid a division-by-zero error, just in case the variance becomes zero.
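Written out (these are the steps of Algorithm 1 in the Ioffe and Szegedy paper), for a mini-batch B = {x1, ..., xm}:

mean_B  = (1/m) * sum_{i=1..m} x_i                          (mini-batch mean)
var_B   = (1/m) * sum_{i=1..m} (x_i - mean_B)^2             (mini-batch variance)
x_hat_i = (x_i - mean_B) / sqrt(var_B + epsilon)            (normalize)
y_i     = gamma * x_hat_i + beta = BN_{gamma,beta}(x_i)     (scale and shift)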





Wait, what..??? You might ask where these factors gamma and beta came from.
It is not necessary that we always want a zero-mean, unit-variance distribution at every layer. For instance, if we simply normalize the inputs of a sigmoid, we would constrain them to a narrow, concentrated region (the nearly linear part of the sigmoid), which can limit what the layer represents and decelerate learning.

Therefore, we allow the algorithm to find suitable gamma and beta values. Note that gamma and beta are parameters that are learnt during training; in fact, if it turned out to be optimal, the network could learn gamma = sqrt(var_B) and beta = mean_B and thereby recover the original activations.
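Putting the whole transform together, here is a minimal NumPy sketch of the BN forward pass during training, assuming a 2-D input of shape (batch_size, features). It is an illustration rather than the exact code of any framework (frameworks additionally keep running averages of the mean and variance to use at inference time):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, features); gamma and beta are learnable, one per feature.
    mean_B = x.mean(axis=0)                       # mini-batch mean
    var_B  = x.var(axis=0)                        # mini-batch variance
    x_hat  = (x - mean_B) / np.sqrt(var_B + eps)  # normalize
    return gamma * x_hat + beta                   # scale and shift

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # a mini-batch of 32 examples, 4 features
gamma = np.ones(4)                                # typically initialised to 1
beta  = np.zeros(4)                               # typically initialised to 0
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0))                             # ~0 for each feature
print(y.std(axis=0))                              # ~1 for each feature (since gamma=1, beta=0)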


The BN transform, which scales and shifts the normalized value, is a differentiable transformation that introduces normalized activations into the network; being differentiable, gradients can flow through it during backpropagation, so it trains along with the rest of the network.

It should also be noted that Batch Normalization has a slight regularization effect on the network as well.

Advantages of Batch Normalization:
  • Speeds up the training process.
  • Allows a higher learning rate, giving faster convergence.
  • Makes the deeper layers of the network more robust to small changes in the weights of the initial layers.
  • Has a slight regularization effect.


References:
  • Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".


