# A beginner's guide to neural networks: the intuitions and mathematics behind them, from absolute beginner to advanced

By Anindyadeep Sannigrahi

— — — — — — — — — — — — —

This is a very detailed tutorial on neural networks.

**Contents**:

- **An intuition behind deep learning**
- **What is deep learning (DL)**
- **Different types of neural networks**
- **Some disclaimers and notes**
- **Neurons or perceptrons**
- **Basic maths in a neuron**
- **The feed forward neural network (stage 1)**
- **The feed forward neural network (stage 2)**
- **Back propagation in neural nets (level 1)**
- **Back propagation in neural nets (level 2)**
- **The total back propagation in a neural network (level 3)**
- **Discussion of some hyperparameter optimization techniques and gradient descents, with graphs**
- **Acknowledgements**

— — — — — — — — — — — — —

# An intuition behind the word "learning" and deep learning

Have you ever noticed how a baby learns to speak his or her mother tongue, very gradually? The journey starts with random sweet cries, but with age it slowly takes the shape of a rudimentary language with some funny pronunciations. And finally they get their chance to say… "Mamma, mamma… please, please buy me this toy, please… 😁😁😁"

Have you ever noticed that a baby, in its initial stages, can distinguish its mom and dad from other people, and then eventually recognize the objects it likes and dislikes…

Hey, but why am I talking about all these things, and what's their significance here?

Okay, let me tell you. Can you find a common thread in all the things I said above? It is what we call "learning". Learning is a relatively long and dynamic process that needs some prerequisites: data, some training, some validation, then more training, and finally repeated "practice" of the same thing until we become perfect at it, however small or big it is.

To be specific, when the baby sees that its needs are not being fulfilled by crying alone, it starts to mimic its parents' conversations (which is the start of the training). Now the baby tries to say or pronounce the same things, but nothing is "accurate" or clear in the first round. For example, pronouncing "fish" as "chij"… Mamma can't understand. So what the baby does is improve its accuracy by practicing the same set of words more and more, and every time it fails, it tries not to make the same mistakes again, and finally it builds up its accuracy and pronounces "fish" as "fish".

The same thing happens when classifying different objects, or when understanding a concept in maths by working through several examples and trying the exercises.

So who is responsible for this thing we call "learning"? The answer is our brain. And how is the brain doing all this? It's all due to that complex network of neurons and axons… Okay, enough biology 😏😏.

(above) A simple biological neuron…

(above) A complex neural network in our brain 😯😯

So the basic pathway of learning is…

So congratulations 🤩, you have got the basic intuition behind the "learning" of deep learning and the main steps behind it.

**What is a Neural Network?**

In one sentence: it is the most basic and elementary part of the field called deep learning, which in turn is one of the main building blocks of AI via machine learning.

Technically, the neurons (or perceptrons) are the elementary parts of a neural net. Each one performs the basic functions of both feed forward and back propagation, updating its weights and biases so that the network fits the training data well enough to predict or classify new data.

The collection of all these neurons makes a layer, and the layers together make a complete network of neurons.

# Different types of neural networks

There are different neural network architectures in use nowadays. Some of them are:

**Artificial neural networks (ANN)** — this is the most basic neural network architecture, and the dense layers it is built from appear in all the other networks.

**Convolutional neural networks (CNN)** — a more complex architecture, mainly used for image or face recognition, and computer vision problems in general. There are several CNN models to choose from, depending on the use case and the accuracy required.

**Recurrent neural networks (RNN)** — one of the most interesting neural networks, used for **sequential data** like speech synthesis, sentiment analysis, deep speech, and all the cool stuff that Google Assistant, Siri, or Alexa does for us.

( A general RNN model )

( A LSTM RNN model )

**Reinforcement learning**, or general AI — yeah, the king of the neural networks. This kind of learning does not require a large dataset to reach good accuracy. It is something like the **feedback system of our brain**: if you do a cool job in any field, you get appreciated, you get motivated, and you unknowingly become a master of that field; and vice versa, a punishment pushes you to practice more and become more accurate, so you don't repeat the mistake. Yeah, you've got it, this network works that way: a **reward policy given to an agent based on its surroundings**.

( control flow of a reinforcement learning model )

Now, there are several subdivisions of all of these. If you want, Google them to know more. Here we will focus on the mathematical foundations of the artificial neural network (ANN).

Enough intuition and stories, now it's mathematics time 😁😁😁

**Disclaimer**: I assume that you know some basic things about vectors and matrices, composite functions, gradient calculation, and the chain rule for gradients.

If not, it will be much easier to follow the equations once you get to know these concepts.

**NOTE: (for absolute beginners)**

Please don't get scared by the expressions used here. You just need to be patient, and I have provided the meaning of every expression.

**NOTE:**

The expressions and the maths are original and all handwritten, in my best handwriting. Kindly stay tuned.

Thank you

**Let's start:** 😁😁😁

# The neuron / perceptron:

The neuron is the most elementary part of a neural network, as defined before.

Have a look at it:

*( Fig : 1 )*

Above is the picture of a basic neuron of a neural network.

It mainly performs these functions:

- **Input times weights**
- **Add a bias**
- **Activate**

**Weights:**

basically some randomly initialized values, which are multiplied with the input in order to produce a value.

**Bias:**

also a randomly initialized value; mathematically it keeps the equations from collapsing to zero, and technically it biases the neuron to a certain extent so that it can do its work more effectively.

**Sigmoid:**

It is an activation function. Now, an activation function is a special kind of function which takes this value (i.e. input times weight, plus a bias) and squashes it into a probability-like output, generally in the range (-1, 1) or (0, 1).

Sigmoid does this in the range (0, 1).

*( Fig: 2 )*

Above is the graph of a sigmoid function.
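To make this concrete, here is a minimal sketch in Python (the function name is mine, not from the article) showing how the sigmoid squashes any real number into (0, 1):

```python
import math

def sigmoid(z):
    # squashes any real number into the open range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
```

Whatever the input, the output never leaves (0, 1), which is exactly the shape of the graph above.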

There are several other activation functions like :

- **Softmax**
- **ReLU**
- **Leaky ReLU**
- **tanh**

etc….

We will discuss all of these in detail when we get to CNNs and RNNs, because they are used there a lot…

So after calculating the sigmoid value we get a number, which is called the predicted value. But during training there is also a true value. (Just like the math exercises we do: our attempt, and the real answer.)

After seeing this we would go on to calculate the loss and then do the back propagation. But wait, let us first discuss this forward connection in more detail.

# The feed forward network (level 1)

So this is the first level, to be understood step by step…

*( Fig : 3 )*

This is simply a two-neuron ANN with four inputs, as shown above. Now have a look at this:

*( Fig : 4 )*

So here comes the calculation part. What we have done is:

- **Input times weight**
- **Add a bias**
- **Summation**
- **Activate**

For each hidden neuron we do the same, so we get two values: one for ( h₀ ) and the other for ( h₁ ).

Now, for the second step, in the calculation of the predicted value, we treat the two hidden neurons as the inputs, weight them, and do a similar job to what we did just before.

We have used "**f**" as the activation function because it can be any of them, i.e. sigmoid, softmax, ReLU, tanh, etc. Generally, sigmoid or softmax is used for the final computation.

So congratulations 🤩 you have cleared level 1.

Okay, now before going to level 2, we will look at all these calculations from a new perspective, i.e. in the form of **matrices and vectors.**

As shown in the fig:

*( Fig : 5 )*

Now understand the meaning of the symbols used here:

**W₀₀ means : the weight on h₀ due to the input x₀**

**W₀₁ means : the weight on h₁ due to input x₀**

So now what we have to do is treat the inputs, weights, and biases as a set of vectors and matrices.

Generally the input is treated as a column vector, but we are treating it as a row vector for ease of calculation and understanding; in the end it gives us the same values.

So let the input be a row vector:

[ x₀ x₁ x₂ x₃]

and similarly we get the matrix of the weights as :

*( Fig : 6 )*

Now the main thing to wonder about is: what is the dimension of a weight matrix?

It's simple. All you have to do is this:

If you want to know the number of weights, i.e. the dimension of the weight matrix of, say, h₂, then the number of rows will be the number of neurons in h₁ and the number of columns will be the number of neurons in h₂.

So the dimension of the weight matrix of h₂ will be

**[ (no. of neurons)ₕ₁ × (no. of neurons)ₕ₂ ]**

Which has been shown in the fig above also.

*( Fig : 7,8 ) ( from up to down )*

If you are familiar with matrix multiplication, then I am pretty sure you can understand this. We also represent H as a vector: [h₀ h₁].

So we can write it as a general equation:

**H = σ(X̅W + B)**

This is the general equation of any feed forward step.

In several sources the equation may appear in the form:

**H = σ(WX̅ + B)**

But it all depends on how the matrix and vector dimensions are set up, i.e. there X would be a column vector…
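As a quick sanity check of the row-vector convention, here is a NumPy sketch of H = σ(X̅W + B) for the four-input, two-hidden-neuron network above (all the numbers here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, -1.0, 0.25, 2.0]])  # input as a row vector, shape (1, 4)
W = np.random.randn(4, 2) * 0.1         # randomly initialised weights, shape (4, 2)
B = np.zeros((1, 2))                    # biases, shape (1, 2)

H = sigmoid(X @ W + B)                  # H = sigma(XW + B)
print(H.shape)                          # (1, 2): one value per hidden neuron
```

The (1, 2) result is exactly the row vector [h₀ h₁] from the figure, and every entry lies in (0, 1) because of the sigmoid.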

Hey, finally level 1 is complete… congratulations 🤩

# Feed forward (level 2):

Take a look at this :

*( Fig : 9 )*

It’s what we call a general feed forward neural network architecture… having :

- input vector X = [ x₀ x₁ …. xₙ ]
- Hidden layers H₁ H₂ …. Hᵣ
- output layer Y =[y₁ y₂ …. yₙ]

So you have been given all these things; now it's time for the calculations.

Get ready…

*( Fig : 10 )*

It's a simplified version of the first image, where I have shown the weight symbols and the dimensions of the weight matrices in [ ].

I will introduce the meaning of the symbols with examples, for better understanding…

**Meaning of symbols (imp)**

- ( H₁ H₂ H₃ …. Hᵣ ) are hidden layers, not vectors; just layers.
- For example, the layer H₁ has a total of ' a ' neurons, and each neuron is written in the form:
**( hᵢʲ ), where i represents the index of the neuron and j represents the layer that the neuron sits on. For example, the neuron h₀¹ is the first neuron, h₀, lying on the layer H₁.**
- The weights follow a similar convention.
**The weights are written as Wᴷᵢⱼ, where K = < Hᵢ > is the layer the weight leaves from, and i, j are the indices of the weight: i is the index of the neuron in the previous layer and j is the index of the neuron in the present layer.** For example, the weight W¹₀₀ belongs to the layer H₁ and goes from h₀ of H₁ to h₀ of H₂. Another example: **the weight Wˣ₀₁ belongs to the input layer X and goes from x₀ of X to h₁ of H₁.**
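Under these conventions, each pair of consecutive layers gets its own weight matrix, whose shape follows the dimension rule from level 1 (rows = neurons in the previous layer, columns = neurons in the present one). A small sketch with invented layer widths:

```python
import numpy as np

sizes = [4, 3, 5, 2]  # made-up widths: input X, H1, H2, output

# rows = neurons in the previous layer, columns = neurons in the present one
weights = [np.random.randn(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

for k, Wk in enumerate(weights):
    print(f"weight matrix {k}: shape {Wk.shape}")
```

Reading it off: the matrix between X (4 neurons) and H₁ (3 neurons) is 4×3, and so on down the network.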

*( Fig : 11 )*

So now I guess you have got how to read the weight symbols, as all the others follow the same convention.

Biases are not given much importance here, for the simplicity of the upcoming equations.

Now it's time for the calculation part; take a look here:

*( Fig : 12 )*

The equations I am using here are the same as the ones I used before, but the number of variables is larger.

The matrices of the weights, the biases, and the input vectors are as shown above…

So watch out here…

*( Fig : 13 )*

Are you seeing the similarities with the previous equations? Yes? If so, then have a look here…

*( Fig : 14 )*

So I guess you have got it up to here, so it should also have a general equation… yeah, so we can write any feed forward step as:

*( Fig : 15 )*

But why am I writing this? Are there any test cases which make it totally valid? Yes, so here they come…

*( Fig : 16 )*

So except for the first, trivial case, we can generalize it…

So whoooo hoooo, congrats 🎊 you have made it 😁😁 you have cleared the second level too, and the feed forward calculations come to an end…
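The whole feed forward pass above boils down to one loop: Hᵣ₊₁ = σ(HᵣWᵣ + Bᵣ), starting from H₀ = X. A minimal sketch (layer sizes and function names are mine, chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(X, weights, biases):
    # H_{r+1} = sigma(H_r @ W_r + B_r), starting from H_0 = X
    H = X
    for W, B in zip(weights, biases):
        H = sigmoid(H @ W + B)
    return H

rng = np.random.default_rng(0)
sizes = [4, 3, 3, 1]  # made-up widths: input, two hidden layers, one output
weights = [rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, b)) for b in sizes[1:]]

Y = feed_forward(np.ones((1, 4)), weights, biases)
print(Y.shape)  # (1, 1): the prediction of the single output neuron
```

However many layers you stack, the loop body never changes; only the shapes of the matrices do.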

# The epochs

An epoch is nothing but a kind of cycle: one complete round of forward and backward propagation over the training data is called an epoch.

So far I have shown the forward propagation. Now it's time for the back propagation, shown later.

# The loss :

You are somewhat familiar with this thing… suppose you have taken your math test, and you got some answers correct and some wrong… a normal situation. Now, if I want to get your accuracy percentage, I first have to look at the difference between your answers and the correct ones, which is basically the loss. Yes, **loss**.

So the loss is present in the neural network too… you can see that at the end of the forward propagation we compute the sigmoid values and get some numbers. But those are the predicted values, not the true values, so to build up the accuracy, what we have to do is:

- **First calculate the loss.**
- **Back propagate through it.**

Take a look here at how to calculate the loss for a neuron…

*( Fig : 17, 18 )*

So yeah, this is what it is:

Loss: L = (1/n) Σ (yₚ − yₜ)²

which is the mean squared error (MSE) loss: the squared differences, averaged over the n outputs.
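A tiny sketch of this loss in code (the helper name and the numbers are mine):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # mean squared error: average of the squared differences
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])
print(mse_loss(y_pred, y_true))  # ≈ 0.03: the predictions are close
```

A perfect prediction would give a loss of exactly 0, and the worse the predictions get, the larger the number grows.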

Two very popular types of loss are:

- MSE
- Cross entropy or log loss

The loss and the cost function are very similar to each other; the cost function is denoted by ( C ).

Now we will slowly move to back propagation, levels 1, 2, and 3.

Level 1 deals with a one-neuron system.

Level 2 deals with two- and three-neuron systems.

Level 3 deals with the general back propagation.

Are you readyyy …. 😤😤😁

Let’s go….

# The back propagation (BP) ( level 1 )

We will start from the one-neuron system:

- calculate the forward propagation
- calculate the loss
- start back propagation by calculating the gradients of the loss w.r.t. the weights and biases, and update them with some optimizer.

(We will not talk about optimizers here, for the simplicity of the calculations.)

Soo take a look at here…

*( Fig : 19 ,20 )*

I guess you are familiar with the symbol Z; if not, it is the value before the sigmoid activation,

i.e. Z = WX̅ + B

And that’s it…

Back propagation is a relatively tough topic, but it becomes easy if you follow these simple steps:

- Always follow the arrows of BP, because they tell you which neuron each weight and bias depends on.
- See what the weight is connected to, and calculate the gradients accordingly, as shown in the next figure…

*( Fig : 21 )*

Note: just for calculation purposes we are supposing ( hᴴ = yₚ ), because it's not the same every time… especially in recurrent neural networks… that will be discussed in later stories…

If you look at fig 20 and follow the arrows, then here are the steps you have to follow in order to find the derivative of the loss ( C ) w.r.t. the weight W:

- For any weight gradient, start from the end and see where the arrows point in a continuous chain.
- i.e. Wᵢⱼ is depended on like this: C -> h -> Z -> Wᵢⱼ
- So first calculate the derivative of C w.r.t. h, then the derivative of h w.r.t. Z, and then the derivative of Z w.r.t. Wᵢⱼ… as a continuous chain of partial derivatives.
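That three-step chain can be written out directly for a one-neuron system; a sketch with toy values (none of the numbers come from the figures):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# one-neuron forward pass (all numbers are made up)
x, w, b, y_true = 1.5, 0.4, 0.1, 1.0
z = w * x + b            # Z = WX + B
h = sigmoid(z)           # predicted value
C = (h - y_true) ** 2    # squared-error loss

# chain rule: dC/dW = dC/dh * dh/dZ * dZ/dW
dC_dh = 2 * (h - y_true)
dh_dz = h * (1 - h)      # derivative of the sigmoid at Z
dz_dw = x
dC_dw = dC_dh * dh_dz * dz_dw

# the bias uses the same chain, except dZ/dB = 1
dC_db = dC_dh * dh_dz
print(dC_dw, dC_db)
```

Each factor in the product is one arrow of the chain C -> h -> Z -> W, multiplied together from the loss end back to the weight.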

(I guess you know the chain rule; if not, please learn it, because it's the heart of back propagation.)

The same steps are followed to calculate the derivative of the loss ( C ) w.r.t. the bias.

Take a look to this…

*( Fig : 22 )*

So if we can calculate the gradients w.r.t. the weight W and the bias B, then why not for the hidden neuron ( h )? …… 🤔🤔🤔 … but why?

In one word, it's because… during complex back propagation calculations, it is very useful to have this particular derivative available, as part of a continuous recursive chain of derivatives, which is shown later…

So first let us calculate the gradient of the loss ( C ) w.r.t. the hidden neuron ( h ).

Watch out here…

*( Fig : 23 )*

Now the only thing left is the updating of the weights and biases.

NOTE: you may wonder why I am not updating the hidden neuron h. The answer is simple: the values of the neurons depend on the weights and biases, so it is those that get updated; the gradients w.r.t. the neurons are just a tool for the recursive chain-rule calculation.

Here η is the learning rate, which is much less than 1 and is adjusted as part of hyperparameter tuning.

This whole back propagation is repeated for a given number of epochs, and the main target is to reach the global minimum.
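The update itself is one line per parameter: new value = old value − η × gradient. A sketch with made-up numbers:

```python
# one gradient-descent update per parameter (all values are made up)
eta = 0.01               # learning rate, much less than 1

w, b = 0.4, 0.1          # current weight and bias
dC_dw, dC_db = 0.35, 0.2 # gradients obtained from back propagation

w = w - eta * dC_dw      # W_new = W_old - eta * dC/dW
b = b - eta * dC_db      # B_new = B_old - eta * dC/dB
print(w, b)
```

Because η is small, each step nudges the parameters only slightly downhill, which is why many epochs are needed to reach the minimum.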

**But why a minimum, and why the global minimum?**

The minimum of a function is the smallest value it takes over a given domain.

A function may contain several minima and maxima, but the global minimum is the smallest of all the local minima.

Something like this…

*( Fig : 24 )*

Local and global minima of a general graph.

Heyyy, you have made it: level 1 of BP is complete…. 😀😀

🤯🤯 <- feeling something like this? If so, take a break and then continue; there's no hurry to finish….

Cool, cool

# Back propagation (level 2) (two-neuron system)

Now we will do the same things, literally the same things, but for a two-neuron system…

Ready !!! Here we go …

*( Fig : 25 )*

Above is a simple two-neuron system (we are not focusing on the symbols, so don't mix them up with the symbols used before; they are all independent of each other…).

So now have a look at this…

*( Fig : 26 )*

Some things should be observed carefully:

- See how the gradient squeezes down to a simple gradient, the gradient of the loss w.r.t. the hidden neuron, which is basically what I was talking about before…
- Now that we know the gradient of the loss w.r.t. the hidden neuron, we can easily work out the value of the total gradient.

So we come to the conclusions shown below:

*( Fig 27 )*

So we got the values of the gradients w.r.t. the weights of both the previous and the present neurons.

# One step up :

*( Fig : 28 )*

What's new here:

- We will now analyse a three-neuron system.
- The dependency of the weight of a present neuron on all the previous hidden neurons.

Now let's calculate the back propagation in this case…

- First come the feed forward calculations; a summation is required because of the multiple neurons.

*( Fig : 29 )*

( NOTE : the overwritten part is Hʳ⁺¹ )

Here j is just a constant showing the index of one neuron at a time.

An example is shown:

*( Fig : 30 )*

The pre-sigmoid value of h₀, i.e. Z₀, is shown here.

So, to be very general, we can draw the following conclusions… as shown here….

*( Fig : 31 )*

Now the back propagation in this case…

First take a look at this, to see the dependencies in the back propagation….

*( Fig : 32 )*

So for the back propagation, what we really need first is to calculate the loss. Now, since the loss is a single value, what we do is take the summation of the squared differences between the true and the predicted values and then compute the average, as shown above in the picture…

*( Fig : 33 )*

Here we are computing the gradient of the cost function (the loss) w.r.t. a particular weight, which is generalized in the form

Wₖⱼ, where k is a variable and j is a constant giving the index of a particular weight.

I have shown a numerical example of the calculation of the gradient of the loss w.r.t.

W₀₀ in the figure above…

So now just take a look at the updating of the weights, as shown:

*( Fig : 34 )*

Now it's time to do the same step we do every time: the calculation of the gradient w.r.t. a general hidden neuron, as shown here…

*( Fig : 35 )*

Can you spot something new here? We are basically adding up the gradients coming from the different neurons. But whyyy? 🤔🤔🤔 … the answer is served here….

*( Fig : 36 )*

Can you see the answer? Yes, you can see here that the value of that particular neuron ( hⱼᴴʳ ) feeds into

[ h₀ᴴʳ⁺¹ , h₁ᴴʳ⁺¹ , …….. , hₙᴴʳ⁺¹ ]

( NOTE: the above [ ] is not a vector, it's just a group representation )

And hence we compute the gradient of the loss w.r.t. all the adjacent neurons to which it is connected, and hence the result shown here, below:

*( Fig : 37 )*

Hey man 😀 you have made it ,👍👍👍

High five 🙌🙌🙌

And just keep in mind that all the same steps are also done for the gradients of the loss w.r.t. the biases too.

# The back propagation (level 3) (general back propagation)

Now the last, bonus level 3, looking at the general back propagation for any particular neuron and its weights….

Take a look here….

*( Fig : 38 )*

So this is a general neural network, of which I have shown you a portion with three hidden neurons….

So up to here, I guess, you have pretty much mastered the feed forward connections.

So here we will jump directly to the back propagation… have a look below….

*( Fig : 39 )*

As told before, one neuron depends on several others….

And that's why the equation surrounded by the big brackets squeezes down into the one with the small brackets in the next equation, as shown in fig 39.

So, to be very specific, the generalization of back propagation for a general neural network would be something like this, as shown below….

*( Fig : 40)*

Take your time to understand this…. but that's all there is to the gradients. And we can now update all these weights as done before….

There is more to the generalization of back propagation, like taking all these gradients and converting them into vector outer products, but as this is a beginner-friendly blog we can end here…… sooooo yeahhhhhhh we have made it…. Level 3 is complete too. A big congratulations 🤩🙂🙂😁☺️.

# Problems in back propagation and some techniques to fix them

While seeing all those equations, did this happen to you at any point (if you are not familiar with them): 🤯🤯🤯🤯, something like this?

Hmm…. by the way, our computer bro also goes through all these problems. Technically, what we have discussed so far is not an optimized algorithm…. So what to do? 🤔🤔🤔

Let me introduce you to the problems, and to some techniques and optimizers for them.

# Problems :

I will discuss some of the common types of problems often faced in deep learning, or in machine learning in general. Some of them are:

- Vanishing gradients
- Exploding gradients

So let's start with the vanishing gradient problem. You have seen the long chains of partial differentiation and the long chains of recursive differentiation. Because of these long chains of multiplied values, the updates of the older weights are badly affected: sometimes the old and the new values of a weight come out nearly the same, so no significant update actually happens, and it is often seen that even after a large number of epochs the accuracy does not improve much and the global minimum is never actually reached (to be specific).

*( Fig : 41)*

So in the figure above, look at the weight update equation. The graph above is one of the simplest ways to visualize this equation: you can see that the point P is slowly, slowly guided toward the point O, aiming to reach the global minimum. In the vanishing gradient problem, the main thing to note is that the point takes a very long time to reach that global minimum.

Below, a simplified mathematical explanation is given in order to understand this properly. So here we go…

*( Fig : 42)*

*( Fig : 43)*

In fig 42 we can see that we get the derivative of the sigmoid activation function, whose values lie in the range (0, 0.25], as shown in fig 43.

So the range of values of the derivative of the sigmoid activation is much smaller than the range of values of the sigmoid activation itself. Further, the learning rate is generally much less than 1 (e.g. 0.001).

So during the weight update, the product of the gradient of the weight and the learning rate gives a very small value, and when that value is subtracted from the old value of the weight, the new value of the weight comes out approximately equal to it. To understand it properly I have worked a friendly example, shown below:

*( Fig : 41 )*

*( Fig : 42 )*

So, as you can see here, the old and the new values of the weights are nearly equal, so at the next epoch the values will be even closer to each other, and thus the weight will never reach the global minimum, as shown in the figure above…

*( Fig : 43)*

So this is one of the most common problems in this field.
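The shrinking effect is easy to demonstrate numerically. A sketch (the layer count and input are invented) of what happens when a gradient picks up one sigmoid-derivative factor per layer:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)  # always in (0, 0.25]

# a chain of 10 layers multiplies in one sigmoid-derivative factor per layer
grad = 1.0
for _ in range(10):
    grad *= sigmoid_prime(0.0)  # 0.25, the largest value the derivative can take

print(grad)  # 0.25**10, under one in a million: eta * grad barely moves the weight
```

Even in this best case (every factor at its maximum of 0.25), ten layers shrink the gradient by a factor of about a million; real inputs make it smaller still.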

Another similar kind of problem, occurring for weights of large magnitude, is the exploding gradient…

From the name we can understand that an explosion in the magnitude of the weights occurs here during the weight update. So technically the point (the weight) jumps around the graph, never reaches the global minimum, and it becomes a total mess. You can visualize this with the figure given below:

*( Fig : 43 )*

Now I have a familiar and simple mathematical approach to show this, in the figures below:

*( Fig : 44 )*

I guess what I have done up to here is quite familiar by now, after practicing lots of similar chains of partial differentiation.

Now watch out the main part here…

*( Fig : 45)*

Now, after multiplying all these values, we get an absurd value, and that doesn't make any sense at all for the accuracy of our neural network.

*( Fig : 46)*

So what is the remedy here?

It's the second-to-last, and a bit big, topic…. Have patience, and here we go.

# Techniques to avoid these types of problems

I would like to show different charts covering the different approaches to avoid these problems in general, and the vanishing and exploding gradient problems in particular… So here it goes…

*(Fig: 47 )*

So what are the approaches ?

- **Dropout of the neurons**
- **Use of other activation functions like ReLU and leaky ReLU or tanh**
- **Stochastic Gradient Descent (SGD)**
- **Mini batch SGD**
- **SGD with momentum**

**Now we will discuss these methods one by one…**

# Dropout of the neurons

Dropout of the neurons means that, at each epoch, we randomly choose a subset of neurons out of the large number of neurons in a layer, and only that subset goes through the whole phase of training: feed forward, validation, and back propagation.

At the next epoch, the neurons that went through the training in the previous epoch change; they are shuffled again and randomly set up for the next round of training.

The randomization of the neurons is controlled by a dropout ratio, which determines how many neurons of the whole set in that particular layer are dropped and how many are not.

you can see the figures below for a very clear idea.

*(Fig: 48 )*

Here the crosses show that these neurons are not taken, which keeps down the space and time complexity of the epochs…

Here P is the dropout ratio; P = 0.5 means that half of the neurons in that very layer are dropped.
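A minimal sketch of this masking (the function name is mine; I also use the common "inverted dropout" trick of scaling the survivors by 1/(1 − P), which keeps the layer's expected activation unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p):
    # p is the dropout ratio: the fraction of neurons zeroed out this epoch.
    # Survivors are scaled by 1/(1 - p) ("inverted dropout") so the
    # expected value of the layer's activations stays the same.
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones((1, 8))       # activations of one hidden layer
print(dropout(h, p=0.5))  # roughly half the entries become 0
```

Each call draws a fresh random mask, which is exactly the reshuffling between epochs described above.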

Now I will show you the dropout functioning of the neurons for a three-layer network, as given below.

*( From up to down fig : 49, 50, 51)*

The figures above mainly show the propagation through a three-layer network at three epochs, with dropout working. You can see that the neurons are chosen randomly, and the whole process of feed forward and back propagation happens at that epoch…

Thanks to dropping out neurons, the network is much less likely to run into the **over-fitting** problem, one of the most common problems faced in machine learning and deep learning.

# Use of other activation functions like ReLU and leaky ReLU or tanh

We have discussed different types of activation functions in this blog, like sigmoid and softmax. But there are other activation functions which are sometimes very useful for overcoming the vanishing or exploding gradient problems. Some of these functions are:

- ReLU
- Leaky ReLU
- tanh

Below are some more about these functions…

*(Fig: 52)*

Above is a simple ReLU function.

Now, it's a very interesting function, because its value does not affect the gradient calculation during back propagation the way the sigmoid does, i.e. it does not lead to the vanishing or exploding gradient problems… Here we go…

*(Fig: 53)*

I guess now you get why I was talking about the ease of the gradient calculation before… the derivative always gives a binary result, i.e. either 0 or 1, which avoids complex calculations and reduces the space and time complexity of the algorithm drastically…
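A sketch of ReLU and its binary-valued derivative (helper names are mine):

```python
import numpy as np

def relu(z):
    # ReLU: 0 for negative inputs, the input itself otherwise
    return np.maximum(0.0, z)

def relu_prime(z):
    # the gradient is binary (0 or 1), so long chains of these
    # factors do not shrink toward zero the way sigmoid chains do
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.5 3. ]
print(relu_prime(z))  # [0. 0. 1. 1.]
```

Compare this with the sigmoid, whose derivative never exceeds 0.25: multiplying many ReLU-derivative factors of 1 leaves the gradient untouched.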

Another function used very much is the tanh function, which looks like this… Here it is…

*(Fig: 54)*

Here the tanh function is f(Z), and the derivative of the function is f'(Z).

We can work out the derivative of the tanh function like this…

*(Fig: 55)*

And that's how we get its graph…
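The derivative f'(Z) = 1 − tanh²(Z) is easy to check numerically (the function name is mine):

```python
import math

def tanh_prime(z):
    # f'(Z) = 1 - tanh(Z)**2: it peaks at 1 when Z = 0
    return 1.0 - math.tanh(z) ** 2

print(tanh_prime(0.0))  # 1.0
print(tanh_prime(3.0))  # near 0 for large |Z|
```

Note that the peak value is 1 rather than the sigmoid's 0.25, which is one reason tanh suffers less from vanishing gradients near Z = 0.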

**Stochastic Gradient Descent (SGD)**

If you remember, during the loss calculation the loss is a single value: we basically summed all the individual losses and averaged them in order to get one particular value, which is then used for the iterations of the back propagation. What we used at that time is the simple gradient descent method, something like this…

*(Fig: 56)*

Now, stochastic gradient descent (SGD) is another kind of method, where we don't take the summation of all the losses; rather we compute each individual loss on its own, back propagate with that loss value, and make the weight and bias updates accordingly.

Now I am showing the simple SGD in this figure…

*(Fig: 56)*

So you can see that ( Ls ) is the loss in the SGD method, and that's how the weight update process occurs accordingly. It helps to overcome the vanishing and exploding gradient problems to some extent, but it is not that optimized if we look at its time complexity…

**Mini batch SGD**

Now, in mini batch SGD everything is similar to GD but much more optimized, i.e. we still take the summation of the losses, but not over all of them; only over some of them…

It is elaborated more in the figure below.

*(Fig: 57)*

As you can see in the figure above, k < n, which means we are not taking every individual loss into consideration when calculating the total loss.

And the path of mini batch SGD toward the global minimum looks something like this…

*(Fig: 58)*

So now we can relate GD , SGD , mini batch SGD something like this…

*(Fig: 59)*

Yes, you are right, when:

. K = 1 it is SGD

. K = n it is GD (where n is the total number of individual losses)

. 1 < K < n it is mini-batch SGD
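The three cases above can be sketched as a single training loop where the batch size K picks the variant. This is a hypothetical toy example of my own (fitting y = w·x with squared loss; all names are assumptions, not from the original):

```python
import random

def train(xs, ys, k, lr=0.01, epochs=100):
    """Fit y = w*x with squared loss; k is the batch size:
    k == 1       -> SGD
    k == len(xs) -> plain gradient descent (GD)
    1 < k < len(xs) -> mini-batch SGD"""
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), k):
            batch = data[i:i + k]
            # average gradient of (w*x - y)^2 over the batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]   # true w = 3
print(round(train(xs, ys, k=1), 2))        # SGD       → 3.0
print(round(train(xs, ys, k=2), 2))        # mini-batch SGD → 3.0
print(round(train(xs, ys, k=len(xs)), 2))  # GD        → 3.0
```

All three reach the same answer here; they differ in how many examples each weight update looks at, which is exactly the K in the list above.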

Now if we observe and compare the movement of the points toward the global minimum in each case, the paths look something like this…

*(Fig: 60)*

You can see that the zig-zag path is followed by mini-batch SGD; technically we call this noise in the path, which is a bit annoying at computation time…

*(Fig: 61)*

So in order to overcome this problem there is another method, SGD with momentum, which is based on the concept of the exponential moving average… which goes something like this…

I am going to explain this moving-average concept in a beginner-friendly way, focusing mainly on the perspective of SGD with momentum

So here we go…

*(Fig: 62)*

I guess no explanation is required here, because we have just taken some data points over successive times… Now watch out here…

*(Fig: 63)*

Now we are using this predefined series at the corresponding time stamps, where each present data point weights the previous one by a factor gamma…

And this is how we define the general data point in the same way…

Now we will use this same concept for the weight updates too… So here we go…

*(Fig: 64)*

So how is it? Okay, but it is not that important for derivation purposes right now… just know these formulas so that you understand the origin of these concepts while coding in Python, giving you a stronger base to build more accurate models in the future…
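To see where these formulas land when you do code them up, here is a minimal sketch of SGD with momentum (the names `velocity` and `gamma` are my own; gamma is the exponential-moving-average factor, commonly set around 0.9):

```python
def momentum_step(w, velocity, grad, lr=0.1, gamma=0.9):
    # exponential moving average of past gradients:
    # v_t = gamma * v_(t-1) + lr * grad_t
    velocity = gamma * velocity + lr * grad
    # the weight update uses the smoothed velocity, not the raw gradient
    return w - velocity, velocity

# minimize f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=2 * w)
print(abs(w) < 1e-3)  # → True (converged near the minimum at w = 0)
```

Because the velocity averages over past gradients, the step direction changes more smoothly from one iteration to the next, which is exactly why the zig-zag noise shrinks.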

By applying this concept the noise in the graph is reduced hugely, which you can see here…

*(Fig: 65)*

See how it converges to the global minimum with less noise; the more trained and optimized it is, the less the noise and the higher the accuracy…

So we have completed all the major portions that build the foundation of deep learning and the mathematics behind it…

Okay, just a little more time is needed; one last topic is left, i.e. some other optimizers and the weight-initialization techniques that improve accuracy from the very beginning…

# Some weight initialization techniques with some special optimizers…

Some special types of optimizers that are used to maintain the accuracy of neural networks to a great extent are as follows:

. Adagrad optimizer

. Adadelta optimizer

. RMSprop optimizer

where Adadelta and RMSprop are very similar to each other…

I will just give a brief idea about both of them in short…

So here is the Adagrad optimization

*(Fig: 66)*

You can see that we are using a very small, insignificant constant epsilon under the square root; this is because…

*(Fig: 67)*

So this epsilon is basically used to avoid division by zero, which would otherwise make the update undefined…
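Here is a minimal sketch of the Adagrad update (the variable names are my own assumptions); notice that on the very first step the accumulated square is still 0, so without `eps` the division would blow up:

```python
import math

def adagrad_step(w, cache, grad, lr=0.5, eps=1e-8):
    # accumulate the square of every past gradient
    cache = cache + grad ** 2
    # the effective learning rate shrinks as the cache grows;
    # eps keeps the denominator nonzero when cache is still 0
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache

# minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3)
w, cache = 0.0, 0.0
for _ in range(500):
    w, cache = adagrad_step(w, cache, grad=2 * (w - 3))
print(round(w, 2))  # → 3.0
```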

But Adagrad is not a very optimized solution in terms of space and time complexity…

Sometimes it may also lead to a vanishing-gradient problem, since the effective learning rate keeps shrinking, as you can see from the graph…

*(Fig: 68)*

So two other very important techniques are Adadelta and RMSprop, which are quite similar to each other… Let’s see Adadelta

*(Fig: 69)*
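Since the figure sketches Adadelta, here is a hedged companion sketch of the closely related RMSprop (my own names; the key idea in both is replacing Adagrad’s ever-growing sum with an exponential moving average of squared gradients, so the effective learning rate no longer decays toward zero):

```python
import math

def rmsprop_step(w, avg_sq, grad, lr=0.01, beta=0.9, eps=1e-8):
    # exponential moving average of squared gradients
    # (instead of Adagrad's unbounded running sum)
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3)
w, avg_sq = 0.0, 0.0
for _ in range(2000):
    w, avg_sq = rmsprop_step(w, avg_sq, grad=2 * (w - 3))
print(abs(w - 3) < 0.1)  # → True
```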

So that is a short overview of these optimizers; you can research them more once you get familiar with the basics…

# Weight initialization techniques…

I will show these different techniques just by using a chart, and here we go…

*(Fig: 70)*

So the different techniques are…

**. Simple uniform weight distribution**

**. Xavier–Glorot distributions**

**— — — — . Xavier normal distribution**

**— — — — . Xavier uniform distribution**

**. He init distributions**

**— — — — . He uniform distribution**

**— — — — . He normal distribution**

Here also I will keep the discussion short. The main thing to remember is that, for the uniform and normal techniques respectively, the weights always lie within a uniform range or a normal (Gaussian) range whose bounds are scaled appropriately

So here we go with the first one…

**Simple uniform weight distribution:**

I will be directly showing you these weight-distribution techniques

*(Fig: 71)*

As I told you earlier, the weights lie within the uniform-distribution range shown in the given brackets

**NOTE: fan_in is the number of inputs to the layer, fan_out is the number of outputs**

Now it will be much clearer if I show you the examples below…

*(Fig: 72)*

*(Fig: 73)*

Similarly, He init works in the same way; just the numerical values are different.

Here, just take a look at it…

*(Fig: 74)*
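As a hedged sketch of the uniform variants of these two initializers (the limits follow the commonly used Glorot and He formulas; function and variable names are my own), in plain Python:

```python
import math
import random

def xavier_uniform(fan_in, fan_out):
    # Xavier/Glorot uniform: sample from [-limit, limit]
    # with limit = sqrt(6 / (fan_in + fan_out))
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return random.uniform(-limit, limit)

def he_uniform(fan_in):
    # He uniform: limit = sqrt(6 / fan_in), suited to ReLU layers
    limit = math.sqrt(6.0 / fan_in)
    return random.uniform(-limit, limit)

# initialize the weight matrix of a 4-input, 3-output layer
fan_in, fan_out = 4, 3
W = [[xavier_uniform(fan_in, fan_out) for _ in range(fan_in)]
     for _ in range(fan_out)]
limit = math.sqrt(6.0 / (fan_in + fan_out))
print(all(-limit <= w <= limit for row in W for w in row))  # → True
```

Notice that only the limit formula changes between the two; the sampling itself is the same uniform draw.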

So that is all about these weight-initialization techniques… I am not going deep here; that will depend entirely on how this post performs, as well as on your views and comments.

So yes, you have completed the beginner’s guide to the introduction to deep learning and the basic mathematics behind it. Congratulations…😁😁😁🎁🎁🎁

Please tell me about my first post, and I sincerely apologize for the ordinary quality of some of my photos… With your support I will try my level best to improve…

# Some Acknowledgements…

I sincerely acknowledge Krish Naik; he was my first teacher on YouTube in this field of deep learning.

I am also thankful to Siraj Raval for greatly growing my interest in this field.

Also my fellow YouTubers, for giving me motivation and updates.

I am also very thankful for the support of all my fellow friends.

Thank You….
