Machine learning on GPUs has become a major trend, delivering impressive results in recent years. As deep learning models grow more complex, GPUs have become all but indispensable for training them. In this article I cover an introduction to the GPU and its architecture, and how the nature of the GPU complements the machine learning / deep learning training process, making it an inevitable partner.
GPU – An Introduction:
A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Modern GPUs are very efficient at manipulating computer graphics and image processing, and their highly parallel structure makes them more efficient than general-purpose CPUs for algorithms where the processing of large blocks of data is done in parallel. (Ref: https://en.wikipedia.org/wiki/Graphics_processing_unit)
GPUs have thousands of smaller cores designed for handling massively parallel tasks, whereas CPUs have a few cores designed for optimal execution of general-purpose, serial tasks.
Here are the CPU and GPU specifications from my laptop for a quick comparison:
CPU Spec:
Intel64_Family_6_Model_94_-_Intel(R)_Core(TM)_i7-6820HQ_CPU_@_2.70GHz – 8 cores, 16 GB RAM
GPU Spec:
NVIDIA Quadro M1000M – 512 cores, 2 GB GDDR5 VRAM clocked at 1250 MHz
A general comparison of different GPU specifications is available at https://www.geforce.com/hardware/compare-buy-gpus. For example, the GeForce GTX 1080 GPU provides up to 2560 cores and 8 GB of VRAM.
Comparing CPU and GPU – High Level
You can notice a few ALUs on the CPU compared to hundreds of ALUs on the GPU; the CPU architecture, on the other hand, has a higher clock speed than the GPU.
CPU versus GPU Architecture – A Step Deeper:
You can also notice the bigger L2 cache and larger control modules, which help the CPU run complex sequential instructions but limit the number of threads it can run. The GPU, in contrast, has thousands of ALUs with a smaller L2 cache and smaller control modules; this allows the GPU to perform thousands of tasks in parallel across thousands of threads, but limits it when it comes to complex sequential instructions.
| | CPU | GPU |
|---|---|---|
| Memory | 6 – 64 GB | 768 MB – 6 GB |
| Memory Bandwidth | 24 – 32 GB/s | 100 – 200 GB/s |
| L2 Cache | 8 – 15 MB | 512 – 768 kB |
| L1 Cache | 256 – 512 kB | 16 – 48 kB |
Accessing instructions and data from the L1 and L2 caches helps the ALUs operate at high speed with reduced latency during execution. The CPU, designed with large L1/L2 caches, can load more of the instructions and data needed for execution, and hence has the ability to perform complex instructions. The GPU, with more ALUs and less L1/L2 cache, is limited in the instructions and data it can hold, and hence executes less complex tasks; it excels, however, when the same instruction runs on multiple ALUs with different data. We can call this Single Instruction Multiple Data (SIMD), or a massively parallel instruction set.
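To make the SIMD idea concrete, here is a minimal Python sketch (the array size and the multiply-add operation are illustrative choices of mine). NumPy applies one instruction across a whole array of data, which is the same pattern a GPU scales out to thousands of ALUs; GPU array libraries such as CuPy expose an almost identical interface.

```python
import time
import numpy as np

n = 10_000_000
data = np.random.rand(n)

# Scalar path: one element per step, the serial-execution analogue
start = time.perf_counter()
out_loop = [x * 2.0 + 1.0 for x in data]
loop_time = time.perf_counter() - start

# Vectorized path: the same multiply-add instruction applied to every
# element of the array at once -- the SIMD pattern described above
start = time.perf_counter()
out_vec = data * 2.0 + 1.0
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.2f}s, vectorized: {vec_time:.2f}s")
```

The vectorized path is typically orders of magnitude faster, even though both paths compute exactly the same result.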
Nature of Machine Learning model computations:
The training process in machine learning is generally time consuming on a CPU because it involves huge amounts of computation. The process is iterative in nature, with each iteration performing a similar computation repeated for every record in the input data set. This mostly matches the massively parallel instruction pattern described above.
For example, learning a simple linear regression model (y = mX + b) involves running gradient descent to optimize the parameters (m, b) for the input X and output y. This means computing the partial derivative of the loss with respect to each parameter (m, b) for each record of the input data X and averaging them over each epoch (iteration). Here the instruction set for a derivative calculation is the same but has to be repeated for every record, and this in turn is repeated for multiple epochs to arrive at the optimal parameters for the best fit of the model. With deep learning, such computations are performed for thousands of parameters in each layer and repeated across multiple layers of the architecture, so training can stretch to multiple days depending on the model. Problems of this nature can be brought down to hours by running them in a massively parallel way on a GPU.
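As a rough sketch of that loop, here is gradient descent for y = mX + b in plain NumPy (the synthetic data, learning rate, and epoch count are all illustrative values I chose):

```python
import numpy as np

# Synthetic data for illustration: true relationship y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=1000)
y = 3.0 * X + 2.0 + rng.normal(0.0, 1.0, size=1000)

m, b = 0.0, 0.0   # parameters to learn
lr = 0.01         # learning rate (illustrative)
epochs = 2000

for epoch in range(epochs):
    y_pred = m * X + b            # same instruction over every record
    error = y_pred - y
    # Partial derivatives of the mean squared error, averaged over the
    # whole data set -- one epoch of the computation described above
    dm = 2.0 * np.mean(error * X)
    db = 2.0 * np.mean(error)
    m -= lr * dm
    b -= lr * db

print(f"learned m = {m:.2f}, b = {b:.2f}")  # should approach 3 and 2
```

Every line inside the loop is an elementwise operation over all 1000 records at once, which is exactly the SIMD shape a GPU accelerates; frameworks such as TensorFlow and PyTorch run these same array operations across thousands of GPU cores.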
In future blogs I am going to show some deeper analysis of executions with specific machine learning / deep learning models, detailing the optimizations with a GPU.
Finally, a demonstration video of CPU versus GPU, in a funny way.