GPU and Machine Learning

Machine learning on GPUs has become a clear trend, showing strong results in recent years. With increasingly complex deep learning models, the GPU has become almost unavoidable. In this article I cover a short introduction to the GPU and its architecture, and how the nature of the GPU complements the machine learning / deep learning model process to make it an indispensable partner.

GPU – An Introduction :

A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Modern GPUs are very efficient at manipulating computer graphics and image processing, and their highly parallel structure makes them more efficient than general-purpose CPUs for algorithms where the processing of large blocks of data is done in parallel. (Ref: https://en.wikipedia.org/wiki/Graphics_processing_unit)

A GPU has thousands of smaller cores designed for handling massively parallel tasks, whereas a CPU has a few larger cores designed for optimal execution of general-purpose serial tasks.

Here are the CPU and GPU specifications from my laptop for a quick comparison:

CPU Spec :

Intel64_Family_6_Model_94_-_Intel(R)_Core(TM)_i7-6820HQ_CPU_@_2.70GHz – 8 cores, 16 GB RAM

GPU Spec :

NVIDIA Quadro M1000M – 512 cores, 2 GB GDDR5 VRAM clocked at 1250 MHz

A general comparison of different GPU specifications is available at https://www.geforce.com/hardware/compare-buy-gpus. For example, the GeForce GTX 1080 provides up to 2560 cores and 8 GB of VRAM.

Comparing CPU and GPU – High Level

You can notice only a few ALUs on the CPU compared to hundreds of ALUs on the GPU, while the CPU architecture has a higher clock speed than the GPU.

CPU versus GPU Architecture – A step deeper :

You can also notice the bigger L2 cache and larger control modules, which help the CPU run complex sequential instructions but limit the number of threads. The GPU, in contrast, has thousands of ALUs with a smaller L2 cache and smaller control modules; this allows the GPU to perform thousands of tasks in parallel across thousands of threads, but limits it on complex sequential instructions.

                     CPU               GPU
Memory               6 -> 64 GB        768 MB -> 6 GB
Memory Bandwidth     24 -> 32 GB/s     100 -> 200 GB/s
L2 Cache             8 -> 15 MB        512 -> 768 kB
L1 Cache             256 -> 512 kB     16 -> 48 kB

Ref : http://supercomputingblog.com/cuda/cuda-memory-and-cache-architecture/

Accessing instructions and data from the L1 and L2 caches lets the ALUs operate at high speed with reduced latency during execution. The CPU, designed with large L1 / L2 caches, can hold more of the instructions and data needed for execution and hence can run complex instruction streams. The GPU, with many more ALUs and smaller L1 / L2 caches, is more limited in the instructions and data it can hold, so it is used for less complex tasks, typically executing the same instruction on many ALUs with different data. We call this Single Instruction Multiple Data (SIMD), or a massively parallel instruction set.
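As a rough, CPU-only illustration of the SIMD pattern (this is just R on the CPU, not actual GPU execution; the data and the multiply-and-add operation are made-up examples), the sketch below applies the same instruction to a whole vector of data at once instead of looping over one element at a time:

# Illustration of the SIMD idea: the same instruction applied to many data elements.
x <- runif(1e6)                      # one million input values

# Sequential style: one element per loop iteration
y_loop <- numeric(length(x))
for (i in seq_along(x)) {
  y_loop[i] <- 2 * x[i] + 1
}

# Vectorized style: the same multiply-and-add applied to all elements in one expression,
# conceptually what a GPU does across its many ALUs
y_vec <- 2 * x + 1

all.equal(y_loop, y_vec)             # TRUE – same result, different execution pattern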

Nature of Machine Learning model computations:

The training process in machine learning is generally time consuming on a CPU because it involves huge amounts of computation. The process is iterative in nature, with each iteration performing a similar computation repeated for every record in the input data set. This mostly falls into the massively parallel instruction pattern described above.

For example, fitting a simple linear regression model (y = mX + b) involves running gradient descent to optimize the parameters (m, b) for the input X and output y. This means computing the partial derivative with respect to each parameter (m, b) for every record of the input data X and averaging them in each epoch (iteration). The instruction set for a derivative calculation is the same but has to be repeated for many records, and this in turn is repeated over many epochs to arrive at the parameters that best fit the model. In deep learning, similar computations are performed for thousands of parameters in each layer and repeated across many layers of the architecture, so training can stretch to multiple days depending on the model. Problems of this nature can be reduced to hours when run in a massively parallel way on a GPU.
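To make the pattern concrete, here is a minimal R sketch of gradient descent for y = mX + b on synthetic data (the learning rate, epoch count and data are illustrative assumptions, not taken from any specific model). Note how the same gradient computation is repeated for every record in every epoch:

# Minimal gradient descent sketch for y = m*X + b on synthetic data.
set.seed(42)
X <- runif(1000, 0, 10)
y <- 3 * X + 5 + rnorm(1000, sd = 0.5)    # true m = 3, b = 5, plus noise

m <- 0; b <- 0                            # initial parameter values
lr <- 0.01                                # learning rate (illustrative)
epochs <- 5000                            # number of iterations (illustrative)

for (epoch in 1:epochs) {
  y_hat <- m * X + b                      # same prediction instruction for every record
  error <- y_hat - y
  grad_m <- mean(error * X)               # partial derivative w.r.t. m, averaged over records
  grad_b <- mean(error)                   # partial derivative w.r.t. b, averaged over records
  m <- m - lr * grad_m
  b <- b - lr * grad_b
}

c(m = m, b = b)                           # should approach the true values 3 and 5

The per-record work inside each epoch is identical for every row of X, which is exactly why such computations can be spread across thousands of GPU threads.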

In future blogs I will show deeper analysis of executions with specific machine learning / deep learning models, detailing the optimizations possible with a GPU.

Finally, a fun demonstration video of CPU versus GPU:

http://http.download.nvidia.com/nvision2008/jamie_adam/Art_Science_GPU_720p.mp4

 

 


Feature engineering tips for improving predictions..

One of the key factors data analysts focus on to improve the accuracy of a machine learning output is choosing the correct predictors that are fed into the system for prediction.

Including non-impacting features unnecessarily increases the complexity of the model; most algorithms suffer from the curse of dimensionality and start underperforming with reduced accuracy. It becomes crucial to include the appropriate predictors and to eliminate the non-impacting features for better results.

Feature engineering is an area of machine learning that focuses on choosing the correct features for the model being developed. Below are some of the questions that come to the mind of a data analyst, and statistics has a way of answering them.

  1. How do I make sure that a chosen feature (variable) has an impact on the prediction outcome?
  2. Will the accuracy of my model increase if I eliminate the non-impacting features?
  3. How do I understand the correlation between the features in the model and use it to improve accuracy?

Let's try to analyze these challenges and some options to address them. I have used the R language to showcase the solutions.

  1. How do I make sure that a chosen feature (variable) impacts the prediction outcome?

Let us take the Advertising sales dataset (from the ISLR book), with spend on TV, Radio and Newspaper media, and see whether these features impact sales.

We load the data and fit a linear model that predicts the sales value given the expense spent on TV, Radio and Newspaper marketing.

 

sales = read.csv(file = 'Advertising.csv', header = TRUE)    # load the Advertising dataset
fix(sales)                                                    # inspect the data frame

attach(sales)
lm.fit1 = lm(Sales ~ TV + Radio + Newspaper, data = sales)    # fit Sales against all three media

The p-value tests the null hypothesis that the coefficient is equal to zero. A low p-value (< 0.05) indicates that the feature makes a meaningful addition to the model, while a high p-value shows that the feature has little impact on the result. R has the summary command; feeding the model to it shows many details, including the p-value of each feature, as below:

summary(lm.fit1)

##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = sales)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.8277 -0.8908  0.2418  1.1893  2.8292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

The above output shows that the features TV and Radio have p-values less than 2e-16, while the Newspaper feature has a p-value of 0.86, which is very high. We can conclude that the Newspaper feature has the least impact on the prediction result.

  2. How do I make sure that my model accuracy increases when eliminating the non-impacting features?
  • We just found that the Newspaper feature has little impact on the prediction result. Let's exclude Newspaper, create the model again, and validate the statistics of the new model. The adjusted R-squared value indicates how well the training data fits the model. We can see the adjusted R-squared improved from 0.8956 to 0.8962 with the removal of Newspaper from the model.

lm.fit2 = lm(Sales~TV+Radio,data=sales)
summary(lm.fit2)

##
## Call:
## lm(formula = Sales ~ TV + Radio, data = sales)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.7977 -0.8752  0.2422  1.1708  2.8328
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## Radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

Having too many non-impacting features makes the model cluttered and more complex, and hurts the fit of the model. It is always recommended to eliminate the non-impacting features.

  3. How do features that are highly correlated impact the prediction?
  • Features that are highly correlated duplicate each other's impact and overly influence the result. It is recommended to identify highly correlated features and eliminate the duplication. A simple plot helps visualize the correlation between features, as below:

pairs(~TV+Radio+Newspaper,sales)


The Variance Inflation Factor (VIF) is a statistical measure of collinearity among the feature columns. The smallest possible value of VIF (close to 1) denotes a complete absence of collinearity, while a VIF exceeding 5 or 10 indicates the presence of collinearity.

fmsb::VIF(lm(TV ~ Newspaper, data = sales))

## [1] 1.003219

Based on the VIF value, we can conclude that the features TV and Newspaper are not correlated.
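As a side note, if the car package is available (an assumption; it is not used elsewhere in this post), its vif() function can score every predictor of the fitted model in one call rather than one pair at a time:

car::vif(lm.fit1)    # VIF for TV, Radio and Newspaper from the full model; values near 1 suggest little collinearity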

Generally it is easy to analyze each column for feature selection when the dataset has only a few features, but when the dataset is high-dimensional with hundreds of features it becomes impractical to analyze the combinatorial number of collinearity pairs.

Subset selection is an area of machine learning that defines best practices for feature selection on high-dimensional datasets. Some approaches used to eliminate correlated features are Principal Component Analysis (PCA), dimensionality reduction, forward selection, backward selection, etc. I will detail the subset-selection process for high-dimensional datasets in a separate post.
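As a small preview of that, the sketch below uses regsubsets() from the leaps package (assuming it is installed) to run forward selection on the same advertising data; the best subset of each size can then be compared by adjusted R-squared:

library(leaps)                                        # assumed installed: install.packages("leaps")
fwd <- regsubsets(Sales ~ TV + Radio + Newspaper, data = sales, method = "forward")
summary(fwd)                                          # shows which feature enters the model at each step
summary(fwd)$adjr2                                    # adjusted R-squared of the best model of each size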

Reference : The Elements of Statistical Learning


Choose your best platform for Machine Learning Solution

Enterprise applications are increasingly adopting machine learning as a strategic capability, and performing machine learning based analytics across multiple problem statements is becoming a common trend. A variety of machine learning solutions / packages / platforms exist in the market. One of the first challenges a team has to resolve is choosing the correct platform / package for its solution.

Based on my experience with different machine learning solutions, I wrote this blog to list the points (features, in machine learning terms) to consider while choosing an ML platform, along with the pros and cons of each of the solutions in the market.

Let's look at the feature set that can be weighed before deciding on an ML solution.

Data Storage
  • High Storage Volume Need – Ability to store huge volumes of data to serve growing storage needs
  • High Availability – High availability of data on partial failures

Data Exploration
  • Visualizing summary tables and patterns in input data – Ability to find patterns in the input data; helpful to understand and define features

Data Preparation / Cleansing
  • Feature Extraction – Manipulating the raw data to extract the features needed for algorithm execution; this can be a time-consuming task with huge volumes of data
  • Distributed Execution – Ability to perform the data manipulation in a distributed way; needed when you have huge volumes of data and want to reduce the time to complete. Many ML solutions are trying to bring this capability.

Development
  • Supported Languages – Scripting language support for development
  • Ease of Development – How easy is the platform for developing and executing scripts?
  • General Purpose Programming – Other than model creation and prediction, does the language support the general-purpose programming needs of the application?

Model
  • Algorithms Supported – Availability of different algorithm implementation packages on the platform; critical, as we cannot switch to different products for different solutions
  • Distributed Execution in Model Creation – Model creation is time consuming and needs a lot of experimentation; creating models in a distributed way saves time and enables experiments
  • Deep Learning Support – Support for deep learning algorithms
  • GPU Support – GPU execution support can reduce execution time many-fold
  • Flexibility to Tune Model – How flexible are the APIs in exposing the model parameters that can be tuned?
  • Model Examination Flexibility – Ability to examine the model and deep dive into what is happening behind it
  • Ease in Switching between Models – Switch between different models to find the suitable choice

Data Visualization
  • Visualize and Plot the Results – Availability of different charts to visualize the output

Productionizing
  • Ease of deploying the model in production on a web environment – Run in large-scale deployments; deploy the model on the web; scale to handle huge volumes of data

Support
  • Official / Community Support with Active Development – Commercial support availability for the platform / solution; active community development

Now let's look at the different machine learning solutions / platforms available in the market and where they stand with respect to these feature requirements.

RStudio (Language: R)
  • Pros – Thousands of packages for different solutions; easy to develop; deep model examination and tuning
  • Cons – Time-consuming execution due to its single-threaded nature; not easy to productionize for a web environment

Spark ML (Languages: Scala, Python, R)
  • Pros – Scalable machine learning library; distributed execution on platforms like YARN, Mesos, etc.; faster execution; supports multiple languages (Scala, Python, R)
  • Cons – New to the market; does not have an exhaustive list of algorithm implementations; requires knowledge of the Hadoop ecosystem

H2O (Languages: Scala, Python, R)
  • Pros – Easy integration with platforms like Spark (through Sparkling Water) and R; connects to data from HDFS, S3, NoSQL databases, etc.
  • Cons – Compatibility issues between H2O and Spark with Sparkling Water; no support for Scala in H2O notebooks

TensorFlow (Languages: Python, C++)
  • Pros – Flexible architecture that can be deployed to run on CPU / GPU; effective utilization of the underlying hardware; stronger in deep learning implementations
  • Cons – Comparatively steeper learning curve; generally meant for neural-network-based implementations

Matlab (Language: Matlab)
  • Pros – Advanced toolboxes with a wide variety of algorithm implementations; algorithms can be packaged as Java or .NET components for deployment
  • Cons – Learning the Matlab language; expensive product

Anaconda (Language: Python)
  • Pros – Good collection of algorithm implementations; easy to learn and develop; integration with PySpark; good for local usage and trials
  • Cons – Enterprise license cost; advanced features are licensed and expensive

Turi (Language: Python)
  • Pros – The SFrame concept aims for distributed machine learning execution; can read and process data from HDFS, S3, etc.; simplified machine learning execution
  • Cons – Commercially licensed product

IBM Watson (PaaS for ML)
  • Pros – PaaS platform for machine learning on IBM Bluemix; easy integration with social and cloud services; end-to-end solution development with limited knowledge; easy to deploy
  • Cons – Limited control over model creation and tuning; limited control over the underlying infrastructure

Azure ML (PaaS for ML)
  • Pros – PaaS platform for machine learning on Microsoft Azure; workflow-based ML solution on Azure; easy to develop ML solutions on the Azure cloud
  • Cons – Limited control over model creation and tuning; limited control over the underlying infrastructure

AWS ML (SaaS for ML)
  • Pros – Machine learning platform as a service on AWS; easy to develop ML solutions on the AWS cloud
  • Cons – Limited control over model creation and tuning; limited control over the underlying infrastructure

To summarize

Packaged machine learning solutions like RStudio, H2O, Anaconda and Turi are improving their ability to connect to distributed storage platforms and adding capabilities for distributed multi-thread / multi-core / multi-node execution, to reduce the time spent on data preparation, feature extraction and model creation.

Machine learning PaaS solutions like IBM Watson, Azure ML and AWS ML, with the benefit of their cloud background, abstract away the overhead of packaging and aim for easy deployment and scalability. However, these solutions limit how far the models and the exposed algorithms can be fine-tuned; on the other hand, someone without deep knowledge of the algorithms can still run them.

With respect to cost and licensing, most of the packaged solutions are free to run on a local system with limited compute and storage capabilities; enterprise usage, or the distributed versions of these solutions, come at a cost. ML solutions on the cloud work with a pay-as-you-use cloud pricing and service model.

Reference :

http://www.dataschool.io/python-or-r-for-data-science/

https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis#gs.CfYvf0A

https://www.continuum.io/blog/developer-blog/using-anaconda-pyspark-distributed-language-processing-hadoop-cluster

https://timchen1.gitbooks.io/graphlab/content/deployment/pipeline-dml.html
