Choosing the Best Platform for Your Machine Learning Solution

Enterprise applications are increasingly adopting machine learning as a strategic capability, and running machine-learning-based deep analytics across multiple problem statements is becoming common. A variety of machine learning solutions, packages and platforms exist in the market. One of the first challenges a team faces is choosing the right platform or package for its solution.

Based on my experience with different machine learning solutions, I wrote this post to list the points (features, in machine learning terms) to consider when choosing an ML platform, along with the pros and cons of each solution in the market.

Let’s look at the feature set that can be weighted before deciding on an ML solution.

High-Level Feature	Feature Set	Comments
Data Storage	High Storage Volume	Ability to store large volumes of data and serve growing storage needs
Data Storage	High Availability	Data remains available during partial failures
Data Exploration	Visualizing summary tables and patterns in input data	Ability to find patterns in the input data; this helps in understanding and defining features
Data Preparation / Cleansing	Feature Extraction	Manipulating the raw data to extract the features needed for algorithm execution. This can be a time-consuming task with large volumes of data.
Data Preparation / Cleansing	Distributed Execution	Ability to perform data manipulation in a distributed way; required when the data volume is large and completion time must be reduced. Many ML solutions are working to add this capability.
Development	Supported Languages	Scripting languages supported for development
Development	Ease of Development	How easy is it to develop and execute scripts on the platform?
Development	General Purpose Programming	Beyond model creation and prediction, does the language support the application’s general-purpose programming needs?
Model	Algorithms Supported	Availability of different algorithm implementation packages on the platform. This is critical, as switching to a different product for each solution is not practical.
Model	Distributed Execution in Model Creation	Model creation is time-consuming and needs a lot of experimentation, so the ability to create models in a distributed way saves time and allows more experiments
Model	Deep Learning Support	Support for deep learning algorithms
Model	GPU Support	GPU execution can reduce execution time many times over
Model	Flexibility to Tune Model	How flexibly does the API expose the model parameters that can be tuned?
Model	Model Examination Flexibility	Ability to examine the model and dig into what is happening inside it
Model	Ease in Switching between Models	Ability to switch between models to find a suitable one
Data Visualization	Visualize and Plot the Results	Availability of different charts to visualize the output
Productionizing	Ease of deploying the model for production use in a web environment	Runs in large-scale deployments; model can be deployed on the web; scales to handle large volumes of data
Support	Official / Community Support with Active Development	Commercial support available for the platform / solution; active community development
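One way to act on the idea of weighting these features is a simple weighted scoring matrix. The sketch below is only an illustration: the feature names, the weights, the 1 to 5 ratings and the candidate names `platform_a` / `platform_b` are hypothetical placeholders, not assessments of any real product.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: how much each feature matters to the team.
WEIGHTS: Dict[str, float] = {
    "distributed_execution": 0.4,
    "algorithms_supported": 0.3,
    "ease_of_development": 0.2,
    "productionizing": 0.1,
}

# Hypothetical 1-5 ratings for two placeholder candidates.
RATINGS: Dict[str, Dict[str, float]] = {
    "platform_a": {
        "distributed_execution": 2,
        "algorithms_supported": 5,
        "ease_of_development": 5,
        "productionizing": 2,
    },
    "platform_b": {
        "distributed_execution": 5,
        "algorithms_supported": 3,
        "ease_of_development": 3,
        "productionizing": 4,
    },
}


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the ratings, normalised by the total weight."""
    total_weight = sum(weights.values())
    return sum(weights[f] * ratings[f] for f in weights) / total_weight


def rank_platforms(
    all_ratings: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (platform, score) pairs, best score first."""
    scores = {name: weighted_score(r, weights) for name, r in all_ratings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_platforms(RATINGS, WEIGHTS):
        print(f"{name}: {score:.2f}")
```

Changing the weights to reflect a different priority (for example, giving ease of development the largest weight) can change the ranking, which is the point of weighting the features before comparing platforms.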

Now let’s look at the different machine learning solutions / platforms available in the market and where they stand with respect to these feature requirements.

Solution	Language	Pros	Cons
RStudio	R	Thousands of packages for different solutions; easy to develop; deep model examination and tuning	Slow execution due to its single-threaded nature; not easy to productionize for a web environment
Spark ML	Scala, Python, R	Scalable machine learning library; distributed execution on platforms like YARN, Mesos etc.; faster execution; supports multiple languages such as Scala, Python and R	New to the market; does not have an exhaustive list of algorithm implementations; requires knowledge of the Hadoop ecosystem
H2O	Scala, Python, R	Easy integration with platforms like Spark (through Sparkling Water) and R; connects to data from HDFS, S3, NoSQL databases etc.	Compatibility issues between H2O and Spark with Sparkling Water; no Scala support in H2O notebooks
TensorFlow	Python, C++	Flexible architecture that can be deployed to run on CPU or GPU; effective utilization of the underlying hardware; strong in deep learning implementations	Comparatively steep learning curve; generally meant for neural-network-based implementations
MATLAB	MATLAB	Advanced toolboxes with a wide variety of algorithm implementations; algorithms can be packaged as Java or .NET components for deployment	Requires learning the MATLAB language; expensive product
Anaconda	Python	Good collection of algorithm implementations; easy to learn and develop; integration with PySpark; good for local usage and trials	Enterprise license cost; advanced features are licensed and expensive
Turi	Python	SFrame concept aims at distributed machine learning execution; can read and process data from HDFS, S3 etc.; simplified machine learning execution	Commercially licensed product
IBM Watson	PaaS for ML	PaaS platform for machine learning on IBM Bluemix; easy integration with social and cloud services; end-to-end solution development with limited knowledge; easy to deploy	Limited control over model creation and tuning; limited control over the underlying infrastructure
Azure ML	PaaS for ML	PaaS platform for machine learning on Microsoft Azure; workflow-based ML solution; easy to develop ML solutions on the Azure cloud	Limited control over model creation and tuning; limited control over the underlying infrastructure
AWS ML	PaaS for ML	PaaS platform for machine learning on AWS; easy to develop ML solutions on the AWS cloud	Limited control over model creation and tuning; limited control over the underlying infrastructure

To summarize

Packaged machine learning solutions such as RStudio, H2O, Anaconda and Turi are improving their connectivity to distributed storage platforms and adding distributed multi-thread / multi-core / multi-node execution to reduce the time spent on data preparation, feature extraction and model creation.
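Distributed execution helps with data preparation because many feature computations can be split into independent per-partition work followed by a cheap merge. Below is a minimal sketch of that pattern, computing the mean and standard deviation of a column (as needed for feature scaling). It is not how any of the products above is implemented: the function names are made up for illustration, and a local thread pool merely stands in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Sequence, Tuple

# Partial result for one partition: (count, sum, sum of squares).
Partial = Tuple[int, float, float]


def partial_stats(chunk: Sequence[float]) -> Partial:
    """Work done independently on each partition."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial results; order does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def split(values: Sequence[float], n_parts: int) -> List[Sequence[float]]:
    """Cut the data into at most n_parts contiguous partitions."""
    size = max(1, -(-len(values) // n_parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def distributed_mean_std(values: Sequence[float], n_parts: int = 4) -> Tuple[float, float]:
    """Mean and population standard deviation via partition-then-merge."""
    if not values or n_parts < 1:
        raise ValueError("need non-empty values and n_parts >= 1")
    chunks = split(values, n_parts)
    # Threads stand in for cluster workers in this sketch.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_stats, chunks))
    count, total, total_sq = reduce(merge, partials)
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, variance ** 0.5


if __name__ == "__main__":
    print(distributed_mean_std([1, 2, 3, 4, 5, 6, 7, 8], n_parts=3))
```

The sum-of-squares formula is chosen here because it merges trivially; it can lose precision on large values, so a production system would likely use a more numerically stable merge.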

Machine learning PaaS solutions such as IBM Watson, Azure ML and AWS ML, with the benefit of a cloud backing, abstract away the overhead of packaging and aim for easy deployment and scalability. However, these solutions limit how far the models and the exposed algorithms can be fine-tuned; in exchange, a user without knowledge of the algorithms should be able to run them.

With respect to cost and licensing, most of the packaged solutions are free to run on a local system with limited compute and storage; enterprise usage, or the distributed version of these solutions, comes at a cost. ML solutions on the cloud follow the pay-as-you-use cloud pricing and service model.

