What do you want to create? The important question!
You have come here to use Machine Learning (ML). Have you considered carefully what you want to use it for? When you pick a Machine Learning library, start with how you are going to use it. Even if you are only interested in learning, consider where Machine Learning is used and which use is closest to your main interest. Also consider whether you want to get something going on your local machine or spread your computing over many servers.
In the beginning, start by making something work.
Where Machine Learning is used
You can find many projects that use ML, so many that each category fills pages. The short version is ‘everywhere’; this is not quite true, but it makes one wonder. The obvious ones are recommendation engines, image recognition and spam detection. Since you are already programming in Python, you will also be interested in the Kite code completion software. Other uses are detecting errors from manual data entry, medical diagnosis, and maintenance for major factories and other industries.
The libraries in short:
- Scikit-learn: A ‘scikit’ (SciPy toolkit); routines and libraries built on top of NumPy, SciPy and Matplotlib. This library relies directly on Python's core mathematical libraries. You install scikit-learn with your regular Python package manager. Scikit-learn is small and does not support GPU calculations; this may put you off, but it is a conscious choice that keeps the package easy to get started with. It still works well in larger contexts, though to build a gigantic calculation cluster you need other packages.
- Scikit-image: Special for images! Scikit-image has algorithms for image analysis and manipulation. You can use it to repair damaged images as well as to manipulate colour and other attributes of an image. The main idea of this package is to make all images available to NumPy so that you can operate on them as ndarrays. This way the images are available as data for any algorithm you want to run.
- Shogun: A C++ code base with clear API interfaces to Python, Java, Scala and other languages. It is written in C++ for efficiency, and there is also a way to try it in the cloud. Shogun uses SWIG to interface with many programming languages, including Python. It covers most algorithms, which makes it good for experimenting, and it is used extensively within the academic world. The toolbox is available at https://www.shogun-toolbox.org.
- Spark MLlib: Runs on the Java virtual machine, but is available to Python developers through PySpark and interoperates with NumPy. Spark MLlib is an Apache project aimed at distributed computing environments, where it runs with a master and workers. You can run it in standalone mode, but the real power of Spark is the ability to distribute jobs over many machines. The distributed nature of Spark makes it popular with many big companies, like IBM, Amazon and Netflix. The main purpose is to mine “Big Data”, meaning all those breadcrumbs you leave behind when you surf and shop online. If you want to work with Machine Learning at that scale, Spark MLlib is a good place to start, and the algorithms it supports are spread over the full range. If you are starting a hobby project, it might not be the best idea.
- H2O: Aimed at business processes, so it supports predictions for recommendations and fraud prevention. The business behind it, H2O.ai, aims at finding and analysing data sets from distributed file systems. You can run it on most conventional operating systems, but the main purpose is to support cloud-based systems. It includes most statistical algorithms, so it can be used for most projects.
- Mahout: Made for distributed Machine Learning algorithms. It is an Apache project, in keeping with the distributed nature of the calculations. The idea behind Mahout is for mathematicians to implement their own algorithms. This is not for a beginner; if you are just learning, you are better off using something else. Having said that, Mahout can connect to many back-ends, so when you have created something, check whether you want to use Mahout as your front end.
- Cloudera Oryx: Mainly used for Machine Learning on real-time data. Oryx 2 is an architecture that splits the work into layers to create a system that can react to real-time data. The layers work in different time frames, with a batch layer that builds the basic model and a speed layer that modifies the model as new data comes in. Oryx is built on top of Apache Spark and provides an entire architecture that implements all parts of an application.
- Theano: Theano is a Python library that is integrated with NumPy. This is the closest to Python you can get. When you use Theano, you are advised to have gcc installed. The reason is that Theano can translate your code into C and compile it. While Python is great, in some cases C is faster, so compiling makes your program run faster. Optionally, you can add GPU support.
- TensorFlow: The tensor in the name points to a mathematical tensor, which generalises vectors and matrices to ‘n’ dimensions. In TensorFlow, a tensor is a multi-dimensional array. TensorFlow has algorithms for making calculations on tensors, hence the name, and you can call these from Python. The core is written in C++, but it has a front end for Python. This makes it easy to use and fast running. TensorFlow can run on CPU, GPU or distributed over networks; this is achieved by an execution engine that acts as a layer between your code and the processor.
- Matplotlib: When you have come up with a problem you can solve with Machine Learning, you will most likely want to visualise your results. This is where Matplotlib comes in. It is designed to plot the values of any mathematical graph and is heavily used in the academic world.
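To make the advice about getting something working first a little more concrete, here is a minimal sketch of a first scikit-learn program. It assumes scikit-learn is installed. The bundled iris data set, the k-nearest-neighbours classifier and the helper name train_and_score are choices made for this illustration only; the list above does not prescribe them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_and_score(test_size=0.25, seed=0):
    """Fit a small classifier on the iris data and return its test accuracy.

    train_and_score is a hypothetical helper name used for this sketch.
    """
    # Load the features (X) and the class labels (y) as NumPy arrays.
    X, y = load_iris(return_X_y=True)

    # Hold back part of the data so the model is scored on unseen samples.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # k-nearest neighbours is an arbitrary, beginner-friendly choice here.
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)

    # score() returns the fraction of correctly classified test samples.
    return model.score(X_test, y_test)


if __name__ == "__main__":
    print(f"Test accuracy: {train_and_score():.2f}")
```

The same load, split, fit and score pattern applies to most scikit-learn estimators, so swapping in another algorithm usually means changing only the line that creates the model.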
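The idea behind the Scikit-image and TensorFlow entries, that an image or a tensor is simply a multi-dimensional array, can be sketched with NumPy alone. The synthetic gradient image and the helper names below are hypothetical stand-ins; a real project would more likely load an image file with Scikit-image.

```python
import numpy as np


def make_gradient_image(height=4, width=6):
    """Build a synthetic greyscale image with values from 0.0 to 1.0.

    This stands in for an image loaded from disk (an assumption of this sketch).
    """
    row = np.linspace(0.0, 1.0, width)   # one row, dark to bright
    return np.tile(row, (height, 1))     # repeat the row to get a 2-D array


def invert(image):
    """Invert brightness; plain array arithmetic is all that is needed."""
    return 1.0 - image


def to_rgb(grey):
    """Stack a 2-D greyscale image into a 3-D (height, width, channel) array."""
    return np.stack([grey, grey, grey], axis=-1)


if __name__ == "__main__":
    img = make_gradient_image()
    print("greyscale shape:", img.shape)       # two dimensions
    print("RGB shape:", to_rgb(img).shape)     # three dimensions
    print("first row inverted:", invert(img)[0])
```

Once the data is an ndarray, the number of dimensions is the only difference between a greyscale image, a colour image and a higher-order tensor.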
Conclusion
This article has given you an idea of what is available for programming Machine Learning in Python. To get a clear picture of what you need, you must start by making a few programs and seeing how they work. Only when you know how things can be done can you find the perfect solution for your next project.