If you are a Python programmer, or you are looking for a robust library to bring machine learning into a production system, then scikit-learn is a library you will want to seriously consider.
In this post you will get an overview of the scikit-learn library and useful references to where you can learn more.
Scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, Matthieu Brucher joined the project and started to use it as part of his thesis work. In 2010 INRIA got involved and the first public release (v0.1 beta) was published in late January 2010.
The project now has more than 30 active contributors and has had paid sponsorship from INRIA, Google, Tinyclues and the Python Software Foundation.
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed with many Linux distributions, encouraging academic and commercial use.
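To make the "consistent interface" concrete, here is a minimal sketch that runs two supervised estimators and one unsupervised estimator through the same fit and predict calls. The particular estimators and parameter values (such as max_iter=1000 and n_clusters=3) are my own choices for illustration, not something the library prescribes.

```python
# Sketch: the same fit/predict pattern works across different estimators.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised estimators: fit(X, y), then predict(X).
supervised = [
    DecisionTreeClassifier(random_state=0),
    LogisticRegression(max_iter=1000),  # illustrative setting
]
for estimator in supervised:
    estimator.fit(X, y)
    labels = estimator.predict(X)
    print(type(estimator).__name__, labels[:5])

# Unsupervised estimator: fit(X) without targets, then predict(X).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
clusters = kmeans.predict(X)
print(type(kmeans).__name__, clusters[:5])
```

Swapping one algorithm for another usually means changing only the line that constructs the estimator.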
The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:
Extensions or modules for SciPy are conventionally named SciKits. This module provides learning algorithms, hence the name scikit-learn.
The vision for the library is a level of robustness and support required for use in production systems. This means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance.
Although the interface is Python, C libraries are leveraged for performance, such as NumPy for array and matrix operations, LAPACK, LibSVM and the careful use of Cython.
The library is focused on modeling data. It is not focused on loading, manipulating and summarizing data. For these features, refer to NumPy and Pandas.
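As a rough illustration of that division of labor, the sketch below summarizes and slices the data with NumPy and only hands it to scikit-learn for the modeling step. The choice to keep just the two petal columns is an assumption made for the example.

```python
# Sketch: NumPy handles summarizing and slicing, scikit-learn handles modeling.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target  # plain NumPy arrays

# Summarize with NumPy: per-feature mean and standard deviation.
means = np.mean(X, axis=0)
stds = np.std(X, axis=0)
for name, mean, std in zip(iris.feature_names, means, stds):
    print(f"{name}: mean={mean:.2f}, std={std:.2f}")

# Manipulate with NumPy: keep only the petal length and width (columns 2 and 3).
X_petal = X[:, 2:4]

# Model with scikit-learn.
model = DecisionTreeClassifier(random_state=0).fit(X_petal, y)
print(model.predict(X_petal[:3]))
```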
Some popular groups of models provided by scikit-learn include:
I want to give you an example to show you how easy it is to use the library.
In this example we use the Classification and Regression Trees (CART) decision tree algorithm to model the Iris flower dataset. This dataset ships with the library as an example dataset, so it can be loaded directly. The classifier is fit on the data and then predictions are made on the training data. Finally, a classification report and a confusion matrix are printed.
# Decision Tree Classifier
from sklearn import datasets
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# load the iris dataset
dataset = datasets.load_iris()
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
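The example above scores the model on the same rows it was trained on, which tends to flatter a decision tree. As a hedged variation, here is a sketch that holds out part of the data for evaluation. The 30% split, the stratification and the fixed random_state are my own choices for illustration, not part of the original example.

```python
# Sketch: evaluate the same CART model on held-out data.
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = datasets.load_iris()

# Hold out 30% of the rows for testing; stratify keeps class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data,
    dataset.target,
    test_size=0.3,
    stratify=dataset.target,
    random_state=0,
)

# Fit on the training rows only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Score on rows the model has not seen.
predicted = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, predicted)
print(f"held-out accuracy: {accuracy:.3f}")
print(metrics.confusion_matrix(y_test, predicted))
```

The held-out accuracy is usually a more honest estimate of how the model will behave on new flowers than the training-set numbers.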
The scikit-learn testimonials page lists Inria, Mendeley, wise.io, Evernote, Telecom ParisTech and AWeber as users of the library. If this is a small indication of companies that have presented on their use, then there are very likely tens to hundreds of larger organizations using the library.
It has good test coverage and managed releases and is suitable for prototype and production projects alike.
If you are interested in learning more, check out the scikit-learn homepage, which includes documentation and related resources. You can get the code from the GitHub repository, and releases are historically available on the SourceForge project.
I recommend starting out with the quick-start tutorial and flicking through the user guide and example gallery for algorithms that interest you. Ultimately, scikit-learn is a library and the API reference will be the best documentation for getting things done.
If you are interested in more information about how the project started and its vision, there are some papers you may want to check out.
If you are looking for a good book, I recommend “Building Machine Learning Systems with Python”. It’s well written and the examples are interesting.