SkLearn 101

Scikit-learn is a machine learning library for Python. It provides many algorithms for regression, classification, and clustering, including support vector machines (SVMs), gradient boosting, k-means, random forests, and DBSCAN. It is designed to interoperate with NumPy and SciPy.

The project began as scikits.learn, a Google Summer of Code (GSoC) project by David Cournapeau. Its name comes from “SciKit” (SciPy Toolkit), the term for a separately developed third-party extension to SciPy.

Python Scikit-learn

Most of scikit-learn is written in Python, and some of its core algorithms are written in Cython for better performance.

Scikit-learn is meant for building models. It is not recommended for reading, manipulating, and summarizing data, because better frameworks are available for those tasks (pandas, for example).

It is open source and released under the BSD license.

Scikit-learn (often written sklearn, after its import name) is one of the most widely used machine learning libraries in Python.

It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering, and dimensionality reduction via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy, and Matplotlib.
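
As a rough illustration of the consistent interface mentioned above, the sketch below trains three unrelated classifiers through the same fit and predict calls. It uses the built-in Iris flower dataset, which is introduced later in this article. The choice of estimators and the sample values are arbitrary and only serve the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three different algorithms, chosen arbitrarily for this sketch
estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]

sample = [[5.0, 3.6, 1.3, 0.25]]  # one flower, four measurements
predictions = {}
for est in estimators:
    name = type(est).__name__
    est.fit(X, y)                            # same training call for every estimator
    predictions[name] = est.predict(sample)  # same prediction call for every estimator
    print(name, predictions[name])
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.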

Scikit-learn comes loaded with a lot of features. Here are a few of them to help you understand the spread:

Supervised learning algorithms: Almost any supervised machine learning algorithm you have heard of is likely to be part of scikit-learn. Generalized linear models (e.g. linear regression), support vector machines (SVMs), decision trees, and Bayesian methods are all part of the scikit-learn toolbox. This breadth of algorithms is one of the main reasons scikit-learn is so widely used. I started using scikit-learn to solve supervised learning problems and would recommend the same starting point to people new to scikit-learn or machine learning.
Cross-validation: sklearn offers several methods for checking the accuracy of supervised models on unseen data.
Unsupervised learning algorithms: Again, a wide range of algorithms is on offer, from clustering, factor analysis, and principal component analysis to unsupervised neural networks.
Various toy datasets: These came in handy while I was learning scikit-learn. I had learned SAS using various academic datasets (e.g. the Iris dataset and the Boston house prices dataset), and having them at hand while learning a new library helped a lot. Note that the Boston dataset was removed from scikit-learn in version 1.2.
Feature extraction: Scikit-learn provides tools for extracting features from images and text (e.g. bag of words).
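
To make the cross-validation item above more concrete, here is a minimal sketch that scores a classifier with 5-fold cross-validation. The classifier and the number of folds are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier()

# 5-fold cross-validation: train on four fifths of the data,
# score on the held-out fifth, and repeat five times
scores = cross_val_score(clf, X, y, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

Each score comes from data the model did not see while training on that fold, which is what makes the average a fairer estimate than accuracy on the training set.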

Install Scikit Learn

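Scikit-learn assumes that Python and the NumPy and SciPy packages are already installed. The minimums originally stated for this tutorial were Python 2.7, NumPy 1.8.2, and SciPy 0.13.3, but current scikit-learn releases require Python 3 and considerably newer versions of both packages, so check the official installation guide for the exact requirements. Once these prerequisites are in place, we can continue with the installation.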

To install with pip, run the following command in a terminal:

pip install scikit-learn

To confirm that the installation worked, import the package in a Python session:

import sklearn
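
If the import succeeds, a quick way to see which release was installed is to print the package's version string. This is a small optional check rather than a required step.

```python
import sklearn

# Version string of the installed scikit-learn package
print(sklearn.__version__)
```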

Scikit Learn Loading Dataset

Let’s begin by loading a dataset to experiment with.

We will use Iris, a simple flower dataset with 150 observations. Each observation records four measurements of a flower (sepal length, sepal width, petal length, and petal width) together with its species.

Here is how to load it with scikit-learn:

# Import the datasets module from scikit-learn
from sklearn import datasets
# Load data
iris = datasets.load_iris()
# Print shape of data to confirm data is loaded
print(iris.data.shape)

gives us: (150, 4)
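
Beyond the shape, the loaded object also carries the column names and class names. The short sketch below prints them along with the first observation, which may help when reading the later examples.

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # names of the three species
print(iris.data[0])        # measurements of the first flower
print(iris.target[0])      # its class label, an index into target_names
```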

Scikit Learn SVM – Learning and Predicting

Now that we have the data loaded, let’s try to learn from it and predict new data. To do this, we construct an estimator and then call its fit method.

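from sklearn import svm
from sklearn import datasets
# Load dataset
iris = datasets.load_iris()
# Create a linear support vector classifier
clf = svm.LinearSVC()
# Learn from the data
clf.fit(iris.data, iris.target)
# Predict the class of a new, unseen sample
print(clf.predict([[5.0, 3.6, 1.3, 0.25]]))
# Attributes ending with an underscore hold values learned during fit
print(clf.coef_)

Running this script prints the predicted class label for the new sample, followed by the learned coefficient matrix, which has one row per class and one column per feature (3 rows and 4 columns for Iris).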
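
The script above trains on the full dataset, so it cannot tell us how well the model generalizes. A common way to check is to hold out part of the data for testing, as in the sketch below. The split size, the random seed, and the raised max_iter value are assumptions made for this example only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 30% of the samples; test_size and random_state are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# max_iter is raised above the default as a precaution against convergence warnings
clf = svm.LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

# score() returns the mean accuracy on the given data
accuracy = clf.score(X_test, y_test)
print(accuracy)
```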

Scikit Learn Linear Regression

Creating various models is rather simple using scikit-learn. Let’s start with a simple example of regression.

As before, we construct an estimator and call its fit method, this time on a tiny hand-written dataset instead of Iris.

# Import the model
from sklearn import linear_model
reg = linear_model.LinearRegression()
# Fit it to three training points: two features each, one target each
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
# Look at the fitted coefficients
print(reg.coef_)

gives us: [0.5 0.5]
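
Once fitted, the same estimator can predict targets for new inputs. The sketch below reuses the training data from above and asks for a prediction at a new point, [3, 3], which is a value picked purely for illustration.

```python
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# The training data follows y = 0.5*x1 + 0.5*x2 with no offset,
# so the fitted intercept should be very close to zero
print(reg.intercept_)

# Predict the target for a new point; by that relationship
# the result should be close to 3
prediction = reg.predict([[3, 3]])
print(prediction)
```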

kNN

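Let’s try a simple classification algorithm, k-nearest neighbors (kNN), which assigns a new sample the class that is most common among its closest training samples.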

from sklearn import datasets, neighbors
# Load dataset
iris = datasets.load_iris()
# Create and fit a nearest-neighbor classifier (5 neighbors by default)
knn = neighbors.KNeighborsClassifier()
knn.fit(iris.data, iris.target)
# Predict and print the result
result = knn.predict([[0.1, 0.2, 0.3, 0.4]])
print(result)

gives us: [0]
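
The number of neighbors that vote is configurable, and the classifier can also report how the votes were split. The following sketch shows both. The value n_neighbors=3 is an arbitrary choice for the example.

```python
from sklearn import datasets, neighbors

iris = datasets.load_iris()

# n_neighbors sets how many nearby training samples vote; 3 is arbitrary here
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(iris.data, iris.target)

sample = [[0.1, 0.2, 0.3, 0.4]]
label = knn.predict(sample)
# With the default uniform weights, predict_proba gives the share of
# neighbors that belong to each class
proba = knn.predict_proba(sample)

print(iris.target_names[label])  # species name for the predicted label
print(proba)
```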

K-means clustering

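This is one of the simplest clustering algorithms. The data is divided into ‘k’ clusters, and each observation is assigned to the cluster whose center is nearest. The centers are then recomputed from their assigned observations, and the two steps repeat until the clusters stop changing.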

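from sklearn import cluster, datasets
# Load data
iris = datasets.load_iris()
# Create a k-means model with k=3 clusters
k = 3
k_means = cluster.KMeans(n_clusters=k)
# Fit data (no labels are used)
k_means.fit(iris.data)
# Print every 10th cluster label next to every 10th true label;
# cluster numbering is arbitrary, so the ids need not match
print(k_means.labels_[::10])
print(iris.target[::10])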
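
Comparing cluster ids with class ids by eye is awkward because the cluster numbering is arbitrary. One option is the adjusted Rand index, which ignores the numbering. The sketch below is one possible way to run that comparison. The n_init and random_state values are assumptions chosen only to make the run repeatable.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()

# n_init and random_state are fixed here only to make the run repeatable
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(iris.data)

# One centroid per cluster, each with one coordinate per feature
print(k_means.cluster_centers_.shape)

# Adjusted Rand index: 1.0 means identical groupings,
# values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, k_means.labels_)
print(ari)
```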

Ending Note

If you liked reading this article and want to read more, continue to follow the site! We have a lot of interesting articles upcoming in the near future. If you are new to any of these concepts, we recommend you take up tutorials concerning these topics.