With the growing value of big data and machine learning, data science has attracted interest from professionals in many areas of expertise. You are one of these professionals: you have studied linear algebra, calculus, probability, and machine learning, and now you want to put this knowledge into practice.
All you want to do is load a small dataset, explore it, create some visualizations, and train a simple model. So you search the Internet for the right tool to start your brand new data science project, and you find a lot of options. You install new software and libraries, and spend some time reading tutorials. But you still can’t decide which tool to use.
In the next sections, we help you with this decision by listing five reasons that make Google Colab the right tool for beginner data scientists.
Machine Learning is a very hot topic these days. Companies from different sectors and sizes, from startups to enterprises, are boosting their products and services with the help of machine learning.
As the demand for machine learning engineers grows, many experienced developers want to get into the field as fast as they can.
But as soon as they buy their first courses and books, they get frustrated. Instead of code and working examples, they face a lot of math and statistics, topics that they haven’t applied since graduation.
Besides math, another source of frustration is the necessary software to build machine learning models. A few years ago, you had to install specialized software, buy powerful hardware and sometimes implement your algorithms from scratch.
The good news is that none of these is necessary to get started in machine learning. Current machine learning libraries abstract most of the math and algorithms you need so you can concentrate on the data flow instead of implementation details. Also, a web browser is everything you need to start creating your first machine learning models.
In this tutorial, we are going to create our first machine learning model using the most famous Python libraries and the Google Colab environment, so we don’t have to waste any time installing and configuring new software.
We use Google Colab to create Python notebooks. A notebook is a special file in which we can mix formatted text and Python code, so we can create rich documentation for our machine learning experiments. We can also plot charts directly in the notebook.
In this tutorial we are going to build a machine learning model to determine which species an Iris flower belongs to. The Iris dataset contains tabular data about characteristics of the flower specimens, like petal width and length, which are used as input to the model. The output will be an integer representing one of the 3 possible species: Iris Setosa, Iris Versicolour or Iris Virginica. The next sections describe the construction of the model, step by step.
In this section we import the necessary libraries so you can build your model.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from joblib import dump, load
After loading the dataset into a dataframe in memory, the next step is to perform an exploratory data analysis (EDA). The objective of the EDA is to discover as much information as possible about the dataset. The describe() method is a good starting point: it reports summary statistics of the dataset, such as the count, mean, and standard deviation of each numeric column.
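As a minimal sketch of this step, assuming scikit-learn's bundled copy of Iris stands in for the dataset file (the column names below are our own choice, and the 'class' column is rebuilt to match the labels used later in this tutorial), loading and describing the data might look like this:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the copy of Iris shipped with scikit-learn replaces the dataset file.
iris = load_iris(as_frame=True)
df = iris.data.copy()

# Hypothetical column names, chosen only for this sketch.
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Rebuild a string 'class' column such as 'Iris-setosa'.
df["class"] = iris.target.map(lambda i: "Iris-" + iris.target_names[i])

print(df.head())        # first five rows
summary = df.describe() # count, mean, std, min, quartiles, max per numeric column
print(summary)
```

Only the four numeric columns appear in the summary, because describe() skips string columns by default.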
A very important tool in exploratory data analysis is data visualization, which helps us to gain insights about the dataset. The plot below shows the relationship between the attributes of the dataset.
Another interesting use of data visualization is a heatmap of the correlation matrix of the dataset.
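A correlation heatmap could be sketched as follows, again assuming scikit-learn's bundled Iris data. The color map and the annotation option are our own choices.

```python
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# Pearson correlation between the four numeric attributes.
corr = iris.data.corr()

# Draw the matrix on its own figure; annot=True writes each coefficient in its cell.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Cells close to 1 or -1 point to attributes that move together, such as petal length and petal width.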
3. Preprocess the data
Frequently, the dataset collected from databases, files or scraping the internet is not ready to be consumed by a machine learning algorithm. In most cases, the dataset needs some kind of preparation or preprocessing before being used as input to a machine learning algorithm. In this case, we convert the string values of the class column to integer numbers because the algorithm we are going to use does not process string values.
df['class_encod'] = df['class'].apply(lambda x: 0 if x == 'Iris-setosa' else 1 if x == 'Iris-versicolor' else 2)
4. Select an algorithm and train the model
After exploring and preprocessing our data, we can build our machine learning model to classify Iris specimens. The first step is to split our dataframe into input attributes and target attributes.
y = df[['class_encod']] # target attributes
X = df.iloc[:, 0:4] # input attributes
Whereas in the previous step we split the dataframe by columns, in this step we split the data by rows. The train_test_split() function splits the X and y dataframes into training data and test data.
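A minimal sketch of the row split, assuming scikit-learn's bundled Iris data in place of our dataframe. The test_size, random_state, and stratify arguments are our own choices, not values taken from the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X holds the four input attributes, y the integer-encoded species.
X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out 30% of the rows for testing; stratify keeps the species balanced
# in both parts, and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

The model only ever sees the training rows, so the test rows give an honest estimate of how it behaves on new data.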
Then we use the datasets X_train and y_train to build a KNN classifier, using the KNeighborsClassifier class provided by scikit-learn. Because the machine learning algorithm is already implemented by the library, all we have to do is call the method fit() passing the X_train and y_train datasets as arguments.
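m = KNeighborsClassifier()
m.fit(X_train, y_train.values.ravel())  # train the model; ravel() flattens the one-column target to 1-D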
Once the model is built, we can use the predict() method to calculate the predicted category of an instance. In this case, we want to predict the class of the first 10 rows of the X_test dataset. The return value is an array containing the estimated categories.
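As a self-contained sketch of this step (assuming scikit-learn's bundled Iris data and a hypothetical 70/30 split with a fixed random_state), predicting the first 10 test rows might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

m = KNeighborsClassifier()  # default: 5 nearest neighbors
m.fit(X_train, y_train)

# Predict the species code (0, 1 or 2) for the first 10 test rows.
predictions = m.predict(X_test.iloc[:10])
print(predictions)
print(y_test.iloc[:10].to_numpy())  # true labels, for comparison
```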
We can use the model's score() method and the confusion_matrix() function to measure the performance of our model. Here the accuracy of our model is 1.0 (100%), which means that the model correctly predicted every case in the test dataset.
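The evaluation can be sketched as below, under the same assumptions as before (bundled Iris data, hypothetical 70/30 split). The exact accuracy depends on how the rows happen to be split, so a different random_state may give a slightly different value.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
m = KNeighborsClassifier().fit(X_train, y_train)

# Fraction of test rows classified correctly.
accuracy = m.score(X_test, y_test)

# Rows are true classes, columns are predicted classes;
# every value off the diagonal is a misclassification.
cm = confusion_matrix(y_test, m.predict(X_test))

print(accuracy)
print(cm)
```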
Finally, we want to save our model for later use. For example, we could embed our model into a web service or mobile application. So we use the dump() function from the joblib package to save the model to a file, and load() to read it back.
dump(m, 'iris-classifier.dmp')  # save the trained model to a file
ic = load('iris-classifier.dmp')  # load it back later for reuse
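To check that a saved model survives the round trip, we can compare the predictions of the original and the reloaded classifier. This is a sketch under the same assumptions as before; it writes to a temporary directory so nothing is left on disk.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
m = KNeighborsClassifier().fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris-classifier.dmp")
    dump(m, path)    # serialize the fitted model
    ic = load(path)  # restore it as a new object

# The restored model should predict exactly like the original one.
same = bool((ic.predict(X_test) == m.predict(X_test)).all())
print(same)
```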
Despite the simplicity of the Iris dataset, the steps demonstrated in this tutorial can be reproduced for practically any other dataset. The effort required for each step may vary, but the process is basically the same, so you can start practicing with simpler datasets and increase the complexity of your projects incrementally.
So, click the link below to open a ready-to-use template in Google Colab and create your first machine learning model right now:
The Markov Decision Process is the formal description of the Reinforcement Learning problem. It includes concepts like states, actions, rewards, and how an agent makes decisions based on a given policy. So, what Reinforcement Learning algorithms do is find optimal solutions to Markov Decision Processes.
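To make these concepts concrete, here is a toy sketch in plain Python. The 2x2 grid, the names, and the rewards are all hypothetical, and transitions are deterministic for simplicity, whereas a general Markov Decision Process allows probabilistic transitions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

State = int
Action = str


@dataclass
class MDP:
    # (state, action) -> (next_state, reward); deterministic in this sketch
    transitions: Dict[Tuple[State, Action], Tuple[State, float]]
    terminal: Set[State]


def run_episode(mdp: MDP, policy: Callable[[State], Action],
                start: State, max_steps: int = 20) -> Tuple[State, float]:
    """Follow the policy from start; return the final state and total reward."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state in mdp.terminal:
            break
        action = policy(state)
        state, reward = mdp.transitions[(state, action)]
        total += reward
    return state, total


# Toy 2x2 grid: 0 = start, 1 = hole, 2 = safe ice, 3 = goal.
mdp = MDP(
    transitions={
        (0, "right"): (1, 0.0),  # falls into the hole
        (0, "down"): (2, 0.0),
        (2, "right"): (3, 1.0),  # reaches the goal
        (2, "down"): (2, 0.0),   # bumps into the border and stays put
    },
    terminal={1, 3},
)

good_policy = {0: "down", 2: "right"}.get
bad_policy = {0: "right", 2: "right"}.get

print(run_episode(mdp, good_policy, start=0))
print(run_episode(mdp, bad_policy, start=0))
```

A policy is just a mapping from states to actions, and an algorithm that searches for the mapping with the highest total reward is solving the MDP.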
In this article, we are going to learn how to create and explore the Frozen Lake environment using the Gym library, an open source project created by OpenAI and used for reinforcement learning experiments. The Gym library defines a uniform interface for environments, which makes the integration between algorithms and environments easier for developers. Among many ready-to-use environments, the default installation includes a text-mode version of the Frozen Lake game, used as an example in our last post.
Let’s understand how Reinforcement Learning works through a simple example. Let’s play a game called The Frozen Lake. Suppose you were playing frisbee with your friends in a park during winter. One of you threw the frisbee so far that it landed on a frozen lake. Your mission is to walk over the frozen lake to get the frisbee back, taking care not to fall into a hole of freezing water.
Over the past few years, we’ve seen computer programs winning games at which we believed humans were unbeatable. This belief held because these games have so many possible moves for a given position that it would be impossible for a computer program to calculate all of them and choose the best ones. However, in 1997 the world witnessed what had been considered impossible: the IBM Deep Blue supercomputer won a six-game chess match against Garry Kasparov, the world champion at the time, by 3.5 – 2.5. Such a victory would only be achieved again when DeepMind’s AlphaGo won a five-game Go match against Lee Sedol, an 18-time world champion, by a 4-1 score.