Predictive modeling with Machine Learning in R — Part 1 (Introduction)
“The latest Netflix series is not being made because a producer had a divine inspiration or a moment of lucidity, but because a data model says it will work.” — Enrique Dans, IE Business School
What is the post about?
This post is the first in a series where I plan to introduce the types of machine learning, apply machine learning algorithms to real datasets, and show how to evaluate those algorithms. In a nutshell, I want my posts to act like a 101 course for Machine Learning. I have chosen R as my tool of choice for these posts, but I plan to do a similar series for Python as well.
A word of caution: these posts are practice-oriented and cover only limited theory about the algorithms I will be using. If you would like to understand visually how machine learning works, I strongly recommend visiting this website created by Stephanie Yee and Tony Chu.
Objectives/Prerequisites
- This series of posts aims to provide a newbie in ML with accessible content, reproducible code, and tips and tricks to get started on one’s journey in ML.
- By the end of the last post in this series, I would expect this newbie to be able to apply the ML algorithms learnt here to other datasets and to explore new algorithms in this realm.
- Familiarity with R/RStudio syntax — how to import data into R, handle data, manipulate, plot, etc. Please refer to my series of posts on Data Analysis with R for a refresher on this topic.
Introduction
In this day and age of Big Data, there is a keen interest in Data Science, Artificial Intelligence, and Machine Learning. In this post, I will address the basic questions (what, why, and how) on machine learning before moving on to the actual coding of machine learning algorithms in the subsequent posts. Predictive modeling is one of the most important use cases of machine learning and it is used everywhere — weather forecasting, retail marketing, web search, risk prediction of disease, etc.
A. What is Machine Learning?
The terms Artificial Intelligence, Machine Learning, Deep Learning, and Data Science are frequently used interchangeably. Let’s look at the differences, as illustrated in the picture below, and also understand what machine learning is.
- Artificial Intelligence (AI): the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence. ‘Until recently’ is the critical phrase here: fifty years ago, a chess-playing program was considered a form of AI, whereas nowadays a chess game can be found on almost every mobile phone or computer. In today’s world, I would consider Google Home or Amazon’s Alexa to be an example of AI that answers our queries and makes suggestions based on its machine-learning-powered algorithms.
- Machine Learning (ML): the study of computer algorithms that improve automatically through experience. In other words, the program learns patterns from data instead of following rules written out by hand.
- Deep Learning (DL): a subset of machine learning. DL automatically works out which features are important for classification, whereas traditional machine learning requires the user to provide those features.
- Data Science: the extraction of relevant insights from data using various techniques from mathematics, machine learning, computer programming, statistical modelling, data engineering and visualization, pattern recognition and learning, uncertainty modelling, data warehousing, and cloud computing.
B. Why Machine Learning (ML)?
To answer this question, we first need to understand how conventional programming (CP) works and then how ML improves on it. Let’s use the famous FizzBuzz game to understand the differences. The problem takes an input number and checks whether it is divisible by 3 and by 5.
- if the number is divisible by 3, then it prints ‘fizz’,
- if it is divisible by 5, then it prints ‘buzz’,
- if it is divisible by both, then it prints ‘fizzbuzz’, and
- if it is divisible by neither 3 nor 5, then it prints ‘other’.
Conventional programming approach
This is extremely straightforward, as we have only 4 scenarios to work with. Python code to solve this problem is as follows.
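def fizzbuzz(n):
    # Check divisibility by both 3 and 5 first, otherwise this case
    # would be caught by one of the single-divisor branches below.
    if n % 3 == 0 and n % 5 == 0:
        return 'FizzBuzz'
    elif n % 3 == 0:
        return 'Fizz'
    elif n % 5 == 0:
        return 'Buzz'
    else:
        return 'Other'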
Machine learning approach
Suppose we already have a large set of numbers whose outputs are known, i.e., whether each one is ‘fizz’, ‘buzz’, ‘fizzbuzz’, or ‘other’. All we need to do is
- write machine learning code and train it on the available data.
- then check whether we have created a successful model by testing it on unseen data.
A model built with Google’s TensorFlow could achieve an accuracy of 98% after 5000 training iterations.
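The train-then-test steps above can be sketched in plain Python, to match the FizzBuzz function shown earlier. This is a minimal illustration and not the TensorFlow model mentioned in the text: the names `feature`, `train`, and `predict` are made up for this example, and the "learner" is a simple majority-vote lookup on one hand-picked feature (the remainder modulo 15), which is an assumption chosen to keep the code tiny.

```python
from collections import Counter, defaultdict


def fizzbuzz(n):
    """Rule-based labeller, used here only to produce the known answers."""
    if n % 15 == 0:
        return "fizzbuzz"
    if n % 3 == 0:
        return "fizz"
    if n % 5 == 0:
        return "buzz"
    return "other"


def feature(n):
    # Assumption: we hand-pick the remainder modulo 15 as the only input.
    # A neural network would have to discover a useful pattern by itself.
    return n % 15


def train(examples):
    """Learn the most common label seen for each feature value."""
    votes = defaultdict(Counter)
    for n, label in examples:
        votes[feature(n)][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}


def predict(model, n):
    # Fall back to "other" for a feature value never seen during training.
    return model.get(feature(n), "other")


# Numbers with known outputs (training data) and unseen numbers (test data).
train_data = [(n, fizzbuzz(n)) for n in range(101, 1001)]
test_data = [(n, fizzbuzz(n)) for n in range(1, 101)]

model = train(train_data)
correct = sum(predict(model, n) == label for n, label in test_data)
accuracy = correct / len(test_data)
print(f"Accuracy on unseen numbers: {accuracy:.0%}")
```

This toy learner gets every test number right only because the hand-picked feature captures the whole rule. In real problems we rarely know such a feature in advance, which is exactly where learning from data earns its keep.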
Netflix scenario
Now, let’s imagine the following scenario of being the boss of Netflix which has millions of customers. Each customer has a unique preference. How do we get a solution and scale it to all of those millions of customers?
For conventional programming, this task becomes increasingly difficult because:
- You don’t know all the factors that determine a person’s watching habits.
- Even if you did, the solution would not scale to millions of users, because you would have to write a separate solution for each person based on his/her habits.
So, this is why we turn to machine learning.
The ML model will be trained to answer two questions:
- Based on your past data, which movies are you most likely to watch?
- What are people like you watching these days?
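To make these two questions concrete, here is a rough sketch of one common idea, recommending titles watched by users whose history resembles yours. It is not a description of how Netflix actually works. The viewing data, the user and show names, and the functions `jaccard` and `recommend` are all hypothetical and exist only for this illustration.

```python
def jaccard(a, b):
    """Share of titles two users have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, history, top_n=3):
    """Rank titles the user has not seen by how similar their viewers are."""
    seen = history[user]  # "your past data"
    scores = {}
    for other, titles in history.items():
        if other == user:
            continue
        similarity = jaccard(seen, titles)  # how much "like you" this user is
        for title in titles - seen:
            scores[title] = scores.get(title, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, score in ranked[:top_n] if score > 0]


# Made-up viewing history: each user maps to the set of titles they watched.
history = {
    "user_1": {"show_a", "show_b", "show_c"},
    "user_2": {"show_a", "show_b", "show_d"},
    "user_3": {"show_e", "show_f"},
    "user_4": {"show_b", "show_c", "show_d", "show_e"},
}

print(recommend("user_1", history))
```

Nothing in this code is written for a particular customer. The same function serves every user, and the suggestions change as the viewing data changes, which is the scaling property the conventional approach lacks.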
I hope the quote from Prof. Enrique Dans at the top of this article makes sense to you now.
C. How to do Machine Learning?
This series of posts is meant to answer this question. Starting with the next post, we will see how to build machine learning models on real datasets to predict an outcome. Briefly, machine learning can be applied in one of the three ways illustrated in the following picture. In this series of posts, we will mostly focus on supervised and unsupervised learning.
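As a rough preview of the difference between the two, the sketch below uses the same toy numbers twice: once with labels (supervised) and once without (unsupervised). The data (hours watched per week), the labels, and the function names `nearest_neighbour` and `two_means` are assumptions made up for this example. The later posts use R and real datasets.

```python
def nearest_neighbour(labelled, x):
    """Supervised: copy the label of the closest labelled example."""
    value, label = min(labelled, key=lambda pair: abs(pair[0] - x))
    return label


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled values into two groups (k-means, k=2)."""
    centres = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to whichever centre is closer.
            closer = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[closer].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


# Hypothetical data: hours of viewing per week, with a label for each viewer.
labelled = [(1, "casual"), (2, "casual"), (3, "casual"),
            (20, "binge"), (22, "binge"), (25, "binge")]

# Supervised learning: the labels are known, so we can predict one for a new viewer.
print(nearest_neighbour(labelled, 4))

# Unsupervised learning: no labels at all, the algorithm only finds the groups.
hours_only = [hours for hours, _ in labelled]
print(two_means(hours_only))
```

In the supervised case the model is told what the groups mean, while in the unsupervised case it only discovers that two groups exist and naming them is left to us.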
Conclusion
In this post, I have introduced the basics of ML: what, why, and how. We also distinguished between terms that are commonly used interchangeably, such as artificial intelligence, data science, and machine learning. This sets the stage for diving into the world of machine learning and learning how to apply ML algorithms to real datasets in R.