Supervised Learning vs Unsupervised Learning

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

Nowadays, nearly everything in our lives can be quantified by data. Whether it involves search engine results, social media usage, weather trackers, cars, or sports, data is always being collected to enhance our quality of life. But how do we get from all this raw data to better performance? This article introduces the tools and techniques developed to make sense of unstructured data and discover hidden patterns. Specifically, the main topics covered are:

1. Supervised & Unsupervised Learning and the main techniques corresponding to each one (Classification and Clustering, respectively).

2. An in-depth look at the K-Means algorithm

Goals

1. Understanding the many different techniques used to discover patterns in a set of data

2. In-depth understanding of the K-Means algorithm

1.1 Unsupervised and supervised learning

In unsupervised learning, we try to discover hidden patterns in data when we don’t have any labels. We will explain what hidden patterns and labels are, and we will go through real data examples.

What is unsupervised learning?

First, let’s step back to what learning even means. In machine learning and statistics, we are typically trying to find hidden patterns in data. Ideally, we want these hidden patterns to help us in some way: for instance, to understand some scientific results, to improve our user experience, or to maximize profit from an investment. Supervised learning is when we learn from data and have labels for all the data we have seen so far. Unsupervised learning is when we learn from data but don’t have any labels.

Let’s use email as an example. In general, it can be hard to keep our inbox in check. We get many emails every day, and a big problem is spam. In fact, it would be an even bigger problem if email providers, like Gmail, were not so effective at keeping spam out of our inboxes. But how do they know whether a particular email is spam or not? This is our first example of a machine learning problem.

Every machine learning problem has a data set, which is a collection of data points that help us learn. Here, the data set is all the emails sent to you over a month, and each data point is a single email. Whenever you get an email, you can quickly tell whether it’s spam, and you might hit a button to label it as spam or not spam. Now you can imagine that each of your data points has one of two labels, spam or not spam. In the future, you will keep getting emails, but you won’t know in advance which label each one should have. The machine learning problem is to predict the label, spam or not spam, of each new email. If our machine learning algorithm works, it can put all the spam in a separate folder. This spam problem is an example of supervised learning. You can imagine a teacher, or supervisor, telling you the label of each data point, that is, whether each email is spam or not spam. The supervisor might also be able to tell us whether the labels we predicted were correct.
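To make the spam example concrete, here is a minimal sketch of supervised learning in plain Python. It is not how Gmail or any real provider filters spam. The six labeled emails are made up for illustration, and the classifier is a small naive Bayes model with add-one smoothing, chosen only because it fits in a few lines. The function names `train` and `predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled data set: each data point is (email text, label).
labeled_emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday with the team", "not spam"),
    ("project report for the meeting", "not spam"),
]

def train(data):
    """Count how often each word appears under each label."""
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of emails with that label
    for text, label in data:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the most probable label (naive Bayes, add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(label_counts[label] / total)
        n_words = sum(counts.values())
        for word in text.split():
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Learn from the labeled emails, then predict labels for new, unlabeled ones.
word_counts, label_counts = train(labeled_emails)
print(predict("free prize inside", word_counts, label_counts))
print(predict("monday meeting report", word_counts, label_counts))
```

The labels in `labeled_emails` play the role of the supervisor: the model only learns which words signal spam because someone already marked each training email.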

So what is unsupervised learning? Let’s try another example of a machine learning problem. Imagine you are looking at your inbox and realize you have too many emails. It would be helpful if you could read all the emails on the same topic at the same time. So, you might run a machine learning algorithm that groups similar emails together. After running it, you find that there are natural groups of emails in your inbox. This is an example of an unsupervised learning problem. Nobody labeled any of the emails, which means there is no supervisor.
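The grouping idea can be sketched in plain Python as well. This is a deliberately simple greedy grouping based on word overlap (Jaccard similarity), not K-Means and not any production clustering method. The four emails, the `threshold` value of 0.2, and the function names `jaccard` and `group_emails` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Similarity between two word sets: shared words / all distinct words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_emails(emails, threshold=0.2):
    """Greedily place each email in the first group whose first email is similar enough."""
    groups = []  # each group is a list of email texts
    for text in emails:
        words = set(text.split())
        for group in groups:
            representative = set(group[0].split())
            if jaccard(words, representative) >= threshold:
                group.append(text)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([text])
    return groups

# Hypothetical unlabeled inbox: no spam / not spam labels, no topics given.
emails = [
    "flight booking confirmation for paris",
    "team meeting moved to friday",
    "your paris flight itinerary",
    "agenda for friday team meeting",
]

groups = group_emails(emails)
for group in groups:
    print(group)
```

Note that the code never sees a label or a topic name. The two groups (travel and meetings) emerge only from which emails share words, which is what makes the problem unsupervised.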
