What is classification?

Classification is one of the two main types of supervised machine learning task; the other is regression.

Key point to remember: supervised learning tasks use features to predict targets, or, in non-tech speak, they use attributes to predict something. For instance, we can take a basketball player's height, weight, age, foot speed, and other attributes to predict how many points they'll score or whether they will be an all-star.

So what's the difference between the two?

Regression tasks predict a continuous value.
- For example, how many points someone will score.
Classification tasks predict a non-continuous value, i.e., a category.
- For example, whether a player will be an all-star.
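
To make the split concrete, here is a minimal sketch in plain Python. The player data is made up for illustration, and the nearest-neighbour approach is just one simple technique I picked for the demo, not the only way to do either task. The point is that the same features feed both predictions; only the type of target changes.

```python
from collections import Counter
from math import dist
from statistics import mean

# Hypothetical toy data: (height_cm, weight_kg, age) for four players.
features = [
    (198, 95, 24),
    (211, 110, 29),
    (185, 82, 22),
    (203, 102, 31),
]
points_per_game = [18.5, 24.0, 9.5, 21.0]  # continuous target -> regression
is_all_star = ["no", "yes", "no", "yes"]    # categorical target -> classification


def nearest(query, k=3):
    """Return the indices of the k known players most similar to `query`."""
    order = sorted(range(len(features)), key=lambda i: dist(query, features[i]))
    return order[:k]


def predict_points(query, k=3):
    """Regression: average the neighbours' continuous targets."""
    return mean(points_per_game[i] for i in nearest(query, k))


def predict_all_star(query, k=3):
    """Classification: take a majority vote over the neighbours' labels."""
    votes = Counter(is_all_star[i] for i in nearest(query, k))
    return votes.most_common(1)[0][0]


new_player = (205, 104, 27)
print(predict_points(new_player))    # a number on a continuous scale
print(predict_all_star(new_player))  # one label from a fixed set
```

Notice that the regression answer can land anywhere between the known values, while the classification answer is always one of the labels we started with.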

How do I know which technique to use?

Answer the following question: "Does my target variable have an order to it?"

For example, my project predicting the recommended age of a reader was a regression task because I was predicting a precise age (e.g., 4 years old). If I had been attempting to identify whether a book was suitable for teens or not, then it would have been a classification task, since the answer would have been either yes or no.

OK, so classification is only for yes/no, true/false, cat/dog problems, right?

Nope, those are just the easy examples.

Example 1: Sorting People into Groups

Imagine a scenario where you get a new batch of students every year and have to sort them into houses based on their personality traits.

In this situation, the houses do not have any type of sequence or ranking to them. Sure, Harry definitely didn't want to be housed in Slytherin, and the Sorting Hat clearly took that into consideration, but that doesn't mean Slytherin is closer to Gryffindor in the way that 25 is closer to 30 than it is to 19.

Example 2: Applying Labels

Similarly, if we had a data set containing the ingredients of dishes and attempted to predict the country of origin, we'd be solving a classification problem. Why? Because country names have no numerical order. We can say that Russia is the largest country on Earth or that China has the most people, but those are attributes of the country (i.e., land size and population), which are not intrinsic to the name of the country.
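
A quick sketch of why this matters in practice. If we hand labels like these to a model as plain integers, we accidentally invent an order. The country list below is hypothetical, and one-hot encoding is a common workaround I'm using for illustration, not something the original project necessarily did.

```python
# Hypothetical labels for four dishes.
countries = ["Italy", "Japan", "Mexico", "Italy"]
classes = sorted(set(countries))  # ['Italy', 'Japan', 'Mexico']

# Naive integer codes: 0, 1, 2. The numbers suggest Italy is "closer"
# to Japan than to Mexico, which is an artefact of alphabetical order.
int_codes = [classes.index(c) for c in countries]


def one_hot(label):
    """Encode a label as a vector with a single 1 in its own slot."""
    return [1 if label == c else 0 for c in classes]


def differing_slots(a, b):
    """Count the positions where two encoded vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


print(int_codes)
print([one_hot(c) for c in countries])

# With one-hot vectors, every pair of different countries is equally far apart.
print(differing_slots(one_hot("Italy"), one_hot("Japan")))
print(differing_slots(one_hot("Italy"), one_hot("Mexico")))
```

With the integer codes, the gap between Italy and Mexico is twice the gap between Italy and Japan. With the one-hot vectors, no country is any nearer to another, which matches what we actually know about the labels.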

Got it! Numbers are regression, words are classification

Sorry, but no.

Think back to the book recommendation project I mentioned earlier. Answer this: if I wanted to predict whether a book was for:

- small children (2-5 years old)
- primary children (6-10)
- tweens (11-12)
- young adults (13-17)
- adults (18+)

what would you use: regression or classification?

Well, given that the labels have a clear order, you'd definitely want to treat this as a regression problem by coding 'small children' as '1' and 'adults' as '5'¹.
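
Here is a rough sketch of that coding step, assuming the five age bands listed above. The function names and the round-and-clamp decoding are my own choices for illustration; a real project might handle the in-between predictions differently.

```python
# Age bands from the list above: (label, youngest age, oldest age or None).
AGE_BANDS = [
    ("small children", 2, 5),
    ("primary children", 6, 10),
    ("tweens", 11, 12),
    ("young adults", 13, 17),
    ("adults", 18, None),
]

# Ordinal codes that keep the order: 'small children' -> 1 ... 'adults' -> 5.
ORDINAL_CODE = {name: i for i, (name, _, _) in enumerate(AGE_BANDS, start=1)}
LABEL_FOR_CODE = {code: name for name, code in ORDINAL_CODE.items()}


def band_for_age(age):
    """Return the band label that a reader's age falls into."""
    for name, low, high in AGE_BANDS:
        if age >= low and (high is None or age <= high):
            return name
    raise ValueError(f"no band covers age {age}")


def label_for_prediction(value):
    """Turn a regression output such as 2.6 back into the nearest band label."""
    code = min(5, max(1, round(value)))  # clamp to the valid 1..5 range
    return LABEL_FOR_CODE[code]


print(ORDINAL_CODE[band_for_age(4)])  # code for a 4-year-old reader
print(label_for_prediction(2.6))      # nearest band to a prediction of 2.6
```

Because the codes preserve the order of the bands, a prediction of 2.6 means something: it sits between 'primary children' and 'tweens', and closer to 'tweens'.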

In Sum

Always start by asking 'What am I attempting to predict?' Once that question is answered, the rest becomes far simpler because, as Jung said, "To ask the right question is already half the problem solved."

C is for Classification