When it comes to machine learning classification tasks, the more data available to train algorithms, the better. In supervised learning, this data must be labeled with respect to the target class — otherwise, these algorithms wouldn’t be able to learn the relationships between the independent and target variables. However, there are a couple of issues that arise when building large, labeled data sets for classification:

  1. Labeling data can be time-consuming. Let’s say we have 1,000,000 dog images that we want to feed to a classification algorithm, with the goal of predicting whether each image contains a Boston Terrier. If we want to use all of those images for a supervised classification task, we need a human to look at each image and determine whether a Boston Terrier is present. While I do have friends (and a wife) who wouldn’t mind scrolling through dog pictures all day, it probably isn’t how most of us want to spend our weekend.
  2. **Labeling data can be expensive.** See reason 1: to get someone to painstakingly scour 1,000,000 dog pictures, we’re probably going to have to shell out some cash.

So, what if we only have enough time and money to label some of a large data set, and choose to leave the rest unlabeled? Can this unlabeled data somehow be used in a classification algorithm?

This is where semi-supervised learning comes in. In taking a semi-supervised approach, we can train a classifier on the small amount of labeled data, and then use the classifier to make predictions on the unlabeled data. Since these predictions are likely better than random guessing, the unlabeled data predictions can be adopted as ‘pseudo-labels’ in subsequent iterations of the classifier. While there are many flavors of semi-supervised learning, this specific technique is called self-training.

Self-Training

On a conceptual level, self-training works like this:

**Step 1:** Split the labeled data instances into train and test sets. Then, train a classification algorithm on the labeled training data.
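
To make this concrete, here’s a minimal sketch of Step 1 using scikit-learn, with a logistic regression standing in for whatever classifier you prefer. `X_labeled` and `y_labeled` are placeholders for your labeled features and targets (assumed here to be NumPy arrays):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split the labeled instances into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled, test_size=0.25, random_state=42
)

# Train an initial classifier on the labeled training data only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
```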

**Step 2:** Use the trained classifier to predict class labels for all of the unlabeled data instances. Of these predicted class labels, the ones with the highest probability of being correct are adopted as ‘pseudo-labels’.

(A couple of variations on Step 2: a) All of the predicted labels can be adopted as ‘pseudo-labels’ at once, without considering probability, or b) The ‘pseudo-labeled’ data can be weighted by confidence in the prediction.)
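
Continuing that sketch, Step 2 might look something like this, where `X_unlabeled` is a placeholder for the unlabeled features and the 0.99 probability cutoff is an arbitrary choice for illustration:

```python
# Predict class labels and probabilities for the unlabeled instances
probs = clf.predict_proba(X_unlabeled)
pred_labels = clf.predict(X_unlabeled)

# Keep only the predictions the classifier is most confident about
confidence = probs.max(axis=1)
threshold = 0.99  # arbitrary cutoff, purely for illustration
confident_mask = confidence > threshold

X_pseudo = X_unlabeled[confident_mask]
y_pseudo = pred_labels[confident_mask]
```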

**Step 3:** Concatenate the ‘pseudo-labeled’ data with the labeled training data. Re-train the classifier on the combined ‘pseudo-labeled’ and labeled training data.
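
Step 3 is then just a concatenation and a re-fit, reusing the arrays from the sketches above:

```python
import numpy as np

# Combine the original labeled training data with the pseudo-labeled data
X_combined = np.concatenate([X_train, X_pseudo])
y_combined = np.concatenate([y_train, y_pseudo])

# Re-train the classifier on the combined data
clf.fit(X_combined, y_combined)
```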

**Step 4:** Use the trained classifier to predict class labels for the labeled test data instances. Evaluate classifier performance using your metric(s) of choice.
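
For Step 4, evaluation is the usual held-out test routine; here the F1 score is used as one possible metric, continuing the same sketch:

```python
from sklearn.metrics import f1_score

# Evaluate the re-trained classifier on the labeled test set from Step 1
test_preds = clf.predict(X_test)
print(f"Test F1 score: {f1_score(y_test, test_preds):.3f}")
```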

(Steps 2 through 4 can be repeated until no more predicted class labels from Step 2 meet the probability threshold, or until no unlabeled data remains.)
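
Putting it all together, one way to wrap the whole procedure in a loop looks roughly like the following. This is a bare-bones sketch, assuming NumPy arrays and a classifier that exposes `predict_proba`; the function name and defaults are my own, not from any library:

```python
import numpy as np
from sklearn.metrics import f1_score

def self_train(clf, X_train, y_train, X_unlabeled, X_test, y_test,
               threshold=0.99, max_iterations=10):
    """A bare-bones self-training loop: train, pseudo-label, repeat."""
    X_unlabeled = np.asarray(X_unlabeled)

    for i in range(max_iterations):
        # Train (or re-train) on the current labeled + pseudo-labeled pool
        clf.fit(X_train, y_train)

        # Evaluate on the held-out labeled test set
        score = f1_score(y_test, clf.predict(X_test))
        print(f"Iteration {i}: test F1 = {score:.3f}")

        if len(X_unlabeled) == 0:
            break  # no unlabeled data left to pseudo-label

        # Pseudo-label only the most confident unlabeled predictions
        probs = clf.predict_proba(X_unlabeled)
        confident = probs.max(axis=1) > threshold
        if not confident.any():
            break  # nothing meets the probability threshold

        X_train = np.concatenate([X_train, X_unlabeled[confident]])
        y_train = np.concatenate([y_train, clf.predict(X_unlabeled)[confident]])
        X_unlabeled = X_unlabeled[~confident]

    return clf
```

A call like `self_train(LogisticRegression(max_iter=1000), X_train, y_train, X_unlabeled, X_test, y_test)` would run the whole procedure end to end.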

Ok, got it? Good! Let’s work through an example.

Example: Using Self-Training to Improve a Classifier

To demonstrate self-training, I’m using Python and the surgical_deepnet data set, available here on Kaggle. This data set is intended to be used for binary classification, and contains data for 14.6k+ surgeries. The attributes are measurements like bmi, age, and a variety of others, while the target variable, complication, records whether the patient suffered complications as a result of surgery. Clearly, being able to accurately predict whether a patient will suffer complications from a surgery would be in the best interest of healthcare and insurance providers alike.
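
Loading the data might look something like this (the file name `surgical_deepnet.csv` is an assumption based on the data set’s name; adjust the path to wherever you downloaded it from Kaggle):

```python
import pandas as pd

# Load the surgical_deepnet data set downloaded from Kaggle
df = pd.read_csv('surgical_deepnet.csv')

# Separate the features from the binary target
X = df.drop(columns=['complication'])
y = df['complication']

print(df.shape)
print(y.value_counts(normalize=True))  # check the class balance
```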
