Collaborative filtering is a tool that companies are increasingly using. Netflix uses it to recommend shows for you to watch. Facebook uses it to recommend who you should be friends with. Spotify uses it to recommend playlists and songs. It’s incredibly useful in recommending products to customers.

In this post, I construct a collaborative filtering neural network with embeddings to understand how users would feel towards certain movies. From this, we can recommend movies for them to watch.

The dataset is taken from here. This code is loosely based on the fastai notebook.

First, let's load the ratings and movie data with pandas.

import pandas as pd

# load the ratings and movie metadata
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')
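
Assuming the standard MovieLens layout (an assumption about this particular download), ratings.csv gives us one row per (user, movie, rating) triple:

print(ratings.columns.tolist())  # expected: ['userId', 'movieId', 'rating', 'timestamp']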

Then we’ll do the same thing for movie ids as well.

# map each raw user id to a contiguous index starting at 0
u_uniq = ratings.userId.unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])
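
Then we'll do the same thing for movie ids, mirroring the mapping above.

# map each raw movie id to a contiguous index starting at 0
m_uniq = ratings.movieId.unique()
movie2idx = {o: i for i, o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])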

We’ll need to get the number of users and the number of movies.

n_users = int(ratings.userId.nunique())
n_movies = int(ratings.movieId.nunique())

First, let's create some random weights inside a class that inherits from nn.Module. In the constructor we need to call super().__init__(), which initialises the nn.Module base class for us; using super() lets us avoid naming the base class explicitly. This makes the code more maintainable.

These weights will be uniformly distributed between 0 and 0.05. The trailing underscore in uniform_ denotes an in-place operation in PyTorch.
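
As a quick illustration of that convention (a toy tensor, not part of the model):

import torch

t = torch.zeros(3)
t.uniform_(0, 0.05)  # fills t in place with samples from U(0, 0.05); no new tensor is allocated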

from torch import nn

n_factors = 50  # width of each embedding; 50 is a typical choice of latent dimensionality

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        # one row of latent factors per user and per movie
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        # initialise the weights uniformly between 0 and 0.05
        self.u.weight.data.uniform_(0, 0.05)
        self.m.weight.data.uniform_(0, 0.05)
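The class above only builds the embeddings. To actually predict a rating, you would pair it with a forward pass. Here's a minimal sketch that would sit inside EmbeddingDot, assuming we score each (user, movie) pair by the dot product of its two embedding vectors, which is what the name suggests (the exact signature is an assumption, not necessarily the original notebook's):

    def forward(self, users, movies):
        # look up the latent-factor vectors for this batch of ids
        u, m = self.u(users), self.m(movies)
        # predicted score = dot product of user and movie factors
        return (u * m).sum(1)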

So what do these embedding matrices and latent factors actually do?

We're creating an embedding matrix for our user ids and another for our movie ids. An embedding is basically an array lookup. When we multiply our one-hot encoded user ids by our weights, most calculations cancel to 0 (0 * number = 0), and all we're left with is a particular row of the weight matrix. That's just an array lookup.

So we don’t need the matrix multiply and we don’t need the one-hot encoded array. Instead, we can just do an array lookup. This reduces memory usage and speeds up the neural network. It also reveals the intrinsic properties of the categorical variables. This idea was applied in a recent Kaggle competition and achieved 3rd place.
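
Here's a minimal sketch of that equivalence, using toy sizes rather than our real n_users:

import torch
from torch import nn
import torch.nn.functional as F

emb = nn.Embedding(5, 3)  # 5 "users", 3 latent factors

user_id = torch.tensor([2])
one_hot = F.one_hot(user_id, num_classes=5).float()

via_matmul = one_hot @ emb.weight  # one-hot vector times the weight matrix
via_lookup = emb(user_id)          # plain array lookup

assert torch.allclose(via_matmul, via_lookup)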

The size of these embedding matrices is set by n_factors, which determines the number of latent factors we learn for each user and movie.
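
To make the shapes concrete, using the n_factors = 50 defined above:

model = EmbeddingDot(n_users, n_movies)
print(model.u.weight.shape)  # (n_users, n_factors)
print(model.m.weight.shape)  # (n_movies, n_factors)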

#machine-learning #collaborative-filtering #deep-learning #pytorch
