In this story, we have a specific goal: given the plot of a movie, how can we discover its soul? How can we locate the most meaningful words, which make us want to invest our time in watching a film or a TV series? What concepts tend to resonate with us and which symbols get us on the picture’s wavelength?

In this article, we explore how to use Non-negative Matrix Factorization (NMF) to model the genres or topics of a set of movies. Moreover, we determine the most important words for each genre and come up with latent features that describe each film and can be used to solve other problems via transfer learning.


What is NMF?

NMF is a matrix factorization technique, much like Singular Value Decomposition (SVD), but we constrain the resulting matrices to be non-negative instead of orthogonal. Given a matrix X, we want to decompose it into two matrices W and H, so that X ≈ W × H. We use the approximately-equal sign because, unlike SVD, NMF yields an approximation of the original matrix. Moreover, every element of W and H is non-negative.

If matrix X holds images of faces, W would capture the facial features of those faces and H the relative importance of those features in each image

Intuitively, we can think of the two matrices as follows: assume that we have our matrix X, in which every column is a vectorized image of a face. W expresses the facial features (e.g. noses, eyebrows, facial hair, etc.) and H captures the relative importance of those features in each image.
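To make the decomposition concrete, here is a minimal sketch using scikit-learn's `NMF`. The matrix values below are invented for illustration; any non-negative data works the same way.

```python
import numpy as np
from sklearn.decomposition import NMF

# X: 4 "documents" x 6 "terms" (toy, non-negative data)
X = np.array([
    [5, 3, 0, 1, 0, 0],
    [4, 0, 0, 1, 0, 0],
    [1, 1, 0, 5, 4, 0],
    [0, 0, 5, 4, 0, 3],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # (4, 2): weight of each latent feature per row
H = model.components_        # (2, 6): latent features expressed over the columns

# Both factors are non-negative, and W @ H only approximates X
assert (W >= 0).all() and (H >= 0).all()
print(np.round(W @ H, 1))
```

Note that, unlike SVD, the reconstruction `W @ H` does not reproduce X exactly; the rank-2 approximation trades fidelity for interpretable, non-negative parts.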


Now that we have a perspective on what NMF accomplishes, we are ready to get our hands dirty. But first, we take a brief diversion into word normalization and TF-idf.

Density vs Significance

_Term Frequency — Inverse Document Frequency_ (TF-idf) is a weighting measure often used in information retrieval. Its role is to measure how important a word is to a document in a corpus.

TF-idf is composed of two terms. The first one, Term Frequency, computes the normalized frequency of a word appearing in a document: the number of times a word appears in a document, divided by the total number of words in that document.

TF measures how frequently a word occurs in a document, while idf computes how important a word is

The second term is the Inverse Document Frequency (idf). This is computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears. Thus, the first term gives us the frequency of each word, while the second one provides its importance by weighing down frequent words and scaling up rare ones.
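Scikit-learn bundles both terms into a single transformer. A small sketch on an invented corpus (the sentences below are placeholders, not real movie plots):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "a spy chases a spy across berlin",
    "a detective chases a killer",
    "two friends drive across the country",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: 3 docs x vocabulary
vocab = vectorizer.vocabulary_

# "spy" is frequent in doc 0 but appears in no other document, so it is
# weighted higher there than "chases", which occurs in two documents
print(tfidf[0, vocab["spy"]] > tfidf[0, vocab["chases"]])
```

This is exactly the behavior described above: term frequency rewards density within a document, while the idf factor rewards rarity across the corpus.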

Back to Our Task

Now that we have a solid understanding of NMF and TF-idf, we are ready to apply ourselves to the problem at hand: how do we uncover more from just a movie's plot?


Read a Movie’s Plot Like a Data Scientist