The Natural Language Processing (NLP) group at Stanford University has made the reading list from its CS 384 seminar on Ethics and Social Issues in Natural Language Processing publicly available, and so I have been on a bit of a reading binge trying to learn more about this fascinating and important topic.

In this article, I want to explore the use of analogies for identifying biases in word embeddings by focusing on two papers on the topic: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (2016) [1] and "Fair Is Better than Sensational: Man Is to Doctor as Woman Is to Doctor" (2020) [2]. The first, which I will refer to as "the paper on debiasing," is from the Stanford NLP list; the second, which I will refer to as "the paper on fairness," was published in Computational Linguistics (MIT Press).

But first things first.

What is a word embedding?

A word embedding is a vector representation of a word that conveys the word's meaning to a computer. With word embeddings, an algorithm can take a dense numerical representation of a word as input, rather than relying only on raw word counts.
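
To make that concrete, here is a minimal sketch in Python (the vectors are made-up toy values, not the output of a trained model):

```python
import numpy as np

# Each word maps to a dense vector; a downstream algorithm receives these
# numbers instead of raw word counts. (Toy values for illustration only.)
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.0, 0.1, 0.2]),
}

print(embeddings["queen"])  # the numerical input a model would consume
```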

Word embeddings have been researched in some depth for machine learning applications, probably because they have some interesting (and perhaps unexpected) properties: (1) semantically similar words tend to have vectors that are close to each other in the vector space, and (2) arithmetic on word embeddings tends to capture relationships in meaning (e.g., king − man + woman ≈ queen).
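
Both properties can be illustrated with a few hand-picked toy vectors and cosine similarity (real embeddings such as word2vec or GloVe have hundreds of dimensions, but the geometry works the same way):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hand-picked toy vectors, chosen so the geometry shows both properties.
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.8, 0.9])
man   = np.array([0.1, 0.7, 0.1])
woman = np.array([0.1, 0.7, 0.9])
apple = np.array([0.0, 0.1, 0.2])

# (1) Semantically related words sit closer together than unrelated ones.
print(cosine_similarity(king, queen))  # ~0.85
print(cosine_similarity(king, apple))  # ~0.37

# (2) Vector arithmetic captures relations: king - man + woman lands on queen
# (exactly 1.0 here only because the toy vectors were chosen that way).
print(cosine_similarity(king - man + woman, queen))  # 1.0
```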

Such relationships between words can also be described using analogies (e.g., man is to king as woman is to queen, commonly written man : king :: woman : queen), and it seems as if many researchers have had their share of fun using word embeddings to fill in analogies (e.g., man : king :: woman : x). However, while it may be interesting to set up an analogy and see which word the algorithm selects to replace x, such research can also reveal dangerous biases encoded in our language.
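
As a sketch of how such analogy queries are often run in practice, the snippet below uses gensim's pretrained GloVe vectors (the specific model name, glove-wiki-gigaword-50, is one of several options in gensim's downloadable data catalog and requires a one-time download):

```python
# A sketch of analogy completion with gensim's pretrained GloVe vectors.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # one-time download

# Fill in the analogy man : king :: woman : x
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', ...)]
```

Note that most_similar excludes the query words from the candidate answers by default, which is part of what the paper on fairness critiques: if x is not allowed to be "doctor" in man : doctor :: woman : x, the model is forced to return some other, possibly sensational-looking, word.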

#artificial-intelligence #ai #language #data-science #nlp
