Deep Learning Model Implementation: Embeddings for Categorical Variables. Using embeddings to explore US house prices and migration patterns

For many complex prediction problems, Deep Learning (DL) methodologies such as CNNs, NLP models, and fully connected networks offer the highest levels of performance. This typically comes at the cost of understanding the role of the explanatory features in prediction outcomes. Machine Learning (ML) methodologies based on statistical methods offer a variety of techniques for understanding the role of predictors, including variable selection, relative importance and, in some cases, model coefficients (as discussed in Part 1 of this series). The potential to achieve similar insights on the DL side has been less explored and, given the complexities involved, this will likely remain true. However, there are ways to gain some clues as to what is happening in the proverbial black box. For example, in CNN analysis of images, class activation maps are a way to identify which areas of a picture are most active or influential in its classification.


For time-series or other structured data, embeddings offer an alternative to the standard one-hot encoding or ‘dummy’ variable representation of categorical features, one that can lead to insights about how those features affect the dependent variable (Guo and Berkhahn 2016). Embedding matrices are estimated as part of the model fit, and their dimension is typically chosen to be somewhat less than the cardinality of the variable. For example, an embedding matrix representing days of the week could have the dimensions 7 x 4, so that each day is represented by a 4-dimensional embedding vector. This allows patterns, similarities and differences among days to be captured along 4 dimensions.
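To make the 7 x 4 days-of-the-week example concrete, here is a minimal NumPy sketch of what an embedding lookup does. It is an illustration only, not the model used in this series: the names `embedding_matrix`, `embed`, `one_hot` and `cosine_similarity` are hypothetical, and the weights are random placeholders standing in for values that would normally be learned during training.

```python
import numpy as np

# Hypothetical setup: 7 categories (days of the week), 4 embedding dimensions.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
N_CATEGORIES, EMBED_DIM = len(DAYS), 4

rng = np.random.default_rng(seed=0)
# In a real model these weights are learned along with the rest of the
# network; random values are used here only to show the shapes involved.
embedding_matrix = rng.normal(size=(N_CATEGORIES, EMBED_DIM))


def embed(day_index: int) -> np.ndarray:
    """Return the embedding vector for one day (one row of the matrix)."""
    return embedding_matrix[day_index]


def one_hot(day_index: int) -> np.ndarray:
    """Return the standard one-hot ('dummy') encoding for one day."""
    vec = np.zeros(N_CATEGORIES)
    vec[day_index] = 1.0
    return vec


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, one common way to compare two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


wed = DAYS.index("Wed")
sat = DAYS.index("Sat")

wed_vector = embed(wed)  # shape (4,)
# Looking up a row is equivalent to multiplying the one-hot vector by the matrix.
wed_via_one_hot = one_hot(wed) @ embedding_matrix

print("Wed embedding:", wed_vector)
print("Lookup matches one-hot product:", np.allclose(wed_vector, wed_via_one_hot))
print("Wed vs Sat similarity:", cosine_similarity(wed_vector, embed(sat)))
```

With trained rather than random weights, the similarity between two rows becomes informative: days that affect the dependent variable in similar ways would be expected to end up with nearby vectors.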

