Save time, save resources, and stay healthy with data exploration that goes beyond means, distributions, and correlations: leverage PCA to see through the surface of your variables. It saves time and resources because it uncovers data issues before an hour of model training is wasted, and it is good for a programmer’s health, since it trades data worries for something more enjoyable. For example, a well-proven machine learning model might fail because the data is effectively one-dimensional, has insufficient variance, or suffers from related issues. PCA offers valuable insights that make you confident about your data’s properties and its hidden dimensions.

This article shows how to leverage PCA to understand key properties of a dataset, saving time and resources down the road, which ultimately leads to a happier, more fulfilled coding life. I hope this post helps you apply PCA in a consistent way and understand its results.


TL;DR

PCA provides valuable insights that reach beyond descriptive statistics and help to discover underlying patterns. Two PCA metrics indicate (1) how many components capture the largest share of variance (explained variance) and (2) which features correlate with the most important components (factor loadings). These metrics let you cross-check earlier steps in the project workflow, such as data collection, which can then be adjusted. As a shortcut and ready-to-use tool, I provide the function do_pca(), which conducts a PCA on a prepared dataset so you can inspect its results within seconds in this notebook or this script.
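To give a feel for what such a helper could look like before opening the notebook, here is a minimal sketch of a do_pca()-style function built on scikit-learn's PCA. The signature, the StandardScaler preprocessing, and the returned objects are assumptions for illustration; the actual do_pca() in the linked notebook and script may differ.

```python
# Minimal sketch of a do_pca()-style helper (assumes a numeric pandas DataFrame
# and scikit-learn; the notebook's actual implementation may differ).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def do_pca(df: pd.DataFrame, n_components: int = None):
    """Fit a PCA on a numeric DataFrame and report the two key metrics:
    explained variance per component and factor loadings per feature."""
    # Standardize so features with large scales do not dominate the components.
    X = StandardScaler().fit_transform(df.values)

    pca = PCA(n_components=n_components)
    pca.fit(X)

    # 1) Explained variance: share of total variance captured by each component.
    explained = pd.Series(
        pca.explained_variance_ratio_,
        index=[f"PC{i + 1}" for i in range(pca.n_components_)],
        name="explained_variance_ratio",
    )

    # 2) Factor loadings: how strongly each original feature loads on each component.
    loadings = pd.DataFrame(
        pca.components_.T * np.sqrt(pca.explained_variance_),
        index=df.columns,
        columns=explained.index,
    )
    return pca, explained, loadings

# Example usage (assuming `prepared_df` is your cleaned, numeric dataset):
# pca, explained, loadings = do_pca(prepared_df)
# print(explained.cumsum())  # how many components cover, e.g., 90% of variance
# print(loadings.round(2))   # which features drive the leading components
```

The cumulative explained variance answers the "how many dimensions does my data really have" question, while the loadings table shows which original features sit behind the leading components.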

#factor-loading #data-science-toolbox #explained-variance #data-exploration #pca #data-analysis
