Do you like reading crime stories? Personally, I adore them. When I was younger, I admired Sherlock Holmes and Hercule Poirot so much that I wanted to become like them when I grew up :)

What were these two characters best at? They could infallibly profile people and collect the facts based on that profiling, so in the end everything looked easy and obvious, as if anyone could do it!

Honestly, my detective skills were nowhere near good enough for that “job”. So, instead of profiling people, I decided to profile data…

What is Data Profiling?

According to Ralph Kimball, data profiling is a systematic upfront analysis of the content of a data source. There are multiple keywords in this sentence, but let’s just focus on a few of them:

  • You have to know your data before you can start to work with it (upfront)
  • You have to check all aspects of your data, from checking the memory footprint and cardinality to complex business rules (systematic)
  • You should perform data profiling on the source data. There is a famous saying in the data warehousing world: Garbage IN, garbage OUT! In simple words, if your data is of bad quality in the source itself, you can’t expect your reports to display accurate numbers (data source)

There are multiple types of data profiling techniques:

  1. Completeness — how many blanks/nulls do I have in my column?
  2. Uniqueness — how many unique values (cardinality) do I have in my column? Do I have any duplicates? Is it allowed to have duplicates?
  3. Value distribution — distribution of records across different values for a specific attribute
  4. Range — finding the minimum, maximum, average value within the column
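To make these four checks concrete, here is a minimal sketch in plain Python (not Power BI) that computes them for a single column. The function name `profile_column` and the sample `ages` list are made up for illustration, and treating `None` and empty strings as blanks is an assumption; your own definition of "blank" may differ.

```python
from collections import Counter
from statistics import mean


def is_blank(value):
    """Assumption: None and empty/whitespace-only strings count as blanks."""
    return value is None or (isinstance(value, str) and not value.strip())


def profile_column(values):
    """Return basic profiling metrics for one column given as a list."""
    present = [v for v in values if not is_blank(v)]
    counts = Counter(present)  # value -> number of occurrences
    # Range metrics only make sense for numeric values (bool is excluded)
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        # 1. Completeness
        "rows": len(values),
        "blanks": len(values) - len(present),
        # 2. Uniqueness
        "distinct": len(counts),
        "duplicated_values": [v for v, n in counts.items() if n > 1],
        # 3. Value distribution
        "distribution": dict(counts),
        # 4. Range
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
    }


# Hypothetical sample column
ages = [34, 27, None, 34, 51, 27, 34, None]
profile = profile_column(ages)
print(profile)
```

For the sample column, the result reports 2 blanks out of 8 rows, 3 distinct values, and a range of 27 to 51. Whether the duplicates are a problem is a business question the code can't answer for you.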


Go, get’em, Sherlock! Doing Data Profiling in Power BI like a PRO