Data Governance for Data Scientists?

Have you ever asked yourself why Data Governance has such a huge impact on your Machine Learning models? Let me explain it in 5 minutes.

Well, if you are thinking this post will be a boring story full of the technical concepts you usually hear about and that put you to sleep, I will try to make it a little different, so that after reading it you understand why this topic has a huge impact on a Data Scientist’s activities.

Okay, let’s rock!!!

First of all… What is Data Governance?

Data Governance has a lot of meanings. I will keep my word and define it in a few words:

“Set of data-focused best practices that build quality deliverables”

Well, now you know my own definition, and it is easy to grasp, as is its main purpose within the information ecosystem of any company, entity, business, etc.

Is Data Governance a project?

Sorry, but it is not a project, buddy… Data Governance is a function that you need to build into almost all your activities. In a few words, whether you are developing a Machine Learning model or a Data Engine, you will produce an output, and that output is a transformation: you need to be able to explain why it is a certified deliverable.

How many components does it have?

In the previous explanation we already touched on 2 important Data Governance components:

  1. When you create a model or data engine, you are generating new entities that are going to live in your data ecosystem, and that makes you a **Data Owner or Custodian** (in Data Governance, this role is the most relevant because it helps to identify granular definitions, that is, entities). In addition, when you put your model in production you need to provide documentation about your input and output variables… As you can see, you are now aligned with the first component: “Data Content”.
  2. On the other hand, when you can explain to others how it can be useful for the company, you can also show the correct way to get that value. My dear, dear friend, now you know what “Data Quality” is.

Building on those components, let me mention the last two:

  3. In the development phase, you need to map the best data sources to the model’s needs. To get this part done, include a mapping in your documentation that explains the route your data takes from its origin to the deliverable (end to end)… And that is how we get our “Data Lineage”.

  4. And last but not least… “Data Availability”. This one does not need a deep explanation, because the main objective is to democratize your entities for whoever needs to use them across your organization.
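If you like to see things in code, here is a minimal sketch of how the first and third components could look next to a model. Everything in it is hypothetical and only for illustration: the `Variable` and `ModelRecord` classes, the `churn_model` name, and the `crm.customers` and `billing.invoices` sources are assumptions of mine, not part of any governance tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Variable:
    """One documented input or output of a model (Data Content)."""
    name: str
    dtype: str
    description: str
    source: str = ""  # upstream table it is read from (Data Lineage); hypothetical names


@dataclass
class ModelRecord:
    """Hypothetical governance record for one deliverable."""
    name: str
    owner: str  # the Data Owner or Custodian
    inputs: List[Variable] = field(default_factory=list)
    outputs: List[Variable] = field(default_factory=list)

    def undocumented(self) -> List[str]:
        """Names of variables that still lack a description."""
        return [v.name for v in self.inputs + self.outputs if not v.description.strip()]

    def lineage(self) -> Dict[str, str]:
        """Map each input variable to the source it comes from."""
        return {v.name: v.source for v in self.inputs}


record = ModelRecord(
    name="churn_model",
    owner="data-science-team",
    inputs=[
        Variable("tenure_months", "int", "Months since the customer signed up", "crm.customers"),
        Variable("monthly_spend", "float", "", "billing.invoices"),  # description missing on purpose
    ],
    outputs=[Variable("churn_probability", "float", "Predicted probability of churn")],
)

print(record.undocumented())  # ['monthly_spend']
print(record.lineage())
```

The point is not the classes themselves: a record like this, kept next to the model, is one simple way to answer "who owns it, what goes in, what comes out, and where does it come from".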


So now what?

Great, so far we know the main components of a Data Governance Program (yes, it is a program, one that helps to improve the company’s assets; keep in mind that data is the new oil of the 21st century).

Moreover, we could just reuse the main objective described above, but to make this conclusion more digestible, I would like to focus on its importance within our data ecosystem. Why? It is simple. For example, you can create the best model with the best algorithms for predicting a desired goal… but suppose all your data sources have the worst quality, and nobody knows how they were built or where to find the best entity for calculating your prediction. What do you think your output would be? (Well, you know the answer, buddy.)
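If you would rather find out before your model does, a quick check like the one below can help. It is only a rough sketch under my own assumptions: the `missing_rate` and `quality_report` helpers, the sample rows, and the 30% threshold are made up for illustration, and a real Data Quality process would look at much more than missing values.

```python
def missing_rate(rows, column):
    """Share of rows where `column` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def quality_report(rows, columns, max_missing=0.1):
    """Return the columns whose missing rate is above `max_missing`."""
    flagged = {}
    for column in columns:
        rate = missing_rate(rows, column)
        if rate > max_missing:
            flagged[column] = rate
    return flagged


# Made-up sample data, standing in for a training set
rows = [
    {"tenure_months": 12, "monthly_spend": 30.0},
    {"tenure_months": None, "monthly_spend": 45.5},
    {"tenure_months": 7, "monthly_spend": None},
    {"tenure_months": None, "monthly_spend": 20.0},
]

report = quality_report(rows, ["tenure_months", "monthly_spend"], max_missing=0.3)
print(report)  # {'tenure_months': 0.5}
```

Here half of the `tenure_months` values are missing, so the column is flagged before any algorithm gets to learn from it.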

