Building your models and analysis on solid foundations.
Image by XKCD
Garbage in, garbage out. So goes the familiar phrase, born in the early days of Computer Science, pressing the importance of validating your inputs.
You can have the most ingenious, elegant, well-tested function, model or application — but what comes out is only as good as what goes in.
Whenever we develop code, we make assumptions ahead of time about the nature of the data it will process. A simple arithmetic function might expect a single floating-point number. A demand forecasting model for a refreshments kiosk could expect the last five years of sales figures in a particular tabular form. And a self-driving car controller would take in different streams of data from many sensors around the vehicle.
If these assumptions are violated, one of three things can happen: the code fails loudly with an explicit error; it runs but quietly produces flawed results; or it acts on the bad data with real-world consequences.
The first scenario gives you a parachute, the second gives you a headache and the third gives you a multi-car pileup in a puddle of melted Cornetto.
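The "parachute" scenario is worth engineering for deliberately. A minimal sketch of defensive input validation, using a hypothetical square-root function as the simple arithmetic example above:

```python
def safe_sqrt(x):
    """Square root that validates its input and fails loudly rather than silently."""
    # Reject non-numeric input (bool is technically an int, so exclude it too).
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        raise TypeError(f"Expected a number, got {type(x).__name__}")
    # Reject values outside the function's domain.
    if x < 0:
        raise ValueError(f"Expected a non-negative number, got {x}")
    return x ** 0.5
```

Calling `safe_sqrt("nine")` raises a clear `TypeError` at the point of entry, rather than letting a bad value propagate into downstream calculations where the eventual failure is far harder to trace.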
Bad Data => Bad Decisions
As organisations become more data-mature, important business decisions rely more and more on data analysis and modelling. If the data on which those decisions are based is not up to scratch, then the reasoning built on it will be flawed, with potentially very expensive consequences.
This is why understanding Data Quality and being aware of the many ways the data you’re using could fall short of your requirements is so important.
Every piece of data ever created originated as an event or measurement in the real world: the output of a temperature sensor, the logging of a financial transaction, or someone typing their name into a web form. Accuracy describes the “degree to which the data correctly describes the ‘real world’ objects being described.”
In order for this to be achieved, each step on the journey from real-world to data-set has to correctly preserve the essence of the original.
A likely place for errors to occur is right at the start, during the measurement or recording of the event or object. In May 2020, the Australian government overestimated its spending commitments for a COVID-19 wage subsidy scheme by AUD $60 billion (USD $39 bn), due to mistakes made filling in a confusing application form. Employers were asked to state the number of employees they were enrolling in the scheme. However, in 0.1% of cases, they instead submitted the dollar value of the subsidies they required: 1,500 times the correct amount. These errors were missed, and their aggregated value flowed into a bill passed by parliament. A few weeks later the government announced its mistake, red-faced, but probably not too displeased at finding $60 bn down the back of the sofa.
In the above example, simply listing the top 100 or so claimants would likely have shed light on the issue. You’d expect to find large fast-food and retail brands, hotel chains and so on, but when you come across a local restaurant or small tour company claiming for thousands of employees, you know that there is a problem.
This highlights the importance of basic analysis and profiling to understand your dataset. Before you do any reporting or modelling, you need to be looking closely at each field to see if its values make sense, with no strange surprises.
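That kind of sanity check takes only a few lines. A sketch in pandas, using a hypothetical claims table (the employer names and column names are illustrative only, not real scheme data):

```python
import pandas as pd

# Hypothetical subsidy claims -- one suspiciously large small-business claim.
claims = pd.DataFrame({
    "employer": ["MegaBurger", "Corner Cafe", "Hotel Grande", "Tiny Tours"],
    "employees_enrolled": [12000, 4, 180, 6000],
})

# Summary statistics reveal an implausible spread within a single field.
print(claims["employees_enrolled"].describe())

# Listing the top claimants surfaces the anomaly directly:
# a small tour company claiming for thousands of employees.
print(claims.sort_values("employees_enrolled", ascending=False).head(10))
```

Neither step requires any modelling, just looking at the data before you trust it.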
Accuracy has a closely related cousin: precision. Stage times in the Tour de France are recorded in hours, minutes and seconds, but that resolution wouldn’t work in the 100m final at the Olympics. Precision can be lost during data type conversion, or limited by the sensitivity of the instrument used to take the initial measurement, and the result is less variance being available to your model.
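Precision loss through type conversion is easy to demonstrate. A small sketch with NumPy, using a sprint time (the value is illustrative) downcast to a half-precision float:

```python
import numpy as np

# A sprint time measured to thousandths of a second.
t = np.float64(9.583)

# Carelessly downcasting to half precision silently rounds the value:
# float16 carries only about three decimal digits of precision here.
t_low = np.float16(t)

print(f"original: {t:.4f}, after float16 round-trip: {float(t_low):.4f}")
```

The two printed values differ in the third decimal place, which is exactly the resolution a 100m final depends on. The same effect occurs when a database column, file format or API response uses a narrower numeric type than the source measurement.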