What it is really like to develop a model for a real-world business case
Originally published by Rebecca Vickery at https://towardsdatascience.com
Have you ever taken part in a Kaggle competition? If you are studying, or have studied, machine learning, it is fairly likely that you will have entered one at some point. It is a great way to put your model building skills into practice, and I spent quite a bit of time on Kaggle when I was studying.
If you have previously taken part in a machine learning competition then your workflow may have looked a little something like mine did:
However, if you are developing a machine learning model for a real-world business application, the process will look quite different. The first time I deployed a model in a business scenario, these differences were quite surprising, in particular how much time was spent at certain stages of the workflow. In this post, I want to describe the process for developing a model in a business setting, discuss these differences in detail and explain why they exist.
The workflow, in a business case, will have a lot more steps and may look a little something like this.
In the rest of this post, I am going to go into a little more detail on each of these steps.
In a Kaggle competition, the problem to solve is clearly defined upfront. For example, in one of the latest competitions, ‘Severstal: Steel Defect Detection’, you are given some curated data and the problem is clearly stated in the form of a data problem.
Today, Severstal uses images from high frequency cameras to power a defect detection algorithm. In this competition, you’ll help engineers improve the algorithm by localizing and classifying surface defects on a steel sheet.
In a real business setting, it is unlikely that you will be asked to build a specific type of model. It is more likely that a team or product manager will come to you with a business problem. It might look something like the following, and sometimes the problem will not even be this well defined.
The customer service team want to reduce the time it takes the business to reply to customer emails, live chats and phone calls in order to create a better experience for customers and improve customer satisfaction metrics.
From this business request, you will need to work with the team to plan out and design the best solution to this problem before you can make a start on building the actual model.
The data you work with will almost certainly not be clean. There will usually be missing values to deal with. Dates might be in the wrong format. There may be typos in values, incorrect data and outliers. Before you get anywhere near building the model, the chances are that a lot of time will be spent removing erroneous data and outliers and handling missing values.
Similarly, the data you need may not all come from one simple source. For a data science project, you might need to source data from a combination of any of the following: SQL queries (sometimes across multiple databases), third-party systems, web scraping, APIs or data from partners. As with data cleaning, this can often be a very time-consuming part of the project.
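As a rough illustration of this kind of clean-up, here is a minimal sketch in Python using only the standard library. The field names (`received`, `reply_minutes`), the two date formats and the outlier threshold are all assumptions made for the example and do not come from any real dataset. Outliers are flagged with the median absolute deviation, which is just one of several reasonable choices.

```python
from datetime import datetime
from statistics import median

# Assumed date formats for this example; real data may need more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return a date for any known format, or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except (ValueError, AttributeError):
            continue
    return None


def clean_records(rows, max_deviations=5.0):
    """Drop rows with missing or invalid values, then drop outliers."""
    parsed = []
    for row in rows:
        day = parse_date(row.get("received"))
        try:
            minutes = float(row.get("reply_minutes"))
        except (TypeError, ValueError):
            minutes = None
        # Skip rows with an unreadable date or a missing/negative duration.
        if day is None or minutes is None or minutes < 0:
            continue
        parsed.append({"received": day, "reply_minutes": minutes})

    if not parsed:
        return []

    # Flag outliers by distance from the median, measured in units of the
    # median absolute deviation (MAD).
    values = [r["reply_minutes"] for r in parsed]
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return parsed
    return [
        r for r in parsed
        if abs(r["reply_minutes"] - centre) <= max_deviations * mad
    ]
```

In practice each of these rules (which formats to accept, what counts as an outlier) would need to be agreed with the people who know the data, which is part of why this stage takes so long.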
In a machine learning competition, you generally have a given data set with a finite number of variables that you can use in your model. Feature selection and engineering are still necessary, but you have a limited number of variables to select from in the first place. When working on a real-world problem, you will more than likely have access to a vast range of variables. As a data scientist, you will have to select the data points that are likely to result in a good model to solve the problem. You will therefore need to use a combination of exploratory data analysis, intuition and domain knowledge to select the right data to build the model from.
Having spent all this time selecting, extracting and cleaning the data, the time you spend actually building the model will be very small in comparison. This is particularly true for version 1 of a model, which you may want to use as a baseline test, so at first you may spend only a small amount of time on model selection and tuning. Once the business value has been proven, you might then invest more time in optimising the model.
In a Kaggle competition, it is not unusual to spend weeks tuning a model to get a small improvement in the model score, as this small improvement is likely to boost you quite a few places up the leaderboard. For example, in the current Severstal competition, the difference in score between positions 1 and 2 on the leaderboard is currently only 0.002. It would definitely be worth spending time to improve your score by a very tiny amount, as it might bag you the top prize.
In business, the time that you spend on tuning a model is a cost. The company has to pay your wage for the number of days or weeks that you spend on this task. As with everything there needs to be a return on this investment in the form of business value. It is quite unlikely that a business use case for a model would provide enough value to justify spending days on improving the accuracy of a model by an increment of 0.002. In reality, you will tune the model until it is ‘good enough’ rather than ‘the best’.
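The trade-off can be written down as simple arithmetic. The sketch below is only an illustration: the daily cost and the monetary value attached to a metric improvement are hypothetical numbers, and in reality the value of an accuracy gain is rarely a clean linear figure.

```python
def net_value_of_tuning(days, daily_cost, metric_gain, value_per_unit_gain):
    """Estimated business value of a tuning effort minus what it costs.

    value_per_unit_gain is the (hypothetical) value to the business of
    raising the metric by 1.0, so a gain of 0.002 is worth 0.002 times that.
    """
    cost = days * daily_cost
    benefit = metric_gain * value_per_unit_gain
    return benefit - cost


# Hypothetical figures: five days of work at 400 per day.
small_gain = net_value_of_tuning(5, 400, 0.002, 100_000)
larger_gain = net_value_of_tuning(5, 400, 0.05, 100_000)
print(f"Gain of 0.002: net value {small_gain:.0f}")
print(f"Gain of 0.05:  net value {larger_gain:.0f}")
```

With these made-up numbers, the 0.002 improvement costs far more than it returns, while the 0.05 improvement pays for itself.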
This leads me on to my next point, which is that you won’t always use the best model or the newest deep learning methods. Quite often you will be able to deliver more business value with a simpler model such as linear regression, which takes less time to build (and therefore costs less) and is more explainable.
Your model will have to connect to some kind of endpoint, such as a website. The existing tech stack for this endpoint will have a lot of bearing on the type of model you deploy. Data scientists and software engineers will often both have to compromise in order to minimise the engineering work at each end. If you have a new model that would mean a change to an existing deployment process, or extensive engineering work, then you would need a very good business case for deploying it.
Once in production, the model will need to be monitored to ensure that it is performing as well as it did during training and validation, and to check for model degradation. The performance of a model usually degrades over time. This is because the data changes with time, for example as customer behaviour changes, and so your model may start to perform less well on this new data. For this reason, models also need to be retrained regularly to maintain business performance.
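One simple way to picture this kind of monitoring is a rolling comparison between live accuracy and the accuracy measured at validation time. The following is a minimal sketch under stated assumptions: it assumes a classification model, that the true outcome eventually becomes available for each prediction, and that accuracy is the metric the business cares about. The class name, window size and tolerance are hypothetical.

```python
from collections import deque


class PerformanceMonitor:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy  # accuracy seen in validation
        self.tolerance = tolerance                  # acceptable drop before alerting
        self.outcomes = deque(maxlen=window)        # True where prediction was right

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def live_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait for a full window so a handful of early errors does not
        # trigger an alert.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.live_accuracy() < self.baseline_accuracy - self.tolerance
```

A real monitoring setup would usually also watch the input data itself, since true outcomes often arrive late or not at all.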
Additionally, most businesses will have a test and learn cycle for deploying machine learning models. So your first model will usually be version 1 to form a baseline for performance. After that you will make improvements to the model, maybe changing features or tuning the model, to deploy a better version and test against the original model.
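To make the idea of testing a new version against the original concrete, here is a minimal sketch of one common approach, a one-sided two-proportion z-test on a success rate. It is an assumption on my part that the outcome can be counted as success or failure per case (for example, "customer query resolved"); the function name and the significance level are illustrative, and many teams use other test designs.

```python
from math import sqrt
from statistics import NormalDist


def challenger_beats_baseline(base_success, base_total,
                              new_success, new_total, alpha=0.05):
    """Return True if the new model's success rate is significantly higher.

    Uses a one-sided two-proportion z-test with a pooled standard error.
    """
    p_base = base_success / base_total
    p_new = new_success / new_total
    pooled = (base_success + new_success) / (base_total + new_total)
    std_error = sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / new_total))
    if std_error == 0:
        return False  # no variation at all, so nothing to distinguish
    z = (p_new - p_base) / std_error
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha


# Hypothetical counts from running both versions on 1,000 cases each.
print(challenger_beats_baseline(800, 1000, 850, 1000))
print(challenger_beats_baseline(800, 1000, 805, 1000))
```

The point of the test is to avoid replacing version 1 on the strength of a difference that could easily be noise.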
Both of these processes may be ongoing until the business case no longer exists for this model.
This post was partly inspired by this tweet from Chip Huyen.
Part of the reason why it is hard to recruit for machine learning roles is that many of the realities of deploying machine learning in business that I have discussed here are not taught in these courses. That is why I am such a fan of the practical first approach to learning and why I think that industrial placements, internships and junior data science roles are so important.
There is light at the end of the tunnel, as technology in this space is rapidly evolving and helping to automate processes such as data cleaning and model deployment. However, we still have a way to go, so it remains vitally important for data scientists to develop software engineering skills, improve their communication skills and have a resilient and persistent mindset, alongside the typical data scientist skillset.
Thanks for reading!