Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at the BBC, gave a session titled “Evolving Analytics in the Data Platform.” During the session, Garcia-Gil focused on how her team prepared and designed for failure.

The BBC maintains a data platform to better understand its audiences and help them make the most of what the BBC has to offer. The platform runs on AWS, and the analytics pipeline is the highest-scale data pipeline the team has built. It handles billions of messages per day, and its associated data lake is at petabyte scale.

As the team designed the pipeline, they planned for two types of failure: “known unknowns” and “unknown unknowns.” Known unknowns are failures that the team can predict might happen. Metrics, logging, and monitoring dashboards are the primary tools for handling this type of failure. Unknown unknowns, on the other hand, are failures that they can’t control or predict. Garcia-Gil says that it’s inefficient to try to anticipate and plan for each of these failures, so the team needs tooling to investigate those issues as they happen. This tooling has benefited Garcia-Gil’s team in several ways, such as abstracting detail away from those who do not need to be aware of it and significantly reducing the time to resolve incidents.
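To make the distinction concrete, here is a minimal Python sketch of how a message handler might treat the two failure types differently. It is an illustration only, not the BBC's implementation: the names `process`, `handle`, `metrics`, and `dead_letters` are hypothetical, and the in-memory counter and list stand in for a real metrics backend and a store of failed messages.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("pipeline_sketch")

# Hypothetical stand-ins: a real pipeline would use a metrics backend
# and a durable store for failed messages.
metrics = Counter()
dead_letters = []


def process(raw):
    """Parse one message and return its event name in lowercase."""
    record = json.loads(raw)
    return record["event"].lower()


def handle(raw):
    """Process a message, treating anticipated and unanticipated failures differently."""
    try:
        result = process(raw)
    except json.JSONDecodeError:
        # Known unknown: malformed input is expected now and then,
        # so it is counted and logged for dashboards and alerts.
        metrics["malformed_json"] += 1
        logger.warning("malformed message skipped")
        return None
    except Exception as exc:
        # Unknown unknown: nobody planned for this failure, so the
        # message and error are kept for later investigation.
        metrics["unexpected_error"] += 1
        dead_letters.append({"raw": raw, "error": repr(exc)})
        logger.exception("unexpected failure while processing message")
        return None
    metrics["processed"] += 1
    return result


for message in ['{"event": "Play"}', "not json", '{"event": 42}', "{}"]:
    handle(message)
```

In this sketch the anticipated failure only increments a counter, while each unanticipated one is preserved with its raw input, which is the kind of context that investigation tooling would need after the fact.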

The following diagram depicts the pipeline architecture, which is, in part, an outcome of designing for failure.


Designing for Failure in the BBC's Analytics Platform