Data analytics on AWS: The Complete Guide for A Great Start

With human society transforming to a wholly digitally connected community, the amount of generated and collected data is growing outstandingly. Nowadays, we have enormous amounts of data available about both internal business processes and customer’s behaviors.

This provides us the opportunity to exploit ever-larger quantities of information of ever-better quality at a reduced cost.

A fitting example is customer behavior profiling: businesses can gather information about what products customers use, how they use them, which aspects of each product are really relevant in their daily life, and a lot more.

Moving from everyday life to industry, it is now possible to obtain information on which components of machinery are subject to wear and act accordingly (predictive maintenance). It is also possible to obtain data about defective pieces to improve the production lines, and so on.

The ability to understand which data to extract, collect it efficiently, store it at low cost, and finally to analyze it is therefore what really makes the difference, and it is a significant competitive advantage.

The effort to analyze this astonishing amount of data becomes challenging using only traditional on-premise solutions. AWS provides a broad set of fully managed services to help you build and scale big data applications in the Cloud. Whether your application requires real-time streaming or batch processing, Amazon Web Services has the services needed to build a complete data analytics solution.

The data analysis pipeline

To collect, store, and analyze data we need to deep dive into 4 fields that are commonly needed regardless of the type of project and implementation: Collection, Data Lake / Storage, Processing, Analysis and Visualization.

So generally, the steps of a typical data analysis pipeline can be summarized as follows:

An appropriate infrastructure collects (ingests) the data from the field.
The data is stored using an appropriate storage service, optimizing the data access pattern.
Data is then processed by reading the input from the storage service, performing the required operations, and then storing the processed data in another location.
Eventually, the processed information can be visualized using a business intelligence (BI) tool to obtain the dashboard and data presentation for the end-users and the business.

In the following sections, we will discuss each of the steps, with a focus on AWS services you can leverage to seamlessly implement the pipeline steps.

Collection

In order to design the correct data collection system, you need to think about the features of the data and consider the expectations regarding data latency, cost, and durability.

The first aspect to consider is the ingestion frequency, which It is the measure of how often the data will be sent to your collection system. It can also be referred to as the temperature (Hot, Warm, Cold) of the data. The frequency of the input data dictates the kind of infrastructure to design. Transactional data (SQL) are better ingested using tools like AWS Database Migration Service, while real-time and near real-time data streams are the perfect use case for Amazon Kinesis Data Streams and Kinesis Firehose.

Amazon Kinesis Data Streams is a reliable, durable, cost-effective way to collect large amounts of data from the field and from Mobile or web applications. One of the many advantages of Kinesis Data Streams is that you can extend its functionality with custom software to meet your needs exactly. It may store the data it has collected for up to 7 days, and it supports multiple applications producing data for the same stream. You can provide custom software leveraging Lambda Functions.

Amazon Kinesis Data Firehose fully manages many of the manual processes that Kinesis Data Streams requires, and also includes no-code configuration options to automatically deliver the data to other AWS services. It makes it easy to group the data into batches, and make aggregations. Kinesis Data Firehose streams can be configured with consumers, including Kinesis Data Streams, Amazon Amazon S3, Amazon Redshift, and Amazon ElasticSearch.

Cold data, which is generated by applications that can be periodically batch-processed may be efficiently collected using Amazon EMR or AWS Glue.

#data-visualization #data-analysis #data-processing #aws #data-lake

The data analysis pipeline

Collection

medium.com

Data analytics on AWS: The Complete Guide for A Great Start