Nginx Log Analytics with AWS Athena and Cube.js

Sometimes, existing commercial or out-of-the-box open-source tools like Grafana don't fit the requirements for Nginx log analytics. Whether the issue is pricing, privacy, or customization, it is always good to know how to build such a system internally.

In the following tutorial, I'll show you how to build your own Nginx log analytics with Fluentd, Kinesis Data Firehose, Glue, Athena, and Cube.js. This stack also makes it easy to add data from other sources, such as Snowplow events, into the same S3 bucket and merge results in Athena. I'll walk you through the whole pipeline, from data collection to visualization.

Here is the live demo of the final dashboard.

Here’s a sample architecture of the application we’re going to build:

Collecting Nginx Logs

By default, Nginx writes logs in a plain text format like this:

1.1.1.1 - - [09/Apr/2019:09:58:17 +0000] "GET /sign-up HTTP/2.0" 200 9168 "https://example.com/sign-in" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" "-"
1.1.1.1 - - [09/Apr/2019:09:58:17 +0000] "GET /sign-in HTTP/2.0" 200 9168 "https://example.com/sign-up" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" "-"

Although we could parse that, it is much easier to change the Nginx configuration to use the JSON format:

log_format json_combined escape=json '{ "created_at": "$msec", '
            '"remote_addr": "$remote_addr", '
            '"remote_user": "$remote_user", '
            '"request": "$request", '
            '"status": $status, '
            '"bytes_sent": $bytes_sent, '
            '"request_length": $request_length, '
            '"request_time": $request_time, '
            '"http_referrer": "$http_referer", '
            '"http_x_forwarded_for": "$http_x_forwarded_for", '
            '"http_user_agent": "$http_user_agent" }';

access_log /var/log/nginx/access.log json_combined;
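
After changing the configuration, it's worth validating it and reloading Nginx so the new JSON format takes effect. A minimal sketch, assuming Nginx runs directly on the host (adjust accordingly if it runs in Docker):

# Check the configuration for syntax errors, then reload without downtime
$ nginx -t
$ nginx -s reload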

Create an S3 Bucket

Create a new S3 bucket for the log files. Note that logs should be placed in the root of the bucket and that no other data should be stored in that bucket. Also, consider using the same region as your Athena setup, because Athena is not available in all regions.
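
If you prefer the command line over the AWS console, the bucket can also be created with the AWS CLI. A minimal sketch; the bucket name and region below are placeholders to replace with your own:

$ aws s3 mb s3://<YOUR-S3-BUCKET> --region us-east-1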

Define a Table Schema via the Athena Console

Open your Athena console and select the database you'll be using. To create the table, you can use the AWS UI or simply run a CREATE statement in the console.

CREATE EXTERNAL TABLE kinesis_logs_nginx (
  created_at double,
  remote_addr string,
  remote_user string,
  request string,
  status int,
  bytes_sent int,
  request_length int,
  request_time double,
  http_referrer string,
  http_x_forwarded_for string,
  http_user_agent string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://<YOUR-S3-BUCKET>'
TBLPROPERTIES ('has_encrypted_data'='false');

Create a Kinesis Firehose Stream

Open the Kinesis Firehose console and click "Create delivery stream." Enter a name for the stream and set the source to Direct PUT.

Click Next. Set Record format conversion to Enabled and Output format to Apache ORC, and select the database you created as the schema source.

Select S3 as the destination. Choose the S3 bucket you created for the log files and leave S3 Prefix empty.

You can change the other options depending on your load, but we'll continue with the defaults. S3 compression is not available, since the ORC format already includes compression. Now you've created the Kinesis Firehose stream!
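
Before wiring up Fluentd, you can push a single test record into the stream with the AWS CLI to verify delivery and format conversion. This is only a sanity-check sketch: the record must be valid JSON (fields missing from the Glue schema should simply arrive as NULL columns), and the --cli-binary-format flag applies to AWS CLI v2 (omit it on v1):

$ aws firehose put-record \
    --delivery-stream-name <YOUR-KINESIS-STREAM-NAME> \
    --cli-binary-format raw-in-base64-out \
    --record '{"Data":"{\"created_at\":1554803897.0,\"remote_addr\":\"1.1.1.1\",\"request\":\"GET / HTTP/2.0\",\"status\":200}"}'

Keep in mind that Firehose buffers records, so it may take a few minutes before the first ORC file appears in the bucket.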

Fluentd

There are many different ways to install and use Fluentd. We'll stick with the Docker version.

First, create a fluent.conf file and add a new Fluentd source to collect logs:

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

Now we can run a Fluentd server with this configuration. Here's a basic configuration; you can check out other options on Docker Hub.

$ docker run \
  -d \
  -p 24224:24224 \
  -p 24224:24224/udp \
  -v /data:/fluentd/log \
  -v <PATH-TO-FLUENT-CONF>:/fluentd/etc \
  fluent/fluentd:stable \
  -c /fluentd/etc/fluent.conf

This configuration uses the /fluentd/log path for cache files. You can run Fluentd in Docker without mounting the data directory, but in the case of a restart, you can lose all cached logs. Also, you can change the default 24224 port to any other unused port. Now, as we have a running Fluentd server, we can stream Nginx logs to it.
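
To check that the forward input is reachable before pointing Nginx at it, you can send a test event with fluent-cat, a small utility that ships with the Fluentd gem and is available inside the container. A quick sketch; the container name and tag below are placeholders:

$ echo '{"message": "test"}' | docker exec -i <FLUENTD-CONTAINER-NAME> fluent-cat test.connectivity

If the command exits without a connection error, the forward input is reachable.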

Since we run Nginx in Docker, we can start it with the built-in Docker Fluentd log driver:

$ docker run \
  --log-driver=fluentd \
  --log-opt fluentd-address=<FLUENTD-SERVER-ADDRESS> \
  --log-opt tag="{{.Name}}" \
  -v /some/content:/usr/share/nginx/html:ro \
  -d \
  nginx

To send logs to Kinesis Firehose, we'll use the Fluentd Amazon Kinesis output plugin (fluent-plugin-kinesis). Fluentd forwards the parsed log events to Firehose, which converts and stores them as ORC files.
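
The stock fluent/fluentd image doesn't include this plugin, so you'll need an image with the fluent-plugin-kinesis gem installed. A minimal Dockerfile sketch, assuming the Alpine-based image (adjust the package commands for the Debian variant):

FROM fluent/fluentd:stable

# Temporarily switch to root to install build tools and the Kinesis plugin
USER root
RUN apk add --no-cache --virtual .build-deps build-base ruby-dev \
 && gem install fluent-plugin-kinesis \
 && gem sources --clear-all \
 && apk del .build-deps
USER fluent

Build this image and use it in place of fluent/fluentd:stable in the docker run command above.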

Next, match the Nginx logs by tag prefix and parse the JSON:

<filter YOUR-NGINX-TAG.*>
  @type parser
  key_name log
  emit_invalid_record_to_error false
  <parse>
    @type json
  </parse>
</filter>

We’re using the kinesis_firehose output plugin to send parsed logs to Kinesis Firehose:

<match YOUR-NGINX-TAG.*>
  @type kinesis_firehose
  region <YOUR-AWS-REGION>
  delivery_stream_name <YOUR-KINESIS-STREAM-NAME>
  aws_key_id <YOUR-AWS-KEY-ID>
  aws_sec_key <YOUR-AWS-SEC-KEY>
</match>

Athena

Now you can query Nginx logs in Athena with SQL. Let's find some recent errors:

SELECT * FROM "db_name"."table_name" WHERE status > 499 ORDER BY created_at DESC LIMIT 10;
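
Since this is just a regular table from Athena's point of view, you can slice the logs however you like. For example, a query like the following (using the columns defined earlier) shows which requests produce the most errors:

SELECT request, count(*) AS errors
FROM "db_name"."table_name"
WHERE status > 499
GROUP BY request
ORDER BY errors DESC
LIMIT 10;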

Full Scan for Each Request

Now we have logs parsed and delivered to S3 in the ORC format, which is compressed and efficient to query. Kinesis Firehose writes the logs into date- and hour-based prefixes, but since Athena doesn't know about these partitions, every query still scans all files. This is a problem, because the more logs you store, the more data gets scanned per request. It is slow and also pricey, because Athena pricing depends on the volume of data scanned.

To solve this, we'll use AWS Glue Crawler, which gathers partition data from S3 and writes it to the Glue Metastore. Once data is partitioned, Athena will only scan data in selected partitions. It makes querying much more efficient in terms of time and cost.

Setting Up an AWS Glue Crawler

An AWS Glue crawler can scan the data in the bucket and create a partitioned table for that data.

Create a Glue Crawler and add the bucket you use to store logs from Kinesis. You can add multiple buckets to be scanned on each run, and the crawler will create separate tables for each bucket. Make sure to schedule this crawler to run periodically depending on your needs. We usually run a single crawler for all buckets every hour. Also, it's easier to use a separate database for all crawled tables.
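
If you script your infrastructure, the same crawler can be created with the AWS CLI. A rough sketch, assuming you already have an IAM role with Glue and S3 permissions; the names, role, and hourly schedule below are placeholders:

$ aws glue create-crawler \
    --name nginx-logs-crawler \
    --role <YOUR-GLUE-ROLE-ARN> \
    --database-name crawled_logs \
    --targets '{"S3Targets": [{"Path": "s3://<YOUR-S3-BUCKET>/"}]}' \
    --schedule 'cron(0 * * * ? *)'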

Partitioned Tables

Open your Athena console and choose the database you selected in the crawler configuration. After the first run of the crawler, you should see a table named after the S3 bucket where the log files are located. You can then query specific hours by filtering on the partition columns.

SELECT * FROM "default"."part_demo_kinesis_bucket"
WHERE (
  partition_0 = '2019' AND
  partition_1 = '04' AND
  partition_2 = '08' AND
  partition_3 = '06'
);

This query will select all events that came from 6 AM to 7 AM on April 8, 2019.

Did that help? Let's run the same request without the partition filter to compare.

You can see that the query took 3.59 seconds to run and processed 244.34 megabytes of data. Now let's use the partition filter:

This query is a bit faster and, more importantly, scanned only 1.23 megabytes of data. On larger datasets, the difference grows dramatically.

Building a Dashboard with Cube.js

We will use Cube.js to query Athena and build a dashboard. It has plenty of powerful features, but the main reasons to use it in this case are its automatic handling of partitions and its pre-aggregations.

Cube.js can dynamically generate SQL, taking care of partitioning. It uses a data schema, written in JavaScript, to generate SQL, and we can put instructions on how to filter partitions into that data schema.

Let’s create a new Cube.js app with a serverless deployment option. Since we’re using Athena, serverless is the easiest way to deploy. If you prefer other deployment environments, such as Docker or Heroku, you can check the documentation on deployment options here.

$ npm install -g cubejs-cli
$ cubejs create nginx-log-analytics -t serverless -d athena

Cube.js uses environment variables for database credentials. On app creation, the CLI generates a .env file with placeholders in the project directory. Fill it in with your Athena credentials.
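
For reference, the Athena driver needs AWS credentials, a region, and an S3 location for Athena query results. A sketch of what a filled-in .env might look like; the exact variable names can vary between Cube.js versions, so follow the placeholders the CLI generated:

CUBEJS_DB_TYPE=athena
CUBEJS_AWS_KEY=<YOUR-AWS-KEY-ID>
CUBEJS_AWS_SECRET=<YOUR-AWS-SECRET-KEY>
CUBEJS_AWS_REGION=us-east-1
CUBEJS_AWS_S3_OUTPUT_LOCATION=s3://<YOUR-ATHENA-QUERY-RESULTS-BUCKET>/
CUBEJS_API_SECRET=<YOUR-API-SECRET>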

Now, we can create a data schema file, where we'll define how Cube.js should query the Athena logs table, as well as the measures and dimensions for our dashboard. If you are just getting started with Cube.js, I recommend checking this or that tutorial to learn more about data schemas and how Cube.js generates SQL.

In the schema folder, create the file Logs.js with the following content:

const partitionFilter = (from, to) => `
    date(from_iso8601_timestamp(${from})) <= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d') AND
    date(from_iso8601_timestamp(${to})) >= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d')
`

cube(`Logs`, {
  sql: `
    select * from part_demo_kinesis_bucket
    WHERE ${FILTER_PARAMS.Logs.createdAt.filter(partitionFilter)}
  `,

  measures: {
    count: {
      type: `count`,
    },

    errorCount: {
      type: `count`,
      filters: [
        { sql: `${CUBE.isError} = 'Yes'` }
      ]
    },

    errorRate: {
      type: `number`,
      sql: `100.0 * ${errorCount} / ${count}`,
      format: `percent`
    }
  },

  dimensions: {
    status: {
      sql: `status`,
      type: `number`
    },

    isError: {
      type: `string`,
      case: {
        when: [{
          sql: `${CUBE}.status >= 400`, label: `Yes`
        }],
        else: { label: `No` }
      }
    },

    createdAt: {
      sql: `from_unixtime(created_at)`,
      type: `time`
    }
  }
});

In the top-level sql property of the Logs cube, we use the FILTER_PARAMS feature of Cube.js to dynamically generate SQL based on the passed filter parameters.
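
To make this concrete, when a dashboard query filters Logs.createdAt to, say, the first week of January 2019, the rendered WHERE clause should look roughly like the snippet below. This is an illustrative sketch rather than literal Cube.js output; in practice the from/to values are typically bound as query parameters:

select * from part_demo_kinesis_bucket
WHERE
  date(from_iso8601_timestamp('2019-01-01T00:00:00.000')) <= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d') AND
  date(from_iso8601_timestamp('2019-01-07T23:59:59.999')) >= date_parse(partition_0 || partition_1 || partition_2, '%Y%m%d')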

We also define the measures and dimensions we'll use in our dashboard. One last thing to do before building a frontend is to set up pre-aggregations. The Cube.js pre-aggregation engine builds a layer of aggregated data in your database at runtime and keeps it up to date. It can significantly speed up performance and, in the case of Athena, reduce billing as well. This guide covers using pre-aggregations in more detail.

To add it to our schema, add the following block to the end of the cube definition.

preAggregations: {
  main: {
    type: `rollup`,
    measureReferences: [count, errorCount],
    dimensionReferences: [isError, status],
    timeDimensionReference: createdAt,
    granularity: `day`,
    partitionGranularity: `month`,
    refreshKey: {
      sql: FILTER_PARAMS.Logs.createdAt.filter((from, to) =>
        `select
           CASE WHEN from_iso8601_timestamp(${to}) + interval '3' day > now()
           THEN date_trunc('hour', now())
           END`
      )
    }
  }
}

We're pre-aggregating all the measures and dimensions we'll use, and we're also partitioning this pre-aggregation by month. Partitioning pre-aggregations can dramatically speed up background refreshes.

Now, we are ready to build our frontend dashboard.

Cube.js provides a REST API, a JavaScript client, and bindings for popular frameworks such as React and Vue. The clients are visualization-agnostic and take care of API calls and data formatting, letting developers use any visualization library.

The Cube.js server accepts a query in a JSON format with measures and dimensions. It then generates and executes SQL against Athena, and sends the result back. For example, to load the count of requests with an error over time by day, we can use the following request:

{
  "measures": ["Logs.errorCount"],
  "timeDimensions": [
    {
      "dimension": "Logs.createdAt",
      "dateRange": ["2019-01-01", "2019-01-07"],
      "granularity": "day"
    }
  ]
}
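
Since this is just a REST call, you can also try it with curl before writing any frontend code. A sketch assuming the default local API URL and an API token signed with your CUBEJS_API_SECRET:

$ curl \
    -H "Authorization: <YOUR-CUBEJS-API-TOKEN>" \
    -G --data-urlencode 'query={"measures":["Logs.errorCount"],"timeDimensions":[{"dimension":"Logs.createdAt","dateRange":["2019-01-01","2019-01-07"],"granularity":"day"}]}' \
    http://localhost:4000/cubejs-api/v1/load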

You can install the Cube.js Javascript Client and React binding with NPM.

$ npm i --save @cubejs-client/core @cubejs-client/react

Then import cubejs and the QueryRenderer component, and use them to fetch the data. In the example below, we use Recharts to visualize the data.

import React from 'react';
import { LineChart, Line, XAxis, YAxis } from 'recharts';
import cubejs from '@cubejs-client/core';
import { QueryRenderer } from '@cubejs-client/react';

const cubejsApi = cubejs(
  'YOUR-CUBEJS-API-TOKEN',
  { apiUrl: 'http://localhost:4000/cubejs-api/v1' },
);

export default () => {
  return (
    <QueryRenderer
      query={{
        measures: ['Logs.errorCount'],
        timeDimensions: [{
          dimension: 'Logs.createdAt',
          dateRange: ['2019-01-01', '2019-01-07'],
          granularity: 'day'
        }]
      }}
      cubejsApi={cubejsApi}
      render={({ resultSet }) => {
        if (!resultSet) {
          return 'Loading...';
        }

        return (
          <LineChart data={resultSet.rawData()}>
            <XAxis dataKey="Logs.createdAt"/>
            <YAxis/>
            <Line type="monotone" dataKey="Logs.errorCount" stroke="#8884d8"/>
          </LineChart>
        );
      }}
    />
  );
};

This tutorial goes into great detail on how to build dashboards with Cube.js.

You can find a CodeSandbox with a demo dashboard built with React and Cube.js below.

Python For Data Analysis | Build a Data Analysis Library from Scratch | Learn Python in 2019

Immerse yourself in a long, comprehensive project that teaches advanced Python concepts to build an entire library

You’ll learn

  • How to build a Python library similar to pandas
  • How to complete a large, comprehensive project
  • Test-driven development with pytest
  • Environment creation
  • Advanced Python topics such as special methods and property decorators
  • A fully-functioning library that you can use for data analysis

Data Science vs Data Analytics vs Big Data

When we talk about data processing, Data Science, Big Data, and Data Analytics are the terms that come to mind, and there has always been confusion between them. In this article on Data Science vs Big Data vs Data Analytics, I will explain the similarities and differences between them.

We live in a data-driven world. In fact, the amount of digital data that exists is growing at a rapid rate, doubling every two years, and changing the way we live. Now that Hadoop and other frameworks have resolved the problem of storage, the main focus has shifted to processing this huge amount of data. When we talk about data processing, Data Science, Big Data, and Data Analytics are the terms that come to mind, and there has always been confusion between them.

In this article on Data Science vs Data Analytics vs Big Data, I will be covering the following topics in order to make you understand the similarities and differences between them.
  • Introduction to Data Science, Big Data & Data Analytics
  • What does a Data Scientist, Big Data Professional & Data Analyst do?
  • Skill-set required to become a Data Scientist, Big Data Professional & Data Analyst
  • What is the Salary Prospect?
  • Real-time Use-case

Introduction to Data Science, Big Data, & Data Analytics

Let’s begin by understanding the terms Data Science vs Big Data vs Data Analytics.

What Is Data Science?

Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data.

It involves solving problems in various ways to arrive at a solution; it also involves designing and constructing new processes for data modeling and production using prototypes, algorithms, predictive models, and custom analyses.

What is Big Data?

Big Data refers to the large volumes of data pouring in from various data sources in different formats. It can be analyzed for insights that lead to better decisions and strategic business moves.

What is Data Analytics?

Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information. It is all about discovering useful information from the data to support decision-making. This process involves inspecting, cleansing, transforming & modeling data.

What Does Data Scientist, Big Data Professional & Data Analyst Do?

What does a Data Scientist do?

Data Scientists perform an exploratory analysis to discover insights from the data. They also use various advanced machine learning algorithms to identify the occurrence of a particular event in the future. This involves identifying hidden patterns, unknown correlations, market trends and other useful business information.

Roles of Data Scientist

What do Big Data Professionals do?

The responsibilities of a big data professional lie in dealing with huge amounts of heterogeneous data gathered from various sources and arriving at high velocity.

Roles of Big Data Professional

Big data professionals describe the structure and behavior of a big data solution and how it can be delivered using big data technologies such as Hadoop, Spark, and Kafka, based on requirements.

What does a Data Analyst do?

Data analysts translate numbers into plain English. Every business collects data, such as sales figures, market research, logistics, or transportation costs. A data analyst's job is to take that data and use it to help companies make better business decisions.

Roles of Data Analyst

Skill-Set Required To Become Data Scientist, Big Data Professional, & Data Analyst

What Is The Salary Prospect?

The figure below shows the average salary structure of Data Scientists, Big Data Specialists, and Data Analysts.

A Scenario Illustrating The Use Of Data Science vs Big Data vs Data Analytics

Now, let's try to understand how we can garner benefits by combining all three of them.

Let’s take an example of Netflix and see how they join forces in achieving the goal.

First, let's understand the role of the Big Data Professional in the Netflix example.

Netflix generates a huge amount of unstructured data in the form of text, audio, video files, and more. If we try to process this dark (unstructured) data using the traditional approach, it becomes a complicated task.

Approach in Netflix

Traditional Data Processing

Hence a Big Data Professional designs and creates an environment using Big Data tools to ease the processing of Netflix Data.

Big Data approach to process Netflix data

Now, let's see how a Data Scientist optimizes the Netflix streaming experience.

Role of Data Scientist in Optimizing the Netflix streaming experience

1. Understanding the impact of QoE on user behavior

User behavior refers to the way a user interacts with the Netflix service, and data scientists use the data to both understand and predict behavior. For example, how would a change to the Netflix product affect the number of hours that members watch? To improve the streaming experience, Data Scientists look at QoE metrics that are likely to have an impact on user behavior. One metric of interest is the rebuffer rate, which is a measure of how often playback is temporarily interrupted. Another metric is bitrate, which refers to the quality of the picture that is served/seen; a very low bitrate corresponds to a fuzzy picture.

2. Improving the streaming experience

How do Data Scientists use data to provide the best user experience once a member hits “play” on Netflix?

One approach is to look at the algorithms that run in real-time or near real-time once playback has started, which determine what bitrate should be served, what server to download that content from, etc.

For example, a member with a high-bandwidth connection on a home network could have very different expectations and experience compared to a member with low bandwidth on a mobile device on a cellular network.

By determining all these factors one can improve the streaming experience.

3. Optimize content caching

A set of big data problems also exists on the content delivery side.

The key idea here is to locate the content closer (in terms of network hops) to Netflix members to provide a great experience. By viewing the behavior of the members being served and the experience, one can optimize the decisions around content caching.

4. Improving content quality

Another approach to improving user experience involves looking at the quality of content, i.e. the video, audio, subtitles, closed captions, etc. that are part of the movie or show. Netflix receives content from the studios in the form of digital assets that are then encoded and quality checked before they go live on the content servers.

In addition to the internal quality checks, data scientists also receive feedback from members when they discover issues while viewing.

By combining member feedback with intrinsic factors related to viewing behavior, they build models to predict whether a particular piece of content has a quality issue. Machine learning models, along with natural language processing (NLP) and text mining techniques, can be used to improve the quality of content that goes live and to use the information provided by Netflix users to close the loop on quality, replacing content that does not meet users' expectations.

So this is how a Data Scientist optimizes the Netflix streaming experience.

Now let's understand how Data Analytics is used to drive Netflix's success.

Role of Data Analyst in Netflix

The figure above shows the different types of users who watch videos on Netflix. Each of them has their own choices and preferences.

So what does a Data Analyst do?

A Data Analyst creates a user stream based on the preferences of users. For example, if user 1 and user 2 have the same preference or choice of video, then the data analyst creates a user stream for those choices. The data analyst also:
  • Orders the Netflix collection for each member profile in a personalized way. We know that the same genre row for each member has an entirely different selection of videos.
  • Picks out the top personalized recommendations from the entire catalog, focusing on the titles that are top in ranking.
  • Surfaces trending videos by capturing all events and user activities on Netflix.
  • Sorts the recently watched titles and estimates whether the member will continue to watch, rewatch, or stop watching.

I hope you have understood the differences and similarities between Data Science, Big Data, and Data Analytics.

Why is Python used so widely in big data analysis despite it being slow?

I have noticed that Python is used a lot in big data.

People call C functions from Python, then process the results further in Python, then call other libraries, possibly again from Python, that also operate on gigantic data arrays.

Isn't this an extremely inefficient way of doing things? Python is much slower than C++. How can it make sense, performance-wise, to use Python in situations where large amounts of data are processed?

One company asked me the question: "How do you bind a C function to Python that computes a 1 GB floating-point array, and then compute the total of all the numbers in Python?" They ask this from a position where they assume that using Python is totally normal, and that one should do such things as computing a 1 GB floating-point array in C, then copying it into a gigantic Python list, then computing the total in Python. But doesn't this question itself assume that things are done extremely inefficiently? They are just indoctrinated and think that the things they do are normal when they are far from normal.

So why is Python used so widely, as opposed to using C++, for example? Is this because many people feel that Python is much easier and C++ is too hard?