Building an Open Source Mixpanel Alternative. Part 1: Collecting and Displaying Events

This is the first part of a tutorial series on building an analytical web application with Cube.js. It expects the reader to be familiar with JavaScript, Node.js, and React, and to have basic knowledge of SQL. The final source code is available here and the live demo is here. The example app is serverless and running on AWS Lambda. It displays data about its own usage.

There is a category of analytics tools like Mixpanel or Amplitude, which are good at working with events data. They are ideal for measuring product or engagement metrics, such as activation funnels or retention. They are also very useful for measuring A/B tests.

Although all these tools do the job, they are proprietary and cloud-based. That could be a problem when privacy is a concern, or when you want to customize how funnels or retention work under the hood. While traditional BI tools, like Tableau or Power BI, could potentially be used to run the same analysis, they cannot offer the same level of user experience. The problem is that they are designed to be general business intelligence tools, not tools built specifically for funnels, retention, A/B tests, etc.

With recent advancements in frontend development, it has become possible to rapidly develop complex user interfaces. Things that took a week to build five years ago can be built in an afternoon nowadays. On the backend and infrastructure side, cloud-based MPP databases, such as BigQuery and Athena, are dramatically changing the landscape. The ELT approach, where data is transformed inside the database, is getting more and more popular, replacing traditional ETL. Serverless architecture makes it possible to easily deploy and scale applications.

All of these made it possible to build internal alternatives to established services like Mixpanel, Amplitude, or Kissmetrics. In this series of tutorials, we’re going to build a full-featured open-source event analytics system.

It will include the following features:

  • Data collection;
  • Dashboarding;
  • Ad hoc analysis with query builder;
  • Funnel analysis;
  • Retention analysis;
  • Serverless deployment;
  • A/B tests;
  • Real-time events monitoring.

The diagram below shows the architecture of our application:

In the first part of our tutorial, we'll focus on how to collect and store data, and briefly cover how to build a simple chart based on this data. The following parts focus more on querying the data and building various analytics reporting features.

Collecting Events


We’re going to use the Snowplow CloudFront Collector and JavaScript Tracker. We need to upload a tracking pixel to Amazon CloudFront CDN. The Snowplow Tracker sends data to the collector by making a GET request for the pixel and passing data as a query string parameter. The CloudFront Collector uses CloudFront logging to record the request (including the query string) to an S3 bucket.

Next, we need to install the JavaScript Tracker. Here is the full guide.

But, in short, it is similar to Google Analytics’s tracking code or Mixpanel’s, so we just need to embed it into our HTML page.

<script type="text/javascript">
  ;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
  p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
  };p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
  n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.10.2/sp.js","snowplow"));

  window.snowplow('newTracker', 'cf', '<YOUR_CLOUDFRONT_DISTRIBUTION_URL>', { post: false });
</script>

Here you can find how it is embedded into our example application.

Once we have our data (the CloudFront logs) in the S3 bucket, we can query it with Athena. All we need to do is create a table for the CloudFront logs.

Copy and paste the following DDL statement into the Athena console. Modify the LOCATION to point to the S3 bucket that stores your logs.

CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
  date DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  requestip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  useragent STRING,
  querystring STRING,
  cookie STRING,
  resulttype STRING,
  requestid STRING,
  hostheader STRING,
  requestprotocol STRING,
  requestbytes BIGINT,
  timetaken FLOAT,
  xforwardedfor STRING,
  sslprotocol STRING,
  sslcipher STRING,
  responseresulttype STRING,
  httpversion STRING,
  filestatus STRING,
  encryptedfields INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://CloudFront_bucket_name/AWSLogs/Account_ID/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )

Now we are ready to connect Cube.js to Athena and start building our first dashboard.

Building Our First Chart


First, install Cube.js CLI. It is used for various Cube.js workflows.

$ npm install -g cubejs-cli

Next, create a new Cube.js service by running the following command. Note that we are specifying Athena as the database here (-d athena) and the template as serverless (-t serverless). Cube.js supports different configurations, but for this tutorial, we will use the serverless one.

$ cubejs create event-analytics-backend -d athena -t serverless

Once run, the create command will create a new project directory that contains the scaffolding for your new Cube.js project. This includes all the files necessary to spin up the Cube.js backend, example frontend code for displaying the results of Cube.js queries in a React app, and some example schema files to highlight the format of the Cube.js Data Schema layer.

The .env file in this project directory contains placeholders for the relevant database credentials. For Athena, you'll need to specify the AWS access and secret keys with the access necessary to run Athena queries, and the target AWS region and S3 output location where query results are stored.

CUBEJS_DB_TYPE=athena
CUBEJS_AWS_KEY=<YOUR ATHENA AWS KEY HERE>
CUBEJS_AWS_SECRET=<YOUR ATHENA SECRET KEY HERE>
CUBEJS_AWS_REGION=<AWS REGION STRING, e.g. us-east-1>
# You can find the Athena S3 Output location here: https://docs.aws.amazon.com/athena/latest/ug/querying.html
CUBEJS_AWS_S3_OUTPUT_LOCATION=<S3 OUTPUT LOCATION>

Now, let’s create a basic Cube.js Schema for our events model. Cube.js uses Data Schema to generate and execute SQL; you can read more about it here.

Create a schema/Events.js file with the following content.

const regexp = (key) => `&${key}=([^&]+)`;
const parameters = {
  event: regexp('e'),
  event_id: regexp('eid'),
  page_title: regexp('page')
}

cube(`Events`, {
  sql:
    `SELECT
      from_iso8601_timestamp(to_iso8601(date) || 'T' || "time") as time,
      ${Object.keys(parameters).map(
        (key) => `url_decode(url_decode(regexp_extract(querystring, '${parameters[key]}', 1))) as ${key}`
      ).join(", ")}
    FROM cloudfront_logs
    WHERE length(querystring) > 1
    `,

  measures: {
    pageView: {
      type: `count`,
      filters: [
        { sql: `${CUBE}.event = 'pv'` }
      ]
    },
  },

  dimensions: {
    pageTitle: {
      sql: `page_title`,
      type: `string`
    }
  }
});

In the schema file, we create an Events cube. It is going to contain all the information about our events. In the base SQL statement, we’re extracting values from the query string sent by the tracker using a regular expression (regexp_extract). Cube.js is good at running transformations like this, and it can also materialize some of them for better performance. We’ll talk about that in the next parts of our tutorial. The sketch below illustrates how this extraction behaves on a sample query string.
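To make the extraction logic more tangible, here is a small plain Node.js sketch of how the regexp helper from the schema behaves. The query string and its values are hypothetical; real values come from the CloudFront logs, where the parameters end up URL-encoded twice (once by the tracker and once by CloudFront logging), which is why the schema applies url_decode twice.

// A minimal sketch of the extraction the schema performs in SQL, in plain Node.js.
// The query string below is a hypothetical example of a tracker pixel request
// as it would appear in the CloudFront log's querystring column.
const regexp = (key) => `&${key}=([^&]+)`;

const querystring = 'stm=1549983407875&e=pv&page=Home%2520Page&eid=ab21-42cd';

// Athena's regexp_extract(querystring, pattern, 1) returns the first capture
// group; String.prototype.match gives us the same group in JavaScript.
// Values are decoded twice because both the tracker and CloudFront logging
// URL-encode them.
const extract = (key) => {
  const match = querystring.match(new RegExp(regexp(key)));
  return match ? decodeURIComponent(decodeURIComponent(match[1])) : null;
};

console.log(extract('e'));    // 'pv'        -> the event type (page view)
console.log(extract('eid'));  // 'ab21-42cd' -> the event id
console.log(extract('page')); // 'Home Page' -> the page title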

With this schema in place, we can run our dev server and build the first chart.

Spin up the development server by running the following command.

$ npm run dev

Visit http://localhost:4000; it should open a CodeSandbox with an example. Change the renderChart function and the query variable to the following.

const renderChart = resultSet => (
  <Chart height={400} data={resultSet.chartPivot()} forceFit>
    <Coord type="theta" radius={0.75} />
    <Axis name="Events.pageView" />
    <Legend position="right" name="category" />
    <Tooltip showTitle={false} />
    <Geom type="intervalStack" position="Events.pageView" color="x" />
  </Chart>
);

const query = {
  measures: ["Events.pageView"],
  dimensions: ["Events.pageTitle"]
};

Now, you should be able to see a pie chart, depending on what data you have in your S3 bucket.
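If you are wiring this up yourself instead of using the generated CodeSandbox, the sketch below shows roughly how query and renderChart fit together with the Cube.js React client. The API URL and token are placeholders (the generated dashboard fills in the real values), and the chart components are assumed to come from the bizcharts library used by the example.

import React from "react";
import cubejs from "@cubejs-client/core";
import { QueryRenderer } from "@cubejs-client/react";
// Chart components come from bizcharts, the charting library used by the template
import { Chart, Axis, Tooltip, Geom, Coord, Legend } from "bizcharts";

// Placeholder credentials; the generated sandbox pre-fills the real values
const cubejsApi = cubejs("CUBEJS-API-TOKEN", {
  apiUrl: "http://localhost:4000/cubejs-api/v1"
});

// `query` and `renderChart` are the variables defined above
const PageViewsChart = () => (
  <QueryRenderer
    query={query}
    cubejsApi={cubejsApi}
    render={({ resultSet }) =>
      resultSet ? renderChart(resultSet) : <div>Loading...</div>
    }
  />
);

export default PageViewsChart;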

In the next part, we’ll walk through how to build conversion funnels. In the third part, we will build a dashboard and a dynamic query builder, like the ones in Mixpanel or Amplitude. A later part will cover retention analysis. In the final part, we will discuss how to deploy the whole application in serverless mode to AWS Lambda.

You can check out the full source code of the application here.

And the live demo is available here.

Top 7 Most Popular Node.js Frameworks You Should Know

Node.js is an open-source, cross-platform runtime environment that allows developers to run JavaScript outside of a browser.

One of the main advantages of Node is that it enables developers to use JavaScript on both the front-end and the back-end of an application. This not only makes the source code of any app cleaner and more consistent, but it significantly speeds up app development too, as developers only need to use one language.

Node is fast, scalable, and easy to get started with. Its default package manager is npm, which means it also sports the largest ecosystem of open-source libraries. Node is used by companies such as NASA, Uber, Netflix, and Walmart.

But Node doesn't come alone. It comes with a plethora of frameworks. A Node framework can be pictured as the external scaffolding that you can build your app in. These frameworks are built on top of Node and extend the technology's functionality, mostly by making apps easier to prototype and develop, while also making them faster and more scalable.

Below are 7 of the most popular Node frameworks at this point in time (ranked from high to low by GitHub stars).

Express

With over 43,000 GitHub stars, Express is the most popular Node framework. It brands itself as a fast, unopinionated, and minimalist framework. Express acts as middleware: it helps set up and configure routes to send and receive requests between the front-end and the database of an app.

Express provides lightweight, powerful tools for HTTP servers. It's a great framework for single-page apps, websites, hybrids, or public HTTP APIs. It supports over fourteen different template engines, so developers aren't locked into any specific one.
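As a quick illustration of the middleware-and-routes model described above, here is a minimal, hypothetical Express server (assuming express has been installed with npm):

const express = require('express');
const app = express();

// Middleware: runs for every incoming request before the route handlers
app.use((req, res, next) => {
  console.log(`${req.method} ${req.url}`);
  next();
});

// Route: responds to GET requests on the root path
app.get('/', (req, res) => {
  res.send('Hello from Express');
});

app.listen(3000, () => console.log('Listening on port 3000'));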

Meteor

Meteor is a full-stack JavaScript platform. It allows developers to build real-time web apps, i.e. apps where code changes are pushed to all browsers and devices in real time. Additionally, the server sends data over the wire instead of HTML; the client renders it.

The project has over 41,000 GitHub stars and is built to power large projects. Meteor is used by companies such as Mazda, Honeywell, Qualcomm, and IKEA. It has excellent documentation and a strong community behind it.

Koa

Koa is built by the same team that built Express. It uses modern JavaScript features such as async functions, which let developers avoid callbacks and give them more control over error handling. Koa ships with no middleware in its core, which means developers have more control over configuration, but also that conventional Express-style middleware (functions with the (req, res, next) signature) won't work with Koa.

Koa already has over 26,000 GitHub stars. The Express developers built Koa because they wanted a lighter framework that was more expressive and more robust than Express. You can find out more about the differences between Koa and Express here.
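For comparison, here is a minimal, hypothetical Koa app; note the single ctx object and async middleware in place of the (req, res, next) callback style (assuming koa has been installed with npm):

const Koa = require('koa');
const app = new Koa();

// Async middleware: awaits downstream handlers instead of using callbacks
app.use(async (ctx, next) => {
  const start = Date.now();
  await next();
  console.log(`${ctx.method} ${ctx.url} - ${Date.now() - start}ms`);
});

// Final handler: sets the response body on the context object
app.use(async (ctx) => {
  ctx.body = 'Hello from Koa';
});

app.listen(3000);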

Sails

Sails is a real-time MVC framework for Node that's built on Express. It supports auto-generated REST APIs and comes with easy WebSocket integration.

The project has over 20,000 stars on GitHub and is compatible with almost all databases (MySQL, MongoDB, PostgreSQL, Redis). It's also compatible with most front-end technologies (Angular, iOS, Android, React, and even Windows Phone).

Nest

Nest has over 15,000 GitHub stars. It uses progressive JavaScript and is built with TypeScript, which means it comes with strong typing. It combines elements of object-oriented programming, functional programming, and functional reactive programming.

Nest is packaged in such a way that it serves as a complete development kit for writing enterprise-level apps. The framework uses Express under the hood but is compatible with a wide range of other libraries.

LoopBack

LoopBack is a framework that allows developers to quickly create REST APIs. It has an easy-to-use CLI wizard and lets developers create models based on their schema or dynamically. It also has a built-in API explorer.

LoopBack has over 12,000 GitHub stars and is used by companies such as GoDaddy, Symantec, and Bank of America. It's compatible with many REST services and a wide variety of databases (MongoDB, Oracle, MySQL, PostgreSQL).

Hapi

Similar to Express, hapi serves data by mediating between the server side and the client side. As such, it can serve as a substitute for Express. Hapi allows developers to focus on writing reusable app logic in a modular and prescriptive fashion.

The project has over 11,000 GitHub stars. It has built-in support for input validation, caching, authentication, and more. Hapi was originally developed to handle all of Walmart's mobile traffic during Black Friday.

20. Node.js Lessons. Data Streams in Node.js, fs.ReadStream

Hey all! Our topic for today is data streams in Node.js. We will try to cover this subject in detail because, on the one hand, streams are something that everyday browser-side JavaScript development lacks, and on the other hand, knowing and understanding how streams work is essential for solid server-side development, since a stream is a universal way of working with data sources.

We can define two general stream types. The first one is

stream.Readable

It is a built-in class providing streams for reading. This type is generally never used directly, while its descendants are quite popular: in particular, we use fs.ReadStream to read from a file, and to read a visitor’s request there is a special object we already know as req, the first argument of a request handler.

stream.Writable

It is the universal interface for writing. stream.Writable itself is rarely used directly, but its descendants, such as fs.WriteStream and res, are quite common.

There are some other stream types, but the most popular are these two and their variations.

The best way to understand streams is to see how they work in practice. So, right now we’ll start by using fs.ReadStream to read a file. Let us create a file fs.js:

var fs = require('fs');
 
// fs.ReadStream inherits from stream.Readable
var stream = new fs.ReadStream(__filename);
 
stream.on('readable', function() {
    var data = stream.read();
    console.log(data);
});
 
stream.on('end', function() {
    console.log("THE END");
});

So, we require the fs module and create a stream:

var fs = require('fs');
 
var stream = new fs.ReadStream(__filename);

A stream is a JavaScript object that receives information about the resource it works with – in our case, the path to a file (__filename) – and knows how to operate on that resource. fs.ReadStream implements the standard reading interface described in the stream.Readable class. Let us have a detailed look at it.

When a stream object (new stream.Readable) is created, it connects to the data source, which is a file in our case, and starts reading from it. Once it has read something, it emits the readable event. This event means that some data has been read and now sits in an internal stream buffer, from which it can be retrieved with a call to read(). We can then do something with the data and wait for the next readable event, and the cycle repeats.

When the data source runs out (though some sources never run out – for example, a random data generator), the end event fires, meaning there will be no more data; in our case the file size is limited, so this will eventually happen. Moreover, we can call the destroy() method at any point while working with the stream. This method means we no longer need the stream: it can be closed, along with the underlying data source, and everything cleaned up.
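To make this concrete, here is a small illustrative sketch of stopping a stream early with destroy(), reading just the first chunk of the current file:

var fs = require('fs');

var stream = new fs.ReadStream(__filename);

stream.on('readable', function() {
    var data = stream.read();
    if (data) {
        console.log("first chunk length: " + data.length);
        // We only need the first chunk: release the file and clean up.
        // A destroyed stream emits 'close' but never 'end'.
        stream.destroy();
    }
});

stream.on('close', function() {
    console.log("stream closed");
});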

So, let us return to the original code. Here we create a ReadStream, and it immediately wants to open the file:

var stream = new fs.ReadStream(__filename);

but opening does not actually happen on this same line, because all input/output operations are performed through libUV. libUV is structured so that all I/O handlers run on the next event loop iteration, i.e. after the current JavaScript has finished executing. This means we can safely attach all our handlers, knowing that they will be installed before the first data fragment is read. Launch fs.js.

Look at what appears in the console. The first event was readable, and it output the data. Right now it is an ordinary buffer, but we can convert it to a string by specifying the encoding directly when the stream is opened.

var stream = new fs.ReadStream(__filename, {encoding: 'utf-8'});

Thus, the conversion will happen automatically. When the file ends, the end event outputs THE END in the console. Here the file ended almost immediately because it is small. Let us modify our example a little bit by switching to a file big.html located in the current directory. You can download this HTML file from our repository together with the other lesson materials.

Launch it. The file big.html is big, so the readable event is triggered several times, and each time we receive another data fragment as a buffer. So, let us log its length:

var fs = require('fs');
 
// fs.ReadStream inherits from stream.Readable
var stream = new fs.ReadStream("big.html");
 
stream.on('readable', function() {
    var data = stream.read();
    if (data){
        console.log(data.length);
    }
    else {
        console.log('data is null')
    }
});
 
stream.on('end', function() {
    console.log("THE END");
});

Launch it. These numbers are the lengths of the file fragments that were read. When a stream opens a file, it reads only a part of it, not the whole file, and places that part into its internal buffer. The maximum size is 64 KB by default. Until we call stream.read(), it won’t read any further. Once we’ve retrieved the data, the internal buffer is cleared and is ready to receive the next chunk, and so on; the last chunk here is 60,959 bytes long. This example vividly demonstrates the key advantages of using streams. They save memory: whatever the size of the big file, we handle only a small part of it at any moment. The second, less obvious advantage is the versatility of the interface. Here we use a ReadStream that reads from a file, but we can replace it at any time with any stream for our resource:

var stream = new OurStream("our resource");

It won’t require any change to the rest of the code, because streams are, first of all, an interface. So if, in theory, our stream implements all the required events and methods – in particular, if it inherits from stream.Readable – everything should work. Of course, that is only true as long as we don’t rely on any special abilities that only file streams have. A minimal sketch of such a custom stream is shown below.
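For illustration, here is a minimal sketch of what such a custom stream could look like. OurStream is a hypothetical class; the only requirement is that it inherit from stream.Readable and implement _read(), after which the consuming code from the earlier examples works unchanged.

var stream = require('stream');
var util = require('util');

// A hypothetical readable stream that emits its "resource" string once
function OurStream(resource) {
    stream.Readable.call(this);
    this.resource = resource;
    this.sent = false;
}
util.inherits(OurStream, stream.Readable);

// _read() is called whenever the consumer asks for data;
// push() fills the internal buffer, push(null) signals the end of the stream
OurStream.prototype._read = function() {
    if (!this.sent) {
        this.push(Buffer.from(this.resource));
        this.sent = true;
    } else {
        this.push(null);
    }
};

var ourStream = new OurStream("our resource");

ourStream.on('readable', function() {
    var data = ourStream.read();
    if (data) {
        console.log(data.toString());
    }
});

ourStream.on('end', function() {
    console.log("THE END");
});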

Beyond the standard interface, fs.ReadStream has a few extra events of its own (the lesson’s diagram shows them for fs.ReadStream, with the new events highlighted in red). The first is the opening of the file, and the last is its closure. Note that if a file is read all the way to the end, the end event occurs, followed by close. If a file is not read completely – for instance, because of an error or because destroy() was called – there will be no end, since the file was never finished. But the close event is always guaranteed when the file is closed.
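For example, listening to these file-specific events alongside the standard ones might look like this (an illustrative sketch):

var fs = require('fs');

var stream = new fs.ReadStream(__filename);

// 'open' is specific to file streams: it reports the file descriptor
stream.on('open', function(fd) {
    console.log("file opened, descriptor: " + fd);
});

stream.on('readable', function() {
    stream.read();
});

stream.on('end', function() {
    console.log("THE END");
});

// 'close' always fires once the file is closed,
// whether reading finished normally or the stream was destroyed
stream.on('close', function() {
    console.log("file closed");
});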

Finally, the last but not least detail here is error handling. Let us see what happens if the file does not exist.

var stream = new fs.ReadStream("noFile.html");

Let’s launch it. Oops! It crashed! Note that streams inherit from EventEmitter, and if an error event has no handler, the whole Node.js process fails. That’s why, if we do not want our Node.js process to die because of an exception, we must install an error handler:

var fs = require('fs');
 
// fs.ReadStream inherits from stream.Readable
var stream = new fs.ReadStream("noFile.html");
 
stream.on('readable', function() {
    var data = stream.read();
    if (data){
        console.log(data.length);
    }
    else {
        console.log('data is null')
    }
});
 
stream.on('error', function(err) {
    if (err.code == 'ENOENT') {
        console.log("File not Found");
    } else {
        console.error(err);
    }
});

So, we use streams to work with data sources in Node.js. Here we’ve analyzed the basic scheme by which they work and a particular example – fs.ReadStream – which can read from a file.

This lesson’s code can be found in our repository.