High Performance Data Analytics with Cube.js Pre-Aggregations

This is an advanced tutorial. If you are just getting started with Cube.js, I recommend checking this tutorial (https://cube.dev/blog/cubejs-open-source-dashboard-framework-ultimate-guide/) first and then coming back here.

One of the most powerful features of Cube.js is pre-aggregations. Coupled with the data schema, they eliminate the need to organize, denormalize, and transform data before using it with Cube.js. The pre-aggregation engine builds a layer of aggregated data in your database at runtime and keeps it up to date.

Upon an incoming request, Cube.js will first look for a relevant pre-aggregation. If it cannot find any, it will build a new one. Once the pre-aggregation is built, all subsequent requests will go to the pre-aggregated layer instead of hitting the raw data. This can reduce the response time by a factor of hundreds or even thousands.

Pre-aggregations are materialized query results persisted as tables. To start using pre-aggregations, Cube.js needs write access to the stb_pre_aggregations schema, where the pre-aggregation tables will be stored.

Cube.js also takes care of keeping the pre-aggregation up-to-date. It performs refresh checks and if it finds that a pre-aggregation is outdated, it schedules a refresh in the background.

Creating a Simple Pre-Aggregation

Let’s take a look at the example of how we can use pre-aggregations to improve query performance.

For testing purposes, we will use a Postgres database and will generate around ten million records using the generate_series function.

$ createdb cubejs_test

The following SQL creates a table, orders, and inserts a sample of generated records into it.

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  amount integer,
  created_at timestamp without time zone
);
CREATE INDEX orders_created_at_amount ON orders(created_at, amount);

INSERT INTO orders (created_at, amount)
SELECT
  created_at,
  floor((1000 + 500*random())*log(row_number() over())) as amount
FROM generate_series
  ( '1997-01-01'::date
  , '2017-12-31'::date
  , '1 minutes'::interval) created_at;

Next, create a new Cube.js application if you don’t have any.

$ npm install -g cubejs-cli
$ cubejs create test-app -d postgres

Change the content of .env in the project folder to the following.

CUBEJS_API_SECRET=SECRET
CUBEJS_DB_TYPE=postgres
CUBEJS_DB_NAME=cubejs_test

Finally, generate a schema for the orders table and start the Cube.js server.

$ cubejs generate -t orders
$ npm run dev

Now, we can send a query to Cube.js with the Orders.amount measure and the Orders.createdAt time dimension with granularity set to month.

curl \
  -H "Authorization: EXAMPLE-API-TOKEN" \
  -G \
  --data-urlencode 'query={
    "measures": ["Orders.amount"],
    "timeDimensions": [{
      "dimension": "Orders.createdAt",
      "granularity": "month",
      "dateRange": ["1997-01-01", "2017-01-01"]
    }]
  }' \
  http://localhost:4000/cubejs-api/v1/load

Cube.js will respond with Continue wait, because this query takes more than 5 seconds to process. Let’s look at Cube.js logs to see exactly how long it took for our Postgres to execute this query.
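
On the client side, a Continue wait answer means the request has to be repeated until the result is ready. Here is a minimal sketch of such a retry loop. It assumes that a still-running query is answered with a JSON body of the form { "error": "Continue wait" }. The function name loadWithRetry, its parameters, and the injected fetchFn are illustrative and not part of Cube.js.

```javascript
// Hypothetical helper: repeat a load request while the server answers
// "Continue wait". fetchFn is injected, so the global fetch (Node 18+)
// or any compatible implementation can be passed in.
async function loadWithRetry(fetchFn, apiUrl, token, query, options = {}) {
  const { delayMs = 1000, maxAttempts = 30 } = options;
  const params = new URLSearchParams({ query: JSON.stringify(query) });
  const url = `${apiUrl}?${params}`;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchFn(url, { headers: { Authorization: token } });
    const body = await response.json();

    // Assumed response shape for a query that is still being processed.
    if (body.error === 'Continue wait') {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      continue;
    }
    if (body.error) {
      throw new Error(body.error);
    }
    return body;
  }
  throw new Error(`Query not ready after ${maxAttempts} attempts`);
}
```

With Node 18 or later, the built-in fetch can be passed as fetchFn, together with the URL, token, and query from the curl example. On the server side, the log entry for the query looks like this: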

Performing query completed:
{
"queueSize":2,
"duration":6514,
"queryKey":[
"
SELECT
date_trunc('month', (orders.created_at::timestamptz at time zone 'UTC')) "orders.created_at_month",
sum(orders.amount) "orders.amount"
FROM
public.orders AS orders
WHERE (
orders.created_at >= $1::timestamptz
AND orders.created_at <= $2::timestamptz
)
GROUP BY 1
ORDER BY 1 ASC limit 10000
",
[
"1997-01-01T00:00:00Z",
"2017-01-01T23:59:59Z"
],
[]
]
}

It took 6,514 milliseconds (6.5 seconds) for Postgres to execute the above query. Although we have an index on the created_at and amount columns, it doesn't help a lot in this particular case since we're querying almost all the dates we have. The index would help if we query a smaller date range, but still, it would be a matter of seconds, not milliseconds.

We can significantly speed it up by adding a pre-aggregation layer. To do this, add the following preAggregations block to schema/Orders.js:

preAggregations: {
  amountByCreated: {
    type: `rollup`,
    measureReferences: [amount],
    timeDimensionReference: createdAt,
    granularity: `month`
  }
}

The block above instructs Cube.js to build and use a rollup type of pre-aggregation when the “Orders.amount” measure and “Orders.createdAt” time dimension (with “month” granularity) are requested together. You can read more about pre-aggregation options in the documentation reference.
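
To make the matching rule concrete, here is a simplified sketch of the check described above. It is an illustration only and not how Cube.js implements it. The function canUseRollup and the plain-object shapes used for the rollup and the queries are assumptions made for this example, and the real engine handles more cases than this strict comparison.

```javascript
// Hypothetical description of the amountByCreated rollup as a plain object.
const amountByCreated = {
  measures: ['Orders.amount'],
  timeDimension: 'Orders.createdAt',
  granularity: 'month',
};

// Simplified rule: every requested measure must be stored in the rollup,
// and the query must use the same time dimension and granularity.
function canUseRollup(rollup, query) {
  const measuresCovered = query.measures.every((m) => rollup.measures.includes(m));
  const timeDimensionsMatch = (query.timeDimensions || []).every(
    (td) => td.dimension === rollup.timeDimension && td.granularity === rollup.granularity
  );
  return measuresCovered && timeDimensionsMatch;
}

const monthlyQuery = {
  measures: ['Orders.amount'],
  timeDimensions: [{ dimension: 'Orders.createdAt', granularity: 'month' }],
};
const dailyQuery = {
  measures: ['Orders.amount'],
  timeDimensions: [{ dimension: 'Orders.createdAt', granularity: 'day' }],
};

console.log(canUseRollup(amountByCreated, monthlyQuery)); // true
console.log(canUseRollup(amountByCreated, dailyQuery)); // false
```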

Now, once we send the same request, Cube.js will detect the pre-aggregation declaration and will start building it. Once it's built, it will query it and send the result back. All the subsequent queries will go to the pre-aggregation layer.

Here is how querying pre-aggregation looks in the Cube.js logs:

Performing query completed:
{
"queueSize":1,
"duration":5,
"queryKey":[
"
SELECT
"orders.created_at_month" "orders.created_at_month",
sum("orders.amount") "orders.amount"
FROM
stb_pre_aggregations.orders_amount_by_created
WHERE (
"orders.created_at_month" >= ($1::timestamptz::timestamptz AT TIME ZONE 'UTC')
AND
"orders.created_at_month" <= ($2::timestamptz::timestamptz AT TIME ZONE 'UTC')
)
GROUP BY 1 ORDER BY 1 ASC LIMIT 10000
",
[
"1997-01-01T00:00:00Z",
"2017-01-01T23:59:59Z"
],
[
[
"
CREATE TABLE
stb_pre_aggregations.orders_amount_by_created
AS SELECT
date_trunc('month', (orders.created_at::timestamptz AT TIME ZONE 'UTC')) "orders.created_at_month",
sum(orders.amount) "orders.amount"
FROM
public.orders AS orders
GROUP BY 1
",
[]
]
]
]
}

As you can see, it now takes only 5 milliseconds (about 1,300 times faster) to get the same data. Note also that the SQL has changed: it now queries stb_pre_aggregations.orders_amount_by_created, which is the table generated by Cube.js to store the pre-aggregation for this query. The second query is the DDL statement that creates this pre-aggregation table.

Pre-Aggregations Refresh

Cube.js also takes care of keeping pre-aggregations up to date. By default, every two minutes on a new request Cube.js will initiate the refresh check.

You can set up a custom refresh check strategy by using refreshKey. The default strategy works the following way:

  • Check the max of time dimensions with updated in the name, if none exist…
  • Check the max of any existing time dimension, if none exist…
  • Check the count of rows for this cube.

If the result of the refresh check is different from the last one, Cube.js will initiate the rebuild of the pre-aggregation in the background and then hot swap the old one.
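
The fallback order of the default strategy can be sketched in a few lines. This is a hypothetical illustration and not Cube.js code: the helper names defaultRefreshKeySql and needsRebuild are made up, and for simplicity only the first matching column is used.

```javascript
// Hypothetical helper mirroring the fallback order listed above.
// timeColumns holds the names of the cube's time dimension columns.
function defaultRefreshKeySql(table, timeColumns) {
  const updatedColumns = timeColumns.filter((c) => c.toLowerCase().includes('updated'));
  if (updatedColumns.length > 0) {
    return `SELECT MAX(${updatedColumns[0]}) FROM ${table}`;
  }
  if (timeColumns.length > 0) {
    return `SELECT MAX(${timeColumns[0]}) FROM ${table}`;
  }
  return `SELECT COUNT(*) FROM ${table}`;
}

// A rebuild is needed when the refresh check result differs from the last one.
function needsRebuild(previousResult, currentResult) {
  return JSON.stringify(previousResult) !== JSON.stringify(currentResult);
}

console.log(defaultRefreshKeySql('orders', ['created_at', 'updated_at']));
// SELECT MAX(updated_at) FROM orders
console.log(defaultRefreshKeySql('orders', ['created_at']));
// SELECT MAX(created_at) FROM orders
console.log(defaultRefreshKeySql('orders', []));
// SELECT COUNT(*) FROM orders
```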

Next Steps

This guide is the first step to learning about pre-aggregations and how to start using them in your project. But there is much more you can do with them. You can find the pre-aggregations documentation reference here.

Also, here are some highlights with useful links to help you along the way.

Pre-aggregate queries across multiple cubes

Pre-aggregations work not only for measures and dimensions inside a single cube, but also across multiple joined cubes. If you have joined cubes, you can reference measures and dimensions from any part of the join tree. The example below shows how the Users.country dimension can be used with the Orders.count and Orders.revenue measures.

cube(`Orders`, {
  sql: `select * from orders`,

  joins: {
    Users: {
      relationship: `belongsTo`,
      sql: `${CUBE}.user_id = ${Users}.id`
    }
  },

  // …

  preAggregations: {
    categoryAndDate: {
      type: `rollup`,
      measureReferences: [count, revenue],
      dimensionReferences: [Users.country],
      timeDimensionReference: createdAt,
      granularity: `day`
    }
  }
});

Generate pre-aggregations dynamically

Since pre-aggregations are part of the data schema, which is essentially JavaScript code, you can create all the required pre-aggregations dynamically. This guide covers how you can dynamically generate a Cube.js schema.

Time partitioning

You can instruct Cube.js to partition pre-aggregations by time using the partitionGranularity option. Cube.js will generate not a single table for the whole pre-aggregation, but a set of smaller tables. It can reduce the refresh time and cost in the case of BigQuery for example.

Time partitioning documentation reference.

preAggregations: {
  categoryAndDate: {
    type: `rollup`,
    measureReferences: [count],
    timeDimensionReference: createdAt,
    granularity: `day`,
    partitionGranularity: `month`
  }
}

Data Cube Lattices

Cube.js can automatically build rollup pre-aggregations without the need to specify which measures and dimensions to use. It learns from query history and selects an optimal set of measures and dimensions for a given query. Under the hood it uses the Data Cube Lattices approach.

It is very useful if you need a lot of pre-aggregations and you don't know ahead of time which ones exactly. Using autoRollup will save you from coding manually all the possible aggregations.

You can find documentation for auto rollup here.

cube(`Orders`, {
  sql: `select * from orders`,

  preAggregations: {
    main: {
      type: `autoRollup`
    }
  }
});







Top 7 Most Popular Node.js Frameworks You Should Know

Node.js is an open-source, cross-platform runtime environment that allows developers to run JavaScript outside of a browser. In this post, you'll see seven of the most popular Node frameworks at this point in time (ranked from high to low by GitHub stars).

One of the main advantages of Node is that it enables developers to use JavaScript on both the front-end and the back-end of an application. This not only makes the source code of any app cleaner and more consistent, but it significantly speeds up app development too, as developers only need to use one language.

Node is fast, scalable, and easy to get started with. Its default package manager is npm, which means it also sports the largest ecosystem of open-source libraries. Node is used by companies such as NASA, Uber, Netflix, and Walmart.

But Node doesn't come alone. It comes with a plethora of frameworks. A Node framework can be pictured as the external scaffolding that you can build your app in. These frameworks are built on top of Node and extend the technology's functionality, mostly by making apps easier to prototype and develop, while also making them faster and more scalable.

Below are seven of the most popular Node frameworks at this point in time (ranked from high to low by GitHub stars).

Express

With over 43,000 GitHub stars, Express is the most popular Node framework. It brands itself as a fast, unopinionated, and minimalist framework. Express acts as middleware: it helps set up and configure routes to send and receive requests between the front-end and the database of an app.

Express provides lightweight, powerful tools for HTTP servers. It's a great framework for single-page apps, websites, hybrids, or public HTTP APIs. It supports over fourteen different template engines, and developers aren't forced into any specific ORM.

Meteor

Meteor is a full-stack JavaScript platform. It allows developers to build real-time web apps, i.e. apps where code changes are pushed to all browsers and devices in real-time. Additionally, servers send data over the wire, instead of HTML. The client renders the data.

The project has over 41,000 GitHub stars and is built to power large projects. Meteor is used by companies such as Mazda, Honeywell, Qualcomm, and IKEA. It has excellent documentation and a strong community behind it.

Koa

Koa is built by the same team that built Express. It uses ES6 features that allow developers to work without callbacks. Developers also have more control over error handling. Koa has no middleware within its core, which gives developers more control over configuration, but it also means that traditional Node middleware (e.g. functions taking req, res, next) won't work with Koa.
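
The difference in middleware signatures can be sketched without installing either framework. The two small runners below are illustrative stand-ins written for this example, not code from Express or Koa. The first calls callback-style functions of the form (req, res, next). The second calls promise-based functions of the form (ctx, next).

```javascript
// Callback-style runner: each middleware receives (req, res, next).
function runCallbackStyle(middlewares, req, res, done) {
  let index = 0;
  function next(err) {
    if (err) return done(err);
    const middleware = middlewares[index++];
    if (!middleware) return done();
    middleware(req, res, next);
  }
  next();
}

// Promise-style runner: each middleware receives (ctx, next) and can
// await next() to run code after the downstream middleware has finished.
function composePromiseStyle(middlewares) {
  return function run(ctx) {
    function dispatch(i) {
      const middleware = middlewares[i];
      if (!middleware) return Promise.resolve();
      return Promise.resolve(middleware(ctx, () => dispatch(i + 1)));
    }
    return dispatch(0);
  };
}

const callbackLog = [];
runCallbackStyle(
  [
    (req, res, next) => { callbackLog.push('first'); next(); },
    (req, res, next) => { callbackLog.push('second'); next(); },
  ],
  {},
  {},
  () => console.log(callbackLog.join(' -> ')) // first -> second
);

const promiseLog = [];
const promiseDone = composePromiseStyle([
  async (ctx, next) => { promiseLog.push('before'); await next(); promiseLog.push('after'); },
  async (ctx, next) => { promiseLog.push('handler'); await next(); },
])({}).then(() => console.log(promiseLog.join(' -> '))); // before -> handler -> after
```

A function written for one runner cannot simply be passed to the other, because the arguments it receives and the way it signals completion are different.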

Koa already has over 26,000 GitHub stars. The Express developers built Koa because they wanted a lighter framework that was more expressive and more robust than Express. You can find out more about the differences between Koa and Express here.

Sails

Sails is a real-time, MVC framework for Node that's built on Express. It supports auto-generated REST APIs and comes with an easy WebSocket integration.

The project has over 20,000 stars on GitHub and is compatible with almost all databases (MySQL, MongoDB, PostgreSQL, Redis). It's also compatible with most front-end technologies (Angular, iOS, Android, React, and even Windows Phone).

Nest

Nest has over 15,000 GitHub stars. It uses progressive JavaScript and is built with TypeScript, which means it comes with strong typing. It combines elements of object-oriented programming, functional programming, and functional reactive programming.

Nest is packaged in such a way that it serves as a complete development kit for writing enterprise-level apps. The framework uses Express, but is compatible with a wide range of other libraries.

LoopBack

LoopBack is a framework that allows developers to quickly create REST APIs. It has an easy-to-use CLI wizard and allows developers to create models either on their schema or dynamically. It also has a built-in API explorer.

LoopBack has over 12,000 GitHub stars and is used by companies such as GoDaddy, Symantec, and the Bank of America. It's compatible with many REST services and a wide variety of databases (MongoDB, Oracle, MySQL, PostgreSQL).

Hapi

Similar to Express, hapi serves data by intermediating between server-side and client-side. As such, it can serve as a substitute for Express. Hapi allows developers to focus on writing reusable app logic in a modular and prescriptive fashion.

The project has over 11,000 GitHub stars. It has built-in support for input validation, caching, authentication, and more. Hapi was originally developed to handle all of Walmart's mobile traffic during Black Friday.

20. Node.js Lessons. Data Streams in Node.JS, fs.ReadStream

Hey all! Our topic for today is data streams in Node.js. We will try to cover the subject in detail because, on the one hand, ordinary browser JavaScript development lacks streams, and, on the other hand, knowing and understanding how streams work is necessary for seamless server-side development: a stream is a versatile, universally used way of working with data sources.

We can define two general stream types. The first one is

stream.Readable

It is a built-in class that provides streams for reading. The class itself is rarely used directly, but its descendants are quite popular. In particular, we use fs.ReadStream to read from a file. To read a visitor's request while handling it, there is a special object already familiar to us under the name req, which is the first argument of a request handler.

The second one is

stream.Writable

It is the general-purpose class for writing. stream.Writable itself is rarely used directly, but its descendants, fs.WriteStream and res, are quite common.

There are some other stream types, but the most popular are these two and their variations.

The best way to understand streams is to see how they work in practice. So, right now we’ll start with using fs.ReadStream for reading a file. Let us create a file fs.js:

var fs = require('fs');
 
// fs.ReadStream inherits from stream.Readable
var stream = new fs.ReadStream(__filename);
 
stream.on('readable', function() {
    var data = stream.read();
    console.log(data);
});
 
stream.on('end', function() {
    console.log("THE END");
});

So, we load the fs module and create a stream:

var fs = require('fs');
 
var stream = new fs.ReadStream(__filename);

A stream is a JavaScript object that receives information about our resource, in our case the path to the file (__filename), and knows how to work with that resource. fs.ReadStream implements the standard reading interface described in the stream.Readable class. Let us have a detailed look.

When a stream.Readable object is created, it connects to its data source, which is a file in our case, and tries to start reading from it. Once it has read something, it emits the readable event. This event means that data has been read and is held in the stream's internal buffer, from which it can be retrieved by calling read(). Then we can do something with the data and wait for the next readable. This cycle then repeats.

When the data source runs out of data, the stream emits the end event, which means there will be no more data. Some sources never run out (a random data generator, for example), but a file has a limited size, so in our case end will eventually arrive. Moreover, we can call the destroy() method at any point while working with the stream. It means we do not need the stream anymore, so it can be closed together with its underlying data source, and everything can be cleaned up.

So, let us refer to the original code. Here we create ReadStream, and it immediately wants to open up a file:

var stream = new fs.ReadStream(__filename);

However, this does not have to happen on this very line, because every input/output operation is performed through libuv. libuv is designed so that the handlers of asynchronous input/output operations are called no earlier than the next event loop iteration, that is, after the current JavaScript has finished its work. This means we can safely attach all our handlers after creating the stream, knowing that they will be installed before the first data fragment is read. Launch fs.js.

Look at what has appeared in the console. First came the readable event, and it output the data. Right now it is an ordinary buffer, but we can turn it into a string by specifying the encoding when the stream is opened.

var stream = new fs.ReadStream(__filename, {encoding: 'utf-8'});

Thus, the conversion will be automatic. When the file ends, the end event outputs THE END to the console. Here the file ended almost immediately because it is small. Let us modify our example a little by reading a file big.html from the current directory instead of the current file. Download this HTML file from our repository together with the other lesson materials.

Launch it. The file big.html is big, so the readable event fired several times, and each time we received another data fragment as a buffer. So, let us print the length of each fragment:

var fs = require('fs');
 
// fs.ReadStream inherits from stream.Readable
var stream = new fs.ReadStream("big.html");
 
stream.on('readable', function() {
    var data = stream.read();
    if (data){
        console.log(data.length);
    }
    else {
        console.log('data is null')
    }
});
 
stream.on('end', function() {
    console.log("THE END");
});
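
A variation of the same experiment that does not depend on big.html is sketched below. It writes a temporary file and records the size of every chunk. The file name, the 200,000-byte size, and the helper chunkSizes are arbitrary choices for this illustration. For brevity it listens to the data event instead of calling read(). highWaterMark is the fs.createReadStream option that caps the chunk size, and 64 KB is its default for file streams.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Illustrative helper: resolves with the length of every chunk read.
function chunkSizes(filePath, highWaterMark) {
  return new Promise(function (resolve, reject) {
    var sizes = [];
    var stream = fs.createReadStream(filePath, { highWaterMark: highWaterMark });
    stream.on('data', function (chunk) { sizes.push(chunk.length); });
    stream.on('error', reject);
    // Resolve on 'close' so that the file is no longer open afterwards.
    stream.on('close', function () { resolve(sizes); });
  });
}

// Create a 200,000-byte temporary file to read from.
var tmpFile = path.join(os.tmpdir(), 'stream-chunks-demo.bin');
fs.writeFileSync(tmpFile, Buffer.alloc(200000));

var demo = chunkSizes(tmpFile, 64 * 1024).then(function (sizes) {
  fs.unlinkSync(tmpFile); // remove the temporary file again
  var total = sizes.reduce(function (a, b) { return a + b; }, 0);
  console.log('chunks:', sizes.length, 'total bytes:', total);
  return sizes;
});
```

Passing a smaller highWaterMark, for example 16 * 1024, should produce more and smaller chunks while the total stays the same.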

Get it launched. These numbers are the lengths of the file fragments that were read. When a stream opens a file, it reads only a part of it, not the whole file, and places that part in its internal buffer. The maximum fragment size is 64 KB. Until we call stream.read(), it won't read further. Once we've received the data, the internal buffer is cleared and is ready for the next fragment, and so on. The length of the last fragment is 60,959 B.

This example demonstrates the key advantages of streams. First, they help save memory: whatever the size of our big file, we only handle a small part of it at any moment. The second, less obvious advantage is the versatility of the interface. Here we use a ReadStream created from a file, but we can replace it at any time with any other stream built on our own resource:

var stream = new OurStream("our resource");
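
As an illustration of such a replacement, the sketch below defines a small custom stream and consumes it with the same kind of readable and end handlers that were used for the file. The class name CountdownStream and its behavior are invented for this example. The only requirement taken from the lesson is that it inherits from stream.Readable.

```javascript
var stream = require('stream');

// Hypothetical source: emits the numbers from `start` down to 1 as text.
class CountdownStream extends stream.Readable {
  constructor(start) {
    super();
    this.current = start;
  }

  // Called by the stream machinery whenever it wants more data.
  _read() {
    if (this.current > 0) {
      this.push(String(this.current--) + '\n');
    } else {
      this.push(null); // signals that there is no more data
    }
  }
}

var source = new CountdownStream(3);
var received = '';

source.on('readable', function () {
  var data;
  while ((data = source.read()) !== null) {
    received += data.toString();
  }
});

source.on('end', function () {
  console.log(received); // the lines 3, 2, 1
});
```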

The rest of the code won't need any changes, because streams are, first of all, an interface. In theory, if our stream implements all the necessary events and methods, in particular if it inherits from stream.Readable, everything should work. Of course, this is true only if we do not use any special features that only file streams have. To be more specific, the fs.ReadStream stream has some extra events.

In the event diagram for fs.ReadStream, the additional events are open, which comes first when the file is opened, and close, which comes last when the file is closed. Note that if a file is read to its end, the end event occurs, followed by close. If the file is not read entirely, for instance because of an error or a call to the destroy method, there will be no end, because the file has not reached its end. The close event, however, is always emitted when the file is closed.
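
This event order can be checked with a short sketch. It writes a small temporary file (the name is arbitrary) and records which of open, end, and close fire, once for a full read and once when destroy() is called right after the file is opened. The helper recordEvents is made up for this illustration.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Illustrative helper: resolves with the names of the events, in order.
function recordEvents(filePath, destroyOnOpen) {
  return new Promise(function (resolve, reject) {
    var events = [];
    var stream = fs.createReadStream(filePath);
    stream.on('open', function () {
      events.push('open');
      if (destroyOnOpen) stream.destroy();
    });
    stream.on('end', function () { events.push('end'); });
    stream.on('close', function () {
      events.push('close');
      resolve(events);
    });
    stream.on('error', reject);
    stream.resume(); // consume the data so that the stream can reach its end
  });
}

var eventsFile = path.join(os.tmpdir(), 'stream-events-demo.txt');
fs.writeFileSync(eventsFile, 'hello streams');

var fullRead = recordEvents(eventsFile, false);
var destroyed = recordEvents(eventsFile, true);

fullRead.then(function (events) { console.log('full read:', events.join(', ')); });
destroyed.then(function (events) { console.log('destroyed:', events.join(', ')); });

// Remove the temporary file once both streams have been closed.
Promise.all([fullRead, destroyed]).then(function () { fs.unlinkSync(eventsFile); });
```

If the behavior matches the description above, the first line lists open, end, close, and the second lists only open, close.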

Finally, the last but not least detail here is error handling. Let us see what happens if the file does not exist.

var stream = new fs.ReadStream("noFile.html");

So, I launch it. Oops! It crashed! Note that streams inherit from EventEmitter. If an error event is emitted and has no handler, it turns into an exception and the whole Node.js process fails. That's why, if we do not want Node.js to crash because of such an exception, we should install a handler:

var fs = require('fs');
 
// fs.ReadStream inherits from stream.Readable
var stream = new fs.ReadStream("noFile.html");
 
stream.on('readable', function() {
    var data = stream.read();
    if (data){
        console.log(data.length);
    }
    else {
        console.log('data is null')
    }
});
 
stream.on('error', function(err) {
    if (err.code == 'ENOENT') {
        console.log("File not Found");
    } else {
        console.error(err);
    }
});

So, we use streams to work with data sources in Node.js. Here we've looked at the basic scheme they follow and at a particular example, fs.ReadStream, which reads from a file.

The code for this lesson can be found in our repository.