High Performance Data Analytics with Cube.js Pre-Aggregations

This is an advanced tutorial. If you are just getting started with Cube.js, I recommend checking this tutorial first and then coming back here.

One of the most powerful features of Cube.js is pre-aggregations. Coupled with the data schema, they eliminate the need to organize, denormalize, and transform data before using it with Cube.js. The pre-aggregation engine builds a layer of aggregated data in your database at runtime and keeps it up to date.

Upon an incoming request, Cube.js will first look for a relevant pre-aggregation. If it cannot find one, it will build a new one. Once the pre-aggregation is built, all subsequent requests will go to the pre-aggregated layer instead of hitting the raw data. This can cut response times by a factor of hundreds or even thousands.
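The lookup-then-build flow can be illustrated with a small in-memory sketch. This is not how Cube.js is implemented; the sample rows, the `rollups` cache, and the function names are all assumptions made for illustration.

```javascript
// Raw rows stand in for the orders table (hypothetical sample data).
const orders = [
  { createdAt: '2017-01-03T10:00:00Z', amount: 100 },
  { createdAt: '2017-01-20T12:30:00Z', amount: 250 },
  { createdAt: '2017-02-05T08:15:00Z', amount: 400 },
];

// Cache of built rollups, keyed by a pre-aggregation name.
const rollups = new Map();

// Build a month-level rollup: one entry per month with the summed amount.
function buildMonthlyRollup(rows) {
  const byMonth = new Map();
  for (const row of rows) {
    const month = row.createdAt.slice(0, 7); // 'YYYY-MM'
    byMonth.set(month, (byMonth.get(month) || 0) + row.amount);
  }
  return byMonth;
}

// Serve a query: reuse the rollup if it exists, otherwise build it first.
function amountByMonth(rows) {
  let built = false;
  if (!rollups.has('amountByCreated')) {
    rollups.set('amountByCreated', buildMonthlyRollup(rows));
    built = true;
  }
  const result = [...rollups.get('amountByCreated')].map(
    ([month, amount]) => ({ month, amount })
  );
  return { built, result };
}

const first = amountByMonth(orders);
const second = amountByMonth(orders);
console.log(first.built, second.built); // true false
console.log(second.result);
```

Only the first call scans the raw rows; the second one reads the much smaller aggregated map.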

Pre-aggregations are materialized query results persisted as tables. In order to start using pre-aggregations, Cube.js should have write access to the stb_pre_aggregations schema where pre-aggregation tables will be stored.

Cube.js also takes care of keeping the pre-aggregation up-to-date. It performs refresh checks and if it finds that a pre-aggregation is outdated, it schedules a refresh in the background.

Creating a Simple Pre-Aggregation

Let’s take a look at the example of how we can use pre-aggregations to improve query performance.

For testing purposes, we will use a Postgres database and will generate around eleven million records (one per minute from 1997 through 2017) using the generate_series function.

$ createdb cubejs_test

The following SQL creates a table, orders, and inserts a sample of generated records into it.

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  amount integer,
  created_at timestamp without time zone
);
CREATE INDEX orders_created_at_amount ON orders(created_at, amount);

INSERT INTO orders (created_at, amount)
SELECT
  created_at,
  floor((1000 + 500*random())*log(row_number() over())) as amount
FROM generate_series
  ( '1997-01-01'::date
  , '2017-12-31'::date
  , '1 minutes'::interval) created_at;

Next, create a new Cube.js application if you don’t have one.

$ npm install -g cubejs-cli
$ cubejs create test-app -d postgres

Change the content of .env in the project folder to the following.

CUBEJS_API_SECRET=SECRET
CUBEJS_DB_TYPE=postgres
CUBEJS_DB_NAME=cubejs_test

Finally, generate a schema for the orders table and start the Cube.js server.

$ cubejs generate -t orders
$ npm run dev

Now, we can send a query to Cube.js with the Orders.amount measure and Orders.createdAt time dimension with granularity set to month.

curl \
  -H "Authorization: EXAMPLE-API-TOKEN" \
  -G \
  --data-urlencode 'query={
    "measures": ["Orders.amount"],
    "timeDimensions": [{
      "dimension": "Orders.createdAt",
      "granularity": "month",
      "dateRange": ["1997-01-01", "2017-01-01"]
    }]
  }' \
  http://localhost:4000/cubejs-api/v1/load

Cube.js will respond with Continue wait, because this query takes more than 5 seconds to process. Let’s look at Cube.js logs to see exactly how long it took for our Postgres to execute this query.

Performing query completed:
{
  "queueSize": 2,
  "duration": 6514,
  "queryKey": [
    "
      SELECT
        date_trunc('month', (orders.created_at::timestamptz at time zone 'UTC')) "orders.created_at_month",
        sum(orders.amount) "orders.amount"
      FROM
        public.orders AS orders
      WHERE (
        orders.created_at >= $1::timestamptz
        AND orders.created_at <= $2::timestamptz
      )
      GROUP BY 1
      ORDER BY 1 ASC limit 10000
    ",
    [
      "1997-01-01T00:00:00Z",
      "2017-01-01T23:59:59Z"
    ],
    []
  ]
}

It took 6,514 milliseconds (6.5 seconds) for Postgres to execute the above query. Although we have an index on the created_at and amount columns, it doesn’t help a lot in this particular case since we’re querying almost all the dates we have. The index would help if we query a smaller date range, but still, it would be a matter of seconds, not milliseconds.

We can significantly speed it up by adding a pre-aggregation layer. To do this, add the following preAggregations block to schema/Orders.js:

preAggregations: {
  amountByCreated: {
    type: `rollup`,
    measureReferences: [amount],
    timeDimensionReference: createdAt,
    granularity: `month`
  }
}

The block above instructs Cube.js to build and use a rollup type of pre-aggregation when the “Orders.amount” measure and “Orders.createdAt” time dimension (with “month” granularity) are requested together. You can read more about pre-aggregation options in the documentation reference.
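To make the matching rule concrete, here is a deliberately simplified sketch of the check described above. The plain-object shape of the rollup and the `canUseRollup` helper are assumptions for illustration; the matching logic inside Cube.js is more involved.

```javascript
// Hypothetical plain-object form of the amountByCreated rollup.
const amountByCreated = {
  measures: ['Orders.amount'],
  timeDimension: 'Orders.createdAt',
  granularity: 'month',
};

// Simplified rule: every requested measure must be in the rollup, and the
// time dimension and granularity must match exactly.
function canUseRollup(query, rollup) {
  const measuresCovered = query.measures.every((m) =>
    rollup.measures.includes(m)
  );
  const timeMatches = (query.timeDimensions || []).every(
    (td) =>
      td.dimension === rollup.timeDimension &&
      td.granularity === rollup.granularity
  );
  return measuresCovered && timeMatches;
}

const monthlyQuery = {
  measures: ['Orders.amount'],
  timeDimensions: [{ dimension: 'Orders.createdAt', granularity: 'month' }],
};
const dailyQuery = {
  measures: ['Orders.amount'],
  timeDimensions: [{ dimension: 'Orders.createdAt', granularity: 'day' }],
};

console.log(canUseRollup(monthlyQuery, amountByCreated)); // true
console.log(canUseRollup(dailyQuery, amountByCreated)); // false
```

The daily query falls through to the raw table because a month-level rollup no longer contains day-level detail.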

Now, when we send the same request again, Cube.js will detect the pre-aggregation declaration and start building it. Once it’s built, Cube.js will query it and send the result back. All subsequent queries will go to the pre-aggregation layer.

Here is how querying pre-aggregation looks in the Cube.js logs:

Performing query completed:
{
  "queueSize": 1,
  "duration": 5,
  "queryKey": [
    "
      SELECT
        "orders.created_at_month" "orders.created_at_month",
        sum("orders.amount") "orders.amount"
      FROM
        stb_pre_aggregations.orders_amount_by_created
      WHERE (
        "orders.created_at_month" >= ($1::timestamptz::timestamptz AT TIME ZONE 'UTC')
        AND
        "orders.created_at_month" <= ($2::timestamptz::timestamptz AT TIME ZONE 'UTC')
      )
      GROUP BY 1 ORDER BY 1 ASC LIMIT 10000
    ",
    [
      "1997-01-01T00:00:00Z",
      "2017-01-01T23:59:59Z"
    ],
    [
      [
        "
          CREATE TABLE
            stb_pre_aggregations.orders_amount_by_created
          AS SELECT
            date_trunc('month', (orders.created_at::timestamptz AT TIME ZONE 'UTC')) "orders.created_at_month",
            sum(orders.amount) "orders.amount"
          FROM
            public.orders AS orders
          GROUP BY 1
        ",
        []
      ]
    ]
  ]
}

As you can see, it now takes only 5 milliseconds (about 1,300 times faster) to get the same data. Note also that the SQL has changed: it now queries data from stb_pre_aggregations.orders_amount_by_created, which is the table generated by Cube.js to store the pre-aggregation for this query. The second query is the DDL statement that creates this pre-aggregation table.

Pre-Aggregations Refresh

Cube.js also takes care of keeping pre-aggregations up to date. By default, every two minutes on a new request Cube.js will initiate the refresh check.

You can set up a custom refresh check strategy by using refreshKey. The default strategy works the following way:

  • Check the max of time dimensions with “updated” in the name. If none exist…
  • Check the max of any existing time dimension. If none exist…
  • Check the count of rows for this cube.

If the result of the refresh check differs from the previous one, Cube.js will rebuild the pre-aggregation in the background and then hot swap it for the old one.
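The default strategy and the compare-with-last-result step can be sketched as follows. This is an illustration only: the `cube` summary object and both function names are hypothetical, and Cube.js runs these checks as SQL against the database rather than in memory.

```javascript
// Pick the value to watch, following the default order described above.
// `cube` is a hypothetical summary: max value per time dimension plus row count.
function defaultRefreshKey(cube) {
  const names = Object.keys(cube.timeDimensionMax);
  const updated = names.filter((n) => n.toLowerCase().includes('updated'));
  if (updated.length > 0) {
    // ISO 8601 strings in the same format sort chronologically.
    return 'max:' + updated.map((n) => cube.timeDimensionMax[n]).sort().pop();
  }
  if (names.length > 0) {
    return 'max:' + names.map((n) => cube.timeDimensionMax[n]).sort().pop();
  }
  return 'count:' + cube.rowCount;
}

// A rebuild is needed only when the key differs from the one stored last time.
function needsRebuild(previousKey, cube) {
  return defaultRefreshKey(cube) !== previousKey;
}

const before = {
  timeDimensionMax: { createdAt: '2017-12-30T00:00:00Z' },
  rowCount: 10,
};
const after = {
  timeDimensionMax: { createdAt: '2017-12-31T00:00:00Z' },
  rowCount: 11,
};

const stored = defaultRefreshKey(before);
console.log(needsRebuild(stored, before)); // false
console.log(needsRebuild(stored, after)); // true
```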

Next Steps

This guide is the first step to learning about pre-aggregations and how to start using them in your project. But there is much more you can do with them. You can find the pre-aggregations documentation reference here.

Also, here are some highlights with useful links to help you along the way.

Pre-aggregate queries across multiple cubes

Pre-aggregations work not only for measures and dimensions inside a single cube, but also across multiple joined cubes. If you have joined cubes, you can reference measures and dimensions from any part of the join tree. The example below shows how the Users.country dimension can be used with the Orders.count and Orders.revenue measures.

cube(`Orders`, {
  sql: `select * from orders`,

  joins: {
    Users: {
      relationship: `belongsTo`,
      sql: `${CUBE}.user_id = ${Users}.id`
    }
  },

  // …

  preAggregations: {
    categoryAndDate: {
      type: `rollup`,
      measureReferences: [count, revenue],
      dimensionReferences: [Users.country],
      timeDimensionReference: createdAt,
      granularity: `day`
    }
  }
});

Generate pre-aggregations dynamically

Since pre-aggregations are part of the data schema, which is essentially JavaScript code, you can create all the required pre-aggregations dynamically. This guide covers how you can dynamically generate a Cube.js schema.
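As a rough sketch of the idea, the function below generates one rollup definition per measure and granularity. The measure list and the `buildPreAggregations` helper are assumptions for illustration, and the values are plain strings so the snippet runs in plain Node; a real schema file uses the Cube.js reference syntax shown elsewhere in this tutorial.

```javascript
// Hypothetical list of measures that each need a daily and a monthly rollup.
const measures = ['count', 'amount'];
const granularities = ['day', 'month'];

// Build one rollup definition per (measure, granularity) pair.
function buildPreAggregations(measureNames, grains) {
  const preAggregations = {};
  for (const measure of measureNames) {
    for (const granularity of grains) {
      // e.g. 'count' + 'day' -> 'countByDay'
      const name =
        `${measure}By${granularity[0].toUpperCase()}${granularity.slice(1)}`;
      preAggregations[name] = {
        type: 'rollup',
        measureReferences: [measure],
        timeDimensionReference: 'createdAt',
        granularity,
      };
    }
  }
  return preAggregations;
}

const generated = buildPreAggregations(measures, granularities);
console.log(Object.keys(generated));
// [ 'countByDay', 'countByMonth', 'amountByDay', 'amountByMonth' ]
```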

Time partitioning

You can instruct Cube.js to partition pre-aggregations by time using the partitionGranularity option. Instead of a single table for the whole pre-aggregation, Cube.js will generate a set of smaller tables. This can reduce refresh time and, on databases such as BigQuery, cost.

Time partitioning documentation reference.

preAggregations: {
  categoryAndDate: {
    type: `rollup`,
    measureReferences: [count],
    timeDimensionReference: createdAt,
    granularity: `day`,
    partitionGranularity: `month`
  }
}
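The benefit of partitioning can be seen in a small in-memory sketch: daily rows are grouped into monthly partitions, and only the partitions touched by new data need rebuilding. The row shape and function names are hypothetical; Cube.js does this with separate database tables.

```javascript
// Group daily rollup rows into one partition per month ('YYYY-MM').
function partitionByMonth(rows) {
  const partitions = new Map();
  for (const row of rows) {
    const key = row.day.slice(0, 7);
    if (!partitions.has(key)) partitions.set(key, []);
    partitions.get(key).push(row);
  }
  return partitions;
}

// Only partitions that contain a changed day need to be rebuilt.
function partitionsToRefresh(changedDays) {
  return [...new Set(changedDays.map((day) => day.slice(0, 7)))];
}

const dailyRows = [
  { day: '2017-11-30', count: 12 },
  { day: '2017-12-01', count: 7 },
  { day: '2017-12-02', count: 9 },
];

const partitions = partitionByMonth(dailyRows);
console.log([...partitions.keys()]); // [ '2017-11', '2017-12' ]
console.log(partitionsToRefresh(['2017-12-02', '2017-12-03'])); // [ '2017-12' ]
```

New rows for December leave the November partition untouched, so the refresh scans one month of data instead of the whole history.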

Data Cube Lattices

Cube.js can automatically build rollup pre-aggregations without the need to specify which measures and dimensions to use. It learns from query history and selects an optimal set of measures and dimensions for a given query. Under the hood it uses the Data Cube Lattices approach.

This is very useful if you need a lot of pre-aggregations and you don’t know ahead of time exactly which ones. Using autoRollup saves you from manually coding all the possible aggregations.

You can find documentation for auto rollup here.

cube(`Orders`, {
  sql: `select * from orders`,

  preAggregations: {
    main: {
      type: `autoRollup`
    }
  }
});
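The core idea behind a data cube lattice is that a rollup built on a set of dimensions can also answer queries that use any subset of them, as long as the measures are additive (sums and counts). The sketch below illustrates only that covering rule; the object shapes and helper names are assumptions, and it is not the selection algorithm Cube.js uses.

```javascript
// A rollup can answer a query when it contains every member the query needs.
// This holds for additive measures such as sums and counts.
function covers(rollup, query) {
  return (
    query.measures.every((m) => rollup.measures.includes(m)) &&
    query.dimensions.every((d) => rollup.dimensions.includes(d))
  );
}

// Among covering rollups, prefer the one with the fewest dimensions,
// since it is typically the smallest table.
function pickRollup(rollups, query) {
  const candidates = rollups.filter((r) => covers(r, query));
  candidates.sort((a, b) => a.dimensions.length - b.dimensions.length);
  return candidates[0] || null;
}

// Hypothetical rollups at two levels of the lattice.
const wide = { measures: ['count', 'revenue'], dimensions: ['Users.country', 'status'] };
const narrow = { measures: ['count'], dimensions: ['status'] };

console.log(pickRollup([wide, narrow], { measures: ['count'], dimensions: ['status'] }) === narrow); // true
console.log(pickRollup([wide, narrow], { measures: ['revenue'], dimensions: ['Users.country'] }) === wide); // true
console.log(pickRollup([wide, narrow], { measures: ['count'], dimensions: ['city'] })); // null
```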






