Lawrence  Lesch



Querybook: A Big Data Querying UI, Combining Collocated Table Metadata


Querybook is a Big Data IDE that allows you to discover, create, and share data analyses, queries, and tables. 


  • 📚 Organize analyses with rich text, queries, and charts
  • ✏️ Compose queries with autocompletion and hovering tooltip
  • 📈 Use scheduling + charting in DataDocs to build dashboards
  • 🙌 Live query collaborations with others
  • 📝 Add additional documentation to your tables
  • 🧮 Get lineage, sample queries, frequent user, search ranking based on past query runs

Getting started


Please install Docker before trying out Querybook.

Quick setup

Pull this repo and run make. Visit https://localhost:10001 when the build completes.

For more details on installation, click here


For infrastructure configuration, click here
For general configuration, click here

Supported Integrations

Query Engines


Authentication

  • User/Password
  • OAuth
    • Google Cloud OAuth
    • Okta OAuth
    • GitHub OAuth
  • LDAP


Metastore

Can be used to fetch schema and table information for metadata enrichment.

  • Hive Metastore
  • Sqlalchemy Inspect
  • AWS Glue Data Catalog

Result Storage

Use one of the following to store query results.

  • Database (MySQL, Postgres, etc)
  • S3
  • Google Cloud Storage
  • Local file

Result Export

Upload query results from Querybook to other tools for further analyses.

  • Google Sheets Export
  • Python export


Notification

Get notified upon completion of queries and DataDoc invitations via IM or email.

  • Email
  • Slack

User Interface

Query Editor editor.gif

Charting visualization.gif


Lineage & Analytics analytics.gif

Contributing Back


Check out the full documentation & feature highlights here.

Download Details:

Author: Pinterest
Source Code: 
License: Apache-2.0 license

#typescript #flask #presto #hive #notebook 

Justen  Hintz



Big Data & Hadoop for Beginners - Full Course in 12 Hours

This Big Data & Hadoop full course will help you understand and learn Hadoop concepts in detail. You'll learn: Introduction to Big Data, Hadoop Fundamentals, HDFS, MapReduce, Sqoop, Flume, Pig, Hive, NoSQL-HBase, Oozie, Hadoop Projects, Career in Big Data Domain, Big Data Hadoop Interview Q and A

Big Data & Hadoop Full Course In 12 Hours | BigData Hadoop Tutorial For Beginners 

This Edureka Big Data & Hadoop Full Course video will help you understand and learn Hadoop concepts in detail. This Big Data & Hadoop Tutorial is ideal for both beginners as well as professionals who want to master the Hadoop Ecosystem. Below are the topics covered in this Big Data Full Course:

  • Introduction to Big Data 
  • Hadoop Fundamentals
  • HDFS
  • MapReduce
  • Sqoop
  • Flume
  • Pig
  • Hive 
  • NoSQL-HBase
  • Oozie
  • Hadoop Projects
  • Career in Big Data Domain 
  • Big Data Hadoop Interview Q and A

#bigdata #hadoop #pig #hive #nosql 

Nigel  Uys



Tutorial Geospatial Analytics using Presto and Hive

Introduction to Geospatial Analytics

Geospatial analytics deals with data used to locate anything on the globe. From an Uber driver to someone finding their way around a new neighbourhood, everybody uses this data in one way or another. The underlying technology involves GPS (global positioning systems), GIS (geographical information systems), and RS (remote sensing). This blog explores the topic in depth, starting with the basics and then diving into the details.

Why is it important?

Geospatial data is needed for many things and is used daily for all sorts of reasons. From an ordinary person's commute to the missile systems of a country's defence organization, everything relies on it. The data is drawn from many sources: every phone with an active internet connection contributes to it in some way, and satellites collect it daily. Because it is so widely used in everyday life, it deserves significant attention. It can help with responding to natural hazards and disasters, and with studying global climate change, wildlife, natural resources, and more. It is also used for satellite imagery, whether for tactical purposes or for weather forecasting. Many tech giants, such as Uber, use it on a daily basis to make everyday life easier. To stand out in the market, a company has to extract this data efficiently and put it to use.

How to retrieve Geospatial Data?

There are various ways to do this, but Presto and Hive are the main tools used to extract and reshape data that runs to hundreds of petabytes, and to use it efficiently to make the lives of billions easier. This data is vital because it touches the vast majority of people and is used every second. GIS is the part that helps with collecting, storing, manipulating, analyzing, and presenting spatial data. Whatever is going on at a local, regional, or national level, whenever the question is "where", geospatial data comes into play. It would not be effective without visualization.

Geospatial Analytics Using Presto

Presto is an open-source distributed SQL query engine, used to answer queries of any size or type. It runs on Hadoop. It supports many non-relational sources as well as Teradata. It can query data where it resides, without moving the data to a separate system. Query execution runs in parallel over a purely memory-based architecture, with most results returning within seconds. Many tech giants use it. It is a popular choice for interactive queries over data ranging into hundreds of petabytes.

Geospatial Analytics Using Hive

Hive is a data warehouse infrastructure tool for processing structured data, developed on top of the Hadoop Distributed File System. It sits on top of Hadoop to summarize Big Data and makes querying and analyzing data easy.

What is the architecture of Hive?

It is an ETL and data warehousing tool built on top of Hadoop. It supports operations such as:

  • Analysis of large data sets
  • Data encapsulation
  • Ad-hoc queries

What are its major components?

  1. Client
  2. Services
  3. Processing & Resource Management
  4. Distributed Storage

Hive Clients

Hive supports applications written in languages such as Java, Python, and C++, using Thrift, JDBC, and ODBC drivers, so it is easy to write a Hive client application in the language of your choice. Its clients fall into three types:

  • Thrift Clients: Apache Hive's server is based on Thrift, so it can serve requests from any language that supports Thrift.
  • JDBC Clients: Java applications can connect to Hive using its JDBC driver.
  • ODBC Clients: The ODBC driver lets applications that support the ODBC protocol connect to Hive. It uses Thrift to communicate with the Hive server.

Hive Services

Hive provides various services:

  1. CLI (Command Line Interface) – The default shell provided by Hive, used to execute its queries and commands directly.
  2. Web Interface – Lets you execute queries and commands through a web-based GUI provided by Hive.
  3. Server – Built on Apache Thrift and also known as the Thrift Server. It allows different clients to submit requests and retrieve the final result.
  4. Driver – Responsible for receiving the queries submitted by clients. It compiles, optimizes, and executes the queries.

What is the architecture of Presto?

Presto has two central parts: the Coordinator and the Worker. It is an open-source distributed system that can run on multiple machines, and its distributed SQL query engine was built for fast analytic queries. A deployment includes one Coordinator and any number of Workers.

  • Coordinator – Accepts submitted queries and manages parsing, planning, and scheduling of query processing.
  • Worker – Processes the queries; adding more workers gives faster query processing.

What are its key components?

The key components of Presto are:

Coordinator

The Coordinator is the brain of any installation; it manages all the worker nodes for all query-related work. It gets results from the workers and returns the final output to the client. It communicates with workers and clients via REST.

Worker

A Worker executes tasks and processes data. Worker nodes share data amongst each other and get data from the Coordinator.

Catalog

A catalog contains information about the data, such as where the data is located, where the schema is located, and the data source.

Tables and Schemas

These mean the same as in a relational database. A table is a set of rows organized into named columns, and a schema is what you use to hold your tables.

Connector

A connector is used to help Presto integrate with an external data source.

Stages

To execute a query, Presto breaks it up into stages.

Tasks

Stages are implemented as a series of tasks that might get distributed over Workers.

Drivers and Operators

Tasks contain one or more parallel drivers, and a driver is a set of operators in memory. An operator consumes, transforms, and produces data.
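
The idea that an operator consumes, transforms, and produces data can be illustrated with a toy pipeline. This is only a sketch in plain Python, not Presto's actual code; the operator names (scan, filter_rows, project), the run_driver helper, and the sample rows are all made up for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator: produces rows one at a time."""
    for row in rows:
        yield row

def filter_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Consumes rows and passes on only those matching the predicate."""
    return (row for row in rows if predicate(row))

def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Transforms each row by keeping only the named columns."""
    return ({c: row[c] for c in columns} for row in rows)

def run_driver(rows: Iterable[Row]) -> List[Row]:
    """A 'driver' here is just a chain of operators applied in order."""
    pipeline = project(filter_rows(scan(rows), lambda r: r["price"] > 10), ["name"])
    return list(pipeline)

sample = [
    {"name": "pen", "price": 2},
    {"name": "lamp", "price": 25},
    {"name": "desk", "price": 120},
]
result = run_driver(sample)
print(result)
```

In a real engine, many such drivers would run in parallel inside each task, each working on a different slice of the data.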

What are the deployment strategies?

The deployment strategies for Hive are listed below:

Amazon EMR

Amazon EMR can be used to deploy the Hive metastore. Users can choose from three configurations that Amazon offers: Embedded, Local, or Remote. There are two options for creating an external Hive metastore for EMR:

  1. Use the AWS Glue Data Catalog
  2. Use Amazon RDS / Amazon Aurora

Cloud Dataproc

Apache Hive on Cloud Dataproc provides an efficient and flexible setup by storing Hive data in Cloud Storage and hosting the metastore in a MySQL database on Cloud SQL. It offers advantages such as flexibility and agility, letting users tailor the cluster configuration to specific workloads and scale the cluster as needed. It also helps save cost.

The deployment strategies for Presto

Amazon EMR

Amazon EMR lets you quickly spin up a managed EMR cluster with the Presto query engine and run interactive analysis on data stored in Amazon S3. Presto can be deployed in the cloud on Amazon Web Services; both Amazon EMR and Amazon Athena provide ways to build and run it.

Cloud Dataproc

A Dataproc cluster that includes the Presto component can be set up easily.

What are the various ways to optimise?

The various ways to optimise are described below:


Hive

  1. Tez execution engine – Tez is an application framework built on Hadoop YARN, and Hive can use it as its execution engine.
  2. Suitable file format – Choosing a file format that fits the data can drastically improve query performance. The ORC file format is well suited to this.
  3. Partitioning – By partitioning the entries into separate datasets, only the required data is read when the query executes, which makes it more efficient.
  4. Bucketing – Bucketing divides datasets into more manageable parts. Users can also set the number of buckets.
  5. Vectorization – Vectorized query execution improves performance by processing batches of 1024 rows at once instead of a single row each time.
  6. Cost-Based Optimization (CBO) – Performs optimization based on query cost. To use CBO, its parameters must be set at the beginning of the query.
  7. Indexing – Indexes can speed up query execution by reducing the time taken to find the relevant data. Note that index support was removed in Hive 3.0, so this applies to earlier versions.
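
Partitioning and bucketing are easier to picture with a small model. The sketch below is plain Python rather than Hive itself: the helper names and sample rows are hypothetical, and CRC32 stands in for Hive's own bucketing hash, which works differently.

```python
import zlib
from collections import defaultdict

def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a key to a bucket with a stable hash (CRC32 here, not Hive's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def partition_rows(rows, partition_column):
    """Group rows by the value of the partition column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_column]].append(row)
    return partitions

def query_partition(partitions, wanted_value):
    """Partition pruning: touch only the partition the query asks for."""
    return partitions.get(wanted_value, [])

rows = [
    {"country": "IN", "user": "a"},
    {"country": "US", "user": "b"},
    {"country": "IN", "user": "c"},
]
partitions = partition_rows(rows, "country")
india_rows = query_partition(partitions, "IN")
print(len(india_rows), "rows read instead of", len(rows))
```

A query that filters on the partition column never looks at the other partitions, while bucketing spreads the rows of one partition over a fixed number of files by hashing a column.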


Presto

  1. File format – The ORC file format is best suited for optimizing query execution in Presto.
  2. It can join automatically if the feature is enabled.
  3. The dynamic filtering feature optimizes JOIN queries.
  4. It has a connector configuration to skip corrupt records in input formats other than ORC, Parquet, and RCFile.
  5. Set task.max-worker-threads to the number of CPU cores multiplied by the hyper-threads per core on a worker node.
  6. Splits can be used to execute queries in Presto efficiently.
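
The task.max-worker-threads tip can be worked through as simple arithmetic, reading it as cores multiplied by hyper-threads per core (an interpretation of the tip above). The helper names and the 16-core, 2-thread machine below are assumptions for illustration only.

```python
def max_worker_threads(cpu_cores: int, threads_per_core: int) -> int:
    """Cores multiplied by hyper-threads per core, as the tip above suggests."""
    if cpu_cores < 1 or threads_per_core < 1:
        raise ValueError("both values must be positive")
    return cpu_cores * threads_per_core

def config_line(cpu_cores: int, threads_per_core: int) -> str:
    """Render the value as a line for a Presto config.properties file."""
    return f"task.max-worker-threads={max_worker_threads(cpu_cores, threads_per_core)}"

# Hypothetical worker node: 16 physical cores, 2 hyper-threads per core.
line = config_line(cpu_cores=16, threads_per_core=2)
print(line)
```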

What are the advantages?

The advantages of Hive and Presto are:

Hive

  1. It is a stable query engine with a large and active community.
  2. Its queries are similar to SQL, so they are easy for RDBMS professionals to understand.
  3. It supports the ORC, TextFile, RCFile, Avro, and Parquet file formats.

Presto

  1. It supports file formats such as ORC, Parquet, and RCFile, eliminating the need for data transformation.
  2. It works well with Amazon S3 queries and storage, and it can query data in mere seconds even when the data runs to petabytes.
  3. It also has an active community.

Geospatial Analytics Using Presto and Hive

Modelling geospatial data has quite a few complexities. Well-Known Text (WKT) is used to model different locations on the map, with types such as point and polygon shapes. In Hive, a spatial library provides spatial processing through User-Defined Functions and SerDes. With this library enabled, queries can be written in Hive Query Language (HQL), which is fairly close to SQL. You can therefore avoid complex MapReduce algorithms and stick to a more familiar workflow. The Presto geospatial plugin runs in production at Uber, where more than 90% of all geospatial traffic completes within 5 minutes. Compared with brute-force Hive MapReduce execution, Uber's geospatial plugin is more than 50x faster, leading to greater efficiency.
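
To make the point and polygon shapes concrete, here is a minimal sketch of a point-in-polygon check on WKT strings. It is plain Python written for this illustration, not the code used by Hive or Presto; the parser only handles simple POINT values and single-ring POLYGON values, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]

def parse_wkt_point(wkt: str) -> Point:
    """Parse 'POINT (x y)' into an (x, y) tuple."""
    match = re.fullmatch(r"\s*POINT\s*\(\s*([^\s()]+)\s+([^\s()]+)\s*\)\s*", wkt, re.IGNORECASE)
    if not match:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))

def parse_wkt_polygon(wkt: str) -> List[Point]:
    """Parse a single-ring 'POLYGON ((x y, x y, ...))' into a list of vertices."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(\s*(.*?)\s*\)\)\s*", wkt, re.IGNORECASE)
    if not match:
        raise ValueError(f"not a single-ring WKT polygon: {wkt!r}")
    ring = []
    for pair in match.group(1).split(","):
        x, y = pair.split()
        ring.append((float(x), float(y)))
    return ring

def contains(polygon: List[Point], point: Point) -> bool:
    """Ray casting: count the edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

zone = parse_wkt_polygon("POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))")
pickup = parse_wkt_point("POINT (2 2)")
far_away = parse_wkt_point("POINT (5 2)")
print(contains(zone, pickup), contains(zone, far_away))
```

SQL engines with geospatial support expose this kind of test as a built-in function, so in practice the check is written as part of a query rather than by hand.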

Summing up

Presto has the edge over Hive because it can also process unstructured data, and its query processing is faster than Hive's. Data is collected in humongous amounts daily, and it needs to be extracted efficiently and judiciously so that the software relying on it works better.


#analytics #presto #hive #geospatial 

Monty  Boehm



How to Create your First HIVE Script

Apache Hadoop : Create your First HIVE Script

As is the case with scripts in other languages such as SQL, Unix Shell etc., Hive scripts are used to execute a set of Hive commands collectively. This reduces the time and effort invested in writing and executing each command manually. This blog is a step-by-step guide to writing your first Hive script and executing it. Check out this Big Data Course to learn more about Hive scripts and commands in real projects.

Hive supports scripting from Hive 0.10.0 onwards. The Cloudera distribution for Hadoop (CDH4) quick VM comes with Hive 0.10.0 pre-installed (the CDH3 Demo VM uses Hive 0.9.0 and hence cannot run Hive scripts).

Execute the following steps to create your first Hive Script:

Step 1: Writing a script

Open a terminal in your Cloudera CDH4 distribution and give the below command to create a Hive Script.

command: gedit sample.sql

The Hive script file should be saved with .sql extension to enable the execution.

Edit the file and write a few Hive commands that will be executed using this script.

In this sample script, we will create a table, describe it, load the data into the table and retrieve the data from this table.

Create a table ‘product’ in Hive:

command: create table product (productid int, productname string, price float, category string) row format delimited fields terminated by ',';

Here { productid, productname, price, category} are the columns in the ‘product’ table.

“Fields terminated by ‘,’” indicates that the columns in the input file are separated by the ‘,’ delimiter. You can use other delimiters too. For example, the records in an input file can be separated by a newline (‘\n’) character.
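
To see what the delimiters mean for the input file, the sketch below splits comma-separated lines into typed records that mirror the columns of the ‘product’ table. It is plain Python for illustration, not how Hive reads the file; the Product type, the helper names, and the sample records are made up.

```python
from typing import List, NamedTuple

class Product(NamedTuple):
    productid: int
    productname: str
    price: float
    category: str

def parse_record(line: str, delimiter: str = ",") -> Product:
    """Split one input line on the delimiter and cast each field to its column type."""
    productid, productname, price, category = line.rstrip("\n").split(delimiter)
    return Product(int(productid), productname, float(price), category)

def parse_file_text(text: str, delimiter: str = ",") -> List[Product]:
    """Records are separated by newlines, fields by the delimiter."""
    return [parse_record(line, delimiter) for line in text.splitlines() if line.strip()]

# Hypothetical contents of input.txt
sample_text = "1,Laptop,55000.0,Electronics\n2,Notebook,40.5,Stationery\n"
products = parse_file_text(sample_text)
for product in products:
    print(product)
```

Unlike this strict parser, which raises an error on a malformed line, Hive is more lenient: a field that does not fit its column type comes back as NULL.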

Describe the Table :

command: describe product;

Load the data into the Table:

To load the data into the table, create an input file containing the records that need to be inserted into the table.

command: sudo gedit input.txt

Create a few records in the input text file as shown in the figure.

command: load data local inpath '/home/cloudera/input.txt' into table product;

Retrieving the data:

To retrieve the data use select command.

command: select * from product;

The above command will retrieve all the records from the table ‘product’.

The script should look like the one shown in the following image:


Save the sample.sql file and close the editor. You are now ready to execute your first Hive script.
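
As a rough illustration of how the four commands fit together in one file, the sketch below assembles the script text in Python. The statements are the ones from this walkthrough, with the table definition written in standard HiveQL column syntax (name, then type); the helper names build_script and write_script are made up for this example.

```python
from pathlib import Path

STATEMENTS = [
    "create table product (productid int, productname string, price float, category string) "
    "row format delimited fields terminated by ','",
    "describe product",
    "load data local inpath '/home/cloudera/input.txt' into table product",
    "select * from product",
]

def build_script(statements) -> str:
    """Put each statement on its own line, terminated by a semicolon."""
    return "".join(f"{stmt};\n" for stmt in statements)

def write_script(path: Path, statements=STATEMENTS) -> Path:
    """Write the assembled script to disk, e.g. as sample.sql."""
    path.write_text(build_script(statements), encoding="utf-8")
    return path

script_text = build_script(STATEMENTS)
print(script_text)
```

Calling write_script(Path("sample.sql")) would produce a file equivalent to the one typed into the editor by hand.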

Step 2: Execute the Hive Script

Execute the hive script using the following command:

Command: hive -f /home/cloudera/sample.sql

While executing the script, make sure that you give the entire path of the script location. If the script is in the current directory, the file name alone is enough.
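
If the script is launched from another program rather than typed at the terminal, the same command can be built with an absolute path so it does not depend on the current directory. This is a small sketch; the hive_command helper is hypothetical, and actually running the command requires Hive to be installed.

```python
import os
import shlex

def hive_command(script_path: str) -> list:
    """Build the argument list for 'hive -f <script>' using an absolute path."""
    return ["hive", "-f", os.path.abspath(script_path)]

cmd = hive_command("/home/cloudera/sample.sql")
print(shlex.join(cmd))
```

The list can then be handed to subprocess.run(cmd, check=True) on a machine where the hive executable is on the PATH.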

The following image shows that all the commands were executed successfully.


Congratulations on executing your first Hive script successfully! This Hive script knowledge is necessary to clear Big Data certifications.


#hive #hadoop #script 

Gerhard  Brink



When Small Files Crush Big Data — How to Manage Small Files in Your Data Lake

In this article, see how to manage small files in your data lake.

Big Data faces an ironic small file problem that hampers productivity and wastes valuable resources.

If not managed well, it slows down the performance of your data systems and leaves you with stale analytics. This kind of defeats the purpose, doesn’t it? HDFS stores small files inefficiently, leading to inefficient Namenode memory utilization and RPC calls, block-scanning throughput degradation, and reduced application layer performance. If you are a big data administrator on any modern data lake, you will invariably come face to face with the problem of small files. Distributed file systems are great but let’s face it, the more you split storage layers the greater your overhead is when reading those files. So the idea is to optimize the file size to best serve your use case, while also actively optimizing your data lake.

Small Files and the Business Impact

  • Slowing down reads — Reading through small files requires multiple seeks to retrieve data from each small file which is an inefficient way of accessing data.
  • Slowing down processing — Small files can slow down Spark, MapReduce, and Hive jobs. For example, MapReduce map tasks process one block at a time. Each file uses at least one map task, so if there are a large number of small files, each map task processes very little input. The larger the number of files, the larger the number of tasks.
  • Wasted storage — Hundreds of thousands of files that are 5 KB each or even 1 KB may be created daily while running jobs which adds up quickly. The lack of transparency on where they are located adds complexity.
  • Stale data — All of this results in stale data which can weigh down the entire reporting and analytics process of extracting value. If jobs don’t run fast or if responses are slow, decision making becomes slower and the data stops being as valuable. You lose the edge that the data is meant to bring in the first place.
  • Spending more time tackling operational issues than on strategic improvements — Resources end up being used to actively monitor jobs. If that dependency could be removed resources can be used to explore how to optimize the job itself such that a job that earlier took 4 hours now takes only 1 hour. So, this has a cascading effect.
  • Impacting ability to scale — Operational costs increase exponentially. If you grow 10x in the process, the rise in operating cost is not linear. This impacts your cost to scale.

While small files are a massive problem, they aren’t completely avoidable either. Following best practices and applying them effectively to your organization gives you control rather than leaving you firefighting. In any production system the focus is on keeping it up and running. As issues crop up, resources are deployed to tackle them.
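
The effect on processing can be put into numbers. The sketch below counts map tasks under the simple model described in the list above (one task per block, at least one per file); it ignores input formats that combine small files, and the file counts are made up for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a typical HDFS block size

def map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """One map task per block, and at least one per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

small_files = [5 * 1024] * 200_000            # 200,000 files of 5 KB each
total_bytes = sum(small_files)                # about 1 GB in total

# The same bytes rewritten as block-sized files plus one leftover file.
full_blocks, remainder = divmod(total_bytes, BLOCK_SIZE)
consolidated = [BLOCK_SIZE] * full_blocks + ([remainder] if remainder else [])

tasks_small = map_tasks(small_files)
tasks_consolidated = map_tasks(consolidated)
print(tasks_small, "tasks versus", tasks_consolidated)
```

Under this model the same gigabyte of data needs 200,000 map tasks as small files but only 8 once consolidated.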

The Small File Problem

Let’s take the case of HDFS, a distributed file system that is part of the Hadoop infrastructure, designed to handle large data sets. In HDFS, data is distributed over several machines and replicated to optimize parallel processing. As the data and metadata are stored separately, every file created, irrespective of size, occupies a minimum default block size in memory. Small files are files smaller than one HDFS block, typically 128 MB. Small files, even as small as 1 KB, put excessive load on the name node (which translates file system operations into block operations on the data nodes) and consume as much metadata storage space as a 128 MB file. Smaller file sizes also mean smaller clusters, as there are practical limits on the number of files (irrespective of size) that a name node can manage.
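
A back-of-the-envelope estimate shows why the name node suffers. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of name node memory per file or block entry; that figure is an assumption, not a number from this article, and real usage varies by Hadoop version.

```python
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb cost of one file or block entry
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB

def namenode_bytes(file_sizes, block_size=BLOCK_SIZE, per_object=BYTES_PER_OBJECT):
    """Each file costs one file entry plus one entry per block it occupies."""
    total = 0
    for size in file_sizes:
        blocks = max(1, -(-size // block_size))  # ceiling division, at least one block
        total += (1 + blocks) * per_object
    return total

one_kb_files = [1024] * 1_000_000   # about 1 GB stored as a million 1 KB files
big_files = [BLOCK_SIZE] * 8        # about 1 GB stored as eight 128 MB files

small_cost = namenode_bytes(one_kb_files)
big_cost = namenode_bytes(big_files)
print(small_cost, "bytes versus", big_cost, "bytes of metadata")
```

Under this model a single 1 KB file costs the name node exactly as much as a single 128 MB file, which is the point made above.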

#bigdata #spark #mapreduce #hive #data lake #hdfs data files

Ian  Robinson



Performance Tuning Techniques of Hive Big Data Table

  • Developers working on big data applications experience challenges when reading data from Hadoop file systems or Hive tables.
  • Consolidation job, a technique used to merge smaller files to bigger files, can help with the performance of reading Hadoop data.
  • With consolidation, the number of files is significantly reduced and query time to read the data will be faster.
  • Hive tuning parameters can also help with performance when you read Hive table data through a map-reduce job.

A Hive table is a big data table that relies on structured data. By default, it stores its data in the Hive warehouse. To store it at a specific location, the developer can set the location using a LOCATION clause during table creation. Hive follows the same SQL concepts, such as rows, columns, and schemas.

Developers working on big data applications commonly run into problems when reading Hadoop file system data or Hive table data. The data is written to Hadoop clusters by Spark streaming jobs, NiFi streaming jobs, or any other streaming or ingestion application. The ingestion job writes a large number of small data files to the Hadoop cluster. These files are also called part files.

These part files are written across different data nodes, and as the number of files in a directory increases, reading the data becomes tedious and a performance bottleneck for any other app or user. One of the reasons is that the data is distributed across nodes. Think about your data residing on multiple distributed nodes: the more scattered it is, the longer the read takes, roughly “N * (Number of files)”, where N is the number of nodes across each Name Node. For example, if there are 1 million files, a MapReduce job has to run mappers for 1 million files across the data nodes, which leads to full cluster utilization and performance issues.
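
The planning step of a consolidation job can be sketched as grouping part files until each group approaches a target size. This is a simplified greedy plan in plain Python, not the author's actual job; the function name, the file sizes, and the 128 MB target are assumptions.

```python
def plan_consolidation(file_sizes, target_size):
    """Greedily pack files, in order, into groups of up to target_size.

    A single file larger than target_size ends up in a group of its own.
    Returns a list of groups, each a list of indexes into file_sizes.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

part_files = [40, 50, 30, 90, 10, 60]   # sizes in MB, made up for the example
groups = plan_consolidation(part_files, target_size=128)
print(groups)
```

Each group would then be rewritten as one larger file, so a reader opens three files here instead of six.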

#apache hadoop #performance tuning #big data #hive #development #ai # ml & data engineering #article


Migration of Hive Metastore To Azure


While moving the Hadoop workload from an on-premise CDH cluster to Azure, we also had a task to move the existing on-premise Hive metastore. This article provides two of the best practices for Hive Metadata migration from on-premise to Azure HDInsight.

Method 1: Hive Metastore Migration Using DB Replication

Set up database replication between the on-premises Hive metastore DB and the HDInsight Hive metastore DB. Once the data is replicated, the following command can be used to update the locations recorded in the metastore:

./hive --service metatool -updateLocation wasb://<container_name>@<storage_account_name> hdfs://<namenode>:8020/

The above ‘hive metatool’ command rewrites the locations stored in the Hive metastore from the given HDFS URL to the target WASB/ADLS/ABFS URL. Note that metatool takes the new location first and the old location second.

Recommendation: This approach is recommended when either the source and target metadata DB are identical, or, when you are setting up or migrating existing applications.

Method 2: Hive Metastore Migration Using Scripts

  • Generate the Hive DDLs from the on-premises Hive metastore for myTable as an example, using the following script in the file:


  • Run the above shell script by using ‘metastoreDB’ as a parameter: bash metastoreDB
  • Edit the generated DDL into HiveTableDDL.hql and replace the HDFS URL with WASB/ADLS/ABFS URLs.
  • Run the updated DDL on the target Hive metastore DB being used on HDInsight cluster:


Ensure that the Hive metastore version is compatible between the on-premises and Azure HDInsight Hive instances.
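
The step of replacing HDFS URLs in the generated DDL can be automated. The sketch below is a hypothetical helper, not part of the original scripts; the table definition, the name node address, and the storage account names are placeholders invented for the example.

```python
import re

def rewrite_locations(ddl: str, old_prefix: str, new_prefix: str) -> str:
    """Replace old_prefix with new_prefix, but only inside LOCATION '...' clauses."""
    pattern = re.compile(r"(LOCATION\s+')" + re.escape(old_prefix), re.IGNORECASE)
    return pattern.sub(lambda m: m.group(1) + new_prefix, ddl)

# Placeholder DDL standing in for the contents of HiveTableDDL.hql
ddl = (
    "CREATE EXTERNAL TABLE myTable (id int)\n"
    "LOCATION 'hdfs://namenode:8020/warehouse/myTable';\n"
)
updated = rewrite_locations(
    ddl,
    old_prefix="hdfs://namenode:8020/",
    new_prefix="wasb://mycontainer@myaccount.blob.core.windows.net/",
)
print(updated)
```

Restricting the replacement to LOCATION clauses avoids touching an HDFS URL that happens to appear in a comment or a table property.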

Recommendation: This approach is recommended when either the source and target metadata DB are not identical, or when you are trying to set up a new environment.

Validation: To validate that the Hive metastore has been migrated completely, run the bash script from step 1 on both metastore DBs (i.e. source and target) to print all the Hive tables and their data locations.

Compare the outputs generated from the on-premise and Azure HDI to verify that no tables are missing in the new metastore DB.
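
The comparison can be done mechanically once both listings are saved. The sketch below assumes each line of output has the form "database.table", a tab, then the location; that layout is an assumption for illustration, so adjust the parsing to whatever the script actually prints.

```python
def parse_listing(text):
    """Parse lines of 'database.table<TAB>location' into a dict keyed by table name."""
    tables = {}
    for line in text.splitlines():
        if line.strip():
            name, _, location = line.partition("\t")
            tables[name.strip()] = location.strip()
    return tables

def missing_tables(source_tables, target_tables):
    """Return the tables present in the source metastore but absent from the target."""
    return sorted(set(source_tables) - set(target_tables))

# Made-up listings standing in for the two script outputs
source_text = "sales.orders\thdfs://nn/warehouse/orders\nsales.items\thdfs://nn/warehouse/items\n"
target_text = "sales.orders\twasb://c@a/warehouse/orders\n"

missing = missing_tables(parse_listing(source_text), parse_listing(target_text))
print(missing)
```

An empty result means every table from the on-premises metastore also exists in the new one.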

#azure #migration #hive #metastore


Learn Flutter With Smrity

Made by Learn Flutter With Smrity

This project works with Firebase so you need to import your own Google Services files on Android & iOS folder after creating your Firebase project. Remember to enable multidex in your flutter project.

Author: Smrity


#hive #nosql #database #roadmap

Snippet Coder



Hive ❤️ Flutter - Lightweight & Fast NoSQL Database 🔥| Learn in Just 15 Mins Video

In this video, we will learn the best way to use Hive, a lightweight and fast NoSQL database, in Flutter development. Beginner to advanced in just 30 mins.


#flutter #hive #nosql #mongodb #snippetcoder #sqlflite

Crypto Like



What is Hive coin (HIVE) | What is HIVE coin

What Is HIVE (HIVE)?

HIVE launched in March 2020 as a hard fork of the Steem (STEEM) blockchain with the idea of decentralization. Its developers claim that it is fully decentralized, fast, scalable and has a low barrier to entry. HIVE is a Graphene-based social blockchain and was designed to be an efficient platform for decentralized applications (DApps).

One of HIVE blockchain’s purported goals is to remove the elements of centralization that were present on the Steem blockchain. One of the HIVE ecosystem’s core components is Smart Media Tokens.

HIVE offers multiple services such as the HIVE Fund, free transactions via a resource credit freemium model, fast block confirmations of 3 seconds or less and time delay security in the form of vested HIVE & savings.

Developed for Web 3.0

Hive is an innovative and forward-looking decentralized blockchain and ecosystem, designed to scale with widespread adoption of the currency and platforms in mind. By combining the lightning-fast processing times and fee-less transactions, Hive is positioned to become one of the leading Web 3.0 blockchains used by people around the world.

Who Are the Founders of HIVE?

HIVE was co-founded by Olivier Roussy Newton and Harry Pokrandt.

Olivier Roussy Newton is the co-founder and president of HIVE Blockchain Technologies LTD. Aside from that, he also co-founded DeFi Holdings and Exponential Genomics Inc. (Xenomics). He is also the chairman at Quantum Holdings and the co-founder and director of Valour.

Harry Pokrandt is the co-founder and CEO of HIVE Blockchain Technologies LTD. Pokrandt is also the director of KORE Mining Ltd and the director of Blockhead Technologies. Before that, he worked at Sandspring Resources Ltd.

What Makes HIVE Unique?

The main thing that makes HIVE unique is the fact that it has a working ecosystem of applications, communities and individuals.

It is a decentralized blockchain and ecosystem that was designed to scale with widespread adoption. Through the combination of fast processing and zero-fee transactions, HIVE is aiming to become one of the leading Web 3.0 blockchains. According to the developers, HIVE is and will remain open-source.

How Many HIVE (HIVE) Coins Are There in Circulation?

HIVE (HIVE) has a circulating supply of 411,255,996 coins; no maximum supply data is available as of February 2021.

How Is the HIVE Network Secured?

The company behind HIVE holds all of the digital cryptocurrencies in cold storage. There are also secure HIVE wallets available for Windows, MacOS, Linux, iOS, Android & Web, which include: Vessel, Keychain, HIVEWallet, Ecency, HIVESigner, Actifit, Peakd and HIVE.Blog.

Hive is a passionate effort, created by a large group of Steem community members who have long looked to move towards true decentralization and help develop the code base. The years of distribution issues and reliance on a central entity for code and infrastructure has been at the heart of a revolution of sorts, and the new Hive blockchain is the culmination of meeting the challenge of returning to shared values of protecting and celebrating the vibrant community that has grown around the ecosystem.

Since inception, the Steem blockchain has been under de facto control of Steemit Inc, which launched the chain in March 2016. The company held a majority stake in STEEM through ninja-mined assets established initially as a development fund for the Steem blockchain, which has been a contentious part of Steem’s history, governance, and distribution for the better part of four years since.

At 14:00 UTC on Friday, block producers (called witnesses on the Steem blockchain) who want to participate in the new chain will upgrade their software, and the Hive blockchain will be created as a new fork away from the current chain ID. All current Steem blockchain users will automatically exist on Hive, and a mirror of current balances will be airdropped in the new HIVE token at the time of the snapshot and launch. The Steemit Inc accounts, along with accounts pushing for support of the recent governance attack, will exist on the network exactly as before, but will not be eligible for the airdrop.

All of the existing user content from Steem will be ported forwards to Hive as historical data, but from launch onward the two chains will be completely separate, with future posts and transfers only on the parent blockchain. There are no special procedures for claiming the Hive airdrop, and all existing Steem users will be able to access the Hive blockchain with their existing login information to get started. A majority of the ecosystem’s existing apps, alongside new projects and interfaces, are preparing to operate on the Hive network.

One of the most important and exciting features of the Hive blockchain is the trustless “Decentralized Hive Fund” (DHF) development model which allows community management for a portion of the airdrop earmarked as development funds. To maintain the overall supply of Hive at the same level as Steem for launch, a portion of HIVE tokens will be airdropped to the Hive dev fund to create a robust resource pool for decentralized development. These funds will not be usable during the launch period until a future Hive hardfork upgrade is deployed to continuously liquefy the tokens over time to prevent market flooding.

This community fork and the creation of the Hive blockchain have been organized by more than 30 community developers and over 80 contributors, and the numbers are growing quickly. The open source codebase will be implemented by many current Steem witnesses who are choosing to leave in pursuit of an ecosystem that truly values decentralization and an opportunity to be a more dynamic part of the future roadmap. The coordination, effort, and passion behind this fork have been widely discussed in crypto news, and mark a step forward for DPoS.

Investors in the Hive chain will initially be able to purchase tokens from Ionomy, Probit, and BlockTrades while Hive tokens are under review for listing at other exchanges. A growing number of exchanges, including Bittrex, Huobi, Binance, Bithumb, GOPAX, UpBit, and WazirX (with more close to finalizing at press time), have committed to supporting the airdrop and have published their intention to work with Hive.

Users of the Hive blockchain will be able to take advantage of three-second block times, free transactions, scalability, personalized names for accounts/wallets, escrow capabilities, and a rewards system for social media, gaming, and publishing use cases. Developers are welcome to join in the buzz and submit code for review and implementation, or to build their applications to run on the Hive blockchain. Through the Hive DHF, all users may submit a variety of proposals for funding, such as for development, business, or marketing, and take part in voting for fund management, governance, reward distribution, and block production when staking funds.

#bitcoin #crypto #hive coin #hive

Oleta Becker


Deep Dive Into Join Execution in Apache Spark

Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.

At a very high level, a Join operates on two input data sets: it matches each data record of one input data set against every data record of the other input data set. On finding a match or a non-match (as per a given condition), the Join operation outputs either an individual record from one of the two data sets or a Joined record. A Joined record is the combination of the individual matched records from both data sets.

Important Aspects of Join Operation

Let us now understand the three important aspects that affect the execution of Join operation in Apache Spark. These are:

1) Size of the Input Data sets: The size of the input data sets directly affects the execution efficiency and reliability of the Join operation. In addition, the relative size of the two input data sets affects which Join mechanism is selected, which in turn affects the efficiency and reliability of the Join.

2) The Join Condition: The condition or clause on the basis of which the input data sets are joined is termed the Join Condition. The condition typically involves logical comparison(s) between attributes belonging to the input data sets. Based on the Join condition, Joins are classified into two broad categories: Equi Joins and Non-Equi Joins.

  • Equi Joins involve either one equality condition or multiple equality conditions that must all be satisfied simultaneously. Each equality condition is applied between attributes from the two input data sets. For example, (A.x == B.x) and ((A.x == B.x) and (A.y == B.y)) are two Equi Join conditions on the x, y attributes of the two input data sets, A and B, participating in a Join operation.
  • Non-Equi Joins are not limited to equality conditions that must all hold together. They either use a non-equality comparison, or combine multiple equality conditions that need not be satisfied simultaneously. For example, (A.x < B.x) and ((A.x == B.x) or (A.y == B.y)) are two Non-Equi Join conditions on the x, y attributes of the two input data sets, A and B, participating in a Join operation.
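The difference between the two categories can be sketched in plain Python, without Spark. This is only an illustration: the sample records and the helper `matching_pairs` are hypothetical, and the four predicates mirror the example conditions listed above.

```python
# Hypothetical sample records for two input data sets, A and B.
A = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
B = [{"x": 1, "y": 10}, {"x": 2, "y": 99}, {"x": 5, "y": 30}]

# Equi Join conditions: equality checks that must all hold (combined with AND).
def equi_single(a, b):
    return a["x"] == b["x"]

def equi_multi(a, b):
    return a["x"] == b["x"] and a["y"] == b["y"]

# Non-Equi Join conditions: a non-equality comparison,
# or equality checks combined with OR.
def non_equi_less(a, b):
    return a["x"] < b["x"]

def non_equi_or(a, b):
    return a["x"] == b["x"] or a["y"] == b["y"]

def matching_pairs(left, right, condition):
    """Compare every left record with every right record (illustrative helper)."""
    return [(a, b) for a in left for b in right if condition(a, b)]

for cond in (equi_single, equi_multi, non_equi_less, non_equi_or):
    print(cond.__name__, len(matching_pairs(A, B, cond)))
```

Each predicate takes one record from A and one from B, so the same pairwise comparison loop works for every condition; only the predicate changes.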

3) The Join type: The Join type affects the outcome of the Join operation after the Join condition is applied between the records of the input data sets. Here is the broad classification of the various Join types:

Inner Join: Inner Join outputs only the matched Joined records (on the Join condition) from the input data sets.

Outer Join: In addition to the matched Joined records, Outer Join also outputs the non-matched records. Outer Join is further classified into left, right, and full outer Joins based on the choice of the input data set(s) whose non-matched records are output.

Semi Join: Semi Join outputs individual records belonging to only one of the two input data sets, either on a matched or on a non-matched instance. If the records are output on a non-matched instance, the Semi Join is also called an Anti Join.

Cross Join: Cross Join outputs all Joined records that are possible by combining each record from one input data set with every record of the other input data set.
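The Join types above can be sketched as a simple nested-loop join in plain Python. This is a minimal sketch and not how Spark executes Joins: the function `join`, the labels passed as `how`, and the sample records are all hypothetical, and the semi and anti variants here output records from the left data set only.

```python
def join(left, right, condition, how="inner"):
    """Nested-loop join over two lists of records (illustrative only)."""
    kinds = {"inner", "left_outer", "right_outer", "full_outer",
             "semi", "anti", "cross"}
    if how not in kinds:
        raise ValueError(f"unknown join type: {how}")
    if how == "cross":
        # Cross Join ignores the condition: every left record pairs with every right record.
        return [(lrec, rrec) for lrec in left for rrec in right]

    out = []
    matched_right = set()  # indexes of right records that matched at least once
    for lrec in left:
        hits = [i for i, rrec in enumerate(right) if condition(lrec, rrec)]
        matched_right.update(hits)
        if how == "semi":
            if hits:
                out.append(lrec)  # left record only, emitted once
        elif how == "anti":
            if not hits:
                out.append(lrec)  # left record with no match
        else:
            out.extend((lrec, right[i]) for i in hits)
            if not hits and how in ("left_outer", "full_outer"):
                out.append((lrec, None))  # non-matched left record
    if how in ("right_outer", "full_outer"):
        # Non-matched right records.
        out.extend((None, rrec) for i, rrec in enumerate(right)
                   if i not in matched_right)
    return out

# Hypothetical sample data.
left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2, "v": "x"}, {"id": 3, "v": "y"},
         {"id": 3, "v": "z"}, {"id": 4, "v": "w"}]

def same_id(lrec, rrec):
    return lrec["id"] == rrec["id"]

for how in ("inner", "left_outer", "right_outer", "full_outer",
            "semi", "anti", "cross"):
    print(how, len(join(left, right, same_id, how)))
```

Note that the semi variant emits the left record with id 3 once even though it matches two right records, which is what distinguishes it from an inner Join.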

Based on the above three important aspects of the Join execution, Apache Spark chooses the right mechanism to execute the Join.

#big data #hadoop #data science #data analytics #apache spark #hive #etl #machine learning & # ai #parallel programming

Hertha Walsh


Drone Fly — Decoupling Event Listeners from the Hive Metastore

We, the Expedia Group™ Data Platform Team, are building the next-gen petabyte-scale data lake. This next stage in the evolution of our data lake is based on our Apiary data lake pattern and utilizes a number of our open-source components like Waggle Dance, Circus Train, etc. A Hive metastore (HMS) proxied by the Waggle Dance service is usually the first point of contact for a user query to discover and analyze data. That makes the Hive metastore a critical piece of infrastructure.

We utilize a number of Hive metastore listeners that are installed in the Hive metastore to enable a variety of event-based use cases such as Shunting Yard, Cloverleaf, Beekeeper, Ranger policies, etc. Some of the open-source listeners that we use are:

#hadoop #big-data #data #software-engineering #hive

Kole Haag


Create a Scale-Out Hive Cluster With a Distributed, MySQL-Compatible Database

Hive Metastore supports various backend databases, among which MySQL is the most commonly used. However, in real-world scenarios, MySQL’s shortcoming is obvious: as metadata grows in Hive, MySQL is limited by its standalone performance and can’t deliver good performance. When individual MySQL databases form a cluster, the complexity drastically increases. In scenarios with huge amounts of metadata (for example, a single table has more than 10 million or even 100 million rows of data), MySQL is not a good choice.

We had this problem, and our migration story proves that TiDB, an open-source distributed Hybrid Transactional/Analytical Processing (HTAP) database, is a perfect solution in these scenarios.

In this post, I’ll share with you how to create a Hive cluster with TiDB as the Metastore database at the backend so that you can use TiDB to horizontally scale Hive Metastore without worrying about database capacity.

Why Use TiDB in Hive as the Metastore Database?

TiDB is a distributed SQL database built by PingCAP and its open-source community. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. It’s a one-stop solution for both Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) workloads.

In scenarios with enormous amounts of data, due to TiDB’s distributed architecture, query performance is not limited to the capability of a single machine. When the data volume reaches the bottleneck, you can add nodes to improve TiDB’s storage capacity.

Because TiDB is compatible with the MySQL protocol, it’s easy to switch Hive’s Metastore database to TiDB. You can use TiDB as if you were using MySQL, with almost no changes:

  • For the existing Hive cluster, you can use the mysqldump tool to replicate all data in MySQL to TiDB.
  • You can use the metadata initialization tool that comes with Hive to create a new Hive cluster.
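Because TiDB speaks the MySQL protocol, pointing the Metastore at it is mostly a matter of changing the JDBC connection properties in hive-site.xml. The sketch below generates such a fragment. The host name, database name, user, and password are hypothetical placeholders; the port assumes TiDB's default of 4000; and the driver class depends on which MySQL Connector/J version is installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical connection details; replace with your own TiDB endpoint and credentials.
TIDB_HOST = "tidb.example.internal"
TIDB_PORT = 4000          # assumed: TiDB's default MySQL-protocol port
METASTORE_DB = "hive"     # hypothetical database name for the Metastore schema

properties = {
    "javax.jdo.option.ConnectionURL":
        f"jdbc:mysql://{TIDB_HOST}:{TIDB_PORT}/{METASTORE_DB}",
    # Connector/J 5.x class name; Connector/J 8.x uses com.mysql.cj.jdbc.Driver.
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",        # placeholder
    "javax.jdo.option.ConnectionPassword": "change-me",   # placeholder
}

def build_hive_site(props):
    """Render name/value pairs as a hive-site.xml <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

hive_site_xml = build_hive_site(properties)
print(hive_site_xml)
```

The same properties would be used for a MySQL-backed Metastore; only the host and port in the connection URL differ.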

How to Create a Hive Cluster With TiDB

Creating a Hive cluster with TiDB involves the following steps:

  • Meet component requirements
  • Install a Hive cluster
  • Deploy a TiDB cluster
  • Configure Hive
  • Initialize metadata
  • Launch Metastore and test

#database #tutorial #mysql #hive #mysql database #scale out #hive cluster

Hive on Spark in Kubernetes

It is not easy to run Hive on Kubernetes. As far as I know, Tez, which is a Hive execution engine, can run only on YARN, not on Kubernetes.

There is an alternative way to run Hive on Kubernetes. Spark can run on Kubernetes, and Spark Thrift Server, which is compatible with HiveServer2, is a great candidate. That is, Spark will serve as the Hive execution engine.

I am going to talk about how to run Hive on Spark in a Kubernetes cluster.

All the code mentioned here can be cloned from my GitHub repo:

Assuming S3 Bucket and NFS as Kubernetes Storage Are Available

Before running Hive on Kubernetes, your S3 bucket and NFS-backed Kubernetes storage should be available to your Kubernetes cluster.

Your S3 bucket will be used to store the uploaded Spark dependency jars, Hive table data, etc.

NFS storage will be used to support the PVC ReadWriteMany access mode, which the Spark job needs.
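To make the ReadWriteMany requirement concrete, here is a minimal sketch that builds a PersistentVolumeClaim manifest as JSON, which kubectl accepts alongside YAML. The claim name, the storage class name `nfs`, and the size are hypothetical and depend on how NFS is provisioned in your cluster.

```python
import json

def build_rwx_pvc(name, storage_class, size):
    """Build a PersistentVolumeClaim manifest requesting ReadWriteMany access."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on multiple nodes mount the volume read-write.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }

# Hypothetical values for illustration only.
pvc = build_rwx_pvc("spark-shared-pvc", "nfs", "10Gi")
print(json.dumps(pvc, indent=2))
```

The printed manifest could be saved to a file and applied with `kubectl apply -f`, assuming a storage class with that name exists.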

If you have no such S3 bucket and NFS available, you can install them on your Kubernetes cluster manually, as I did:

#hive #kubernetes #spark #s3

Anil Sakhiya


Hive Tutorial | Hadoop For Beginners | Big Data For Beginners

Hive is a data warehouse infrastructure used to process structured data in Hadoop. It resides on top of Hadoop to summarize big data and make querying and analysis easy. With this in mind, we have put together this “Hive Tutorial”.

Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries written in HQL (Hive Query Language), which are internally converted to MapReduce jobs. Hive was developed by Facebook. It supports Data Definition Language, Data Manipulation Language, and user-defined functions. This tutorial covers basic and advanced concepts of Hive and is designed for beginners as well as professionals.

The following topics are covered in this video:

  • 00:00:00 Introduction
  • 00:00:40 Agenda
  • 00:13:25 What is Hive?
  • 00:22:20 Hive Architecture
  • 00:28:35 Hive Properties and Demo

Previously, you had to write complex MapReduce jobs, but with Hive you only need to submit SQL queries. Hive is mainly targeted at users who are comfortable with SQL. Hive automatically translates SQL-like HiveQL queries into MapReduce jobs. In this tutorial, we cover all the important topics, such as what Hive is, why and how to use it, and what its features are, to help you understand the concept with ease.
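To illustrate what translating a SQL-like query into a MapReduce job means, the sketch below hand-writes the map, shuffle, and reduce steps for a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept`. It is a conceptual illustration in plain Python, not the plan Hive actually generates; the table, its rows, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table.
rows = [
    {"name": "ann", "dept": "eng"},
    {"name": "bob", "dept": "ops"},
    {"name": "cat", "dept": "eng"},
]

def map_phase(rows):
    """Map: emit a (key, 1) pair for every row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1

def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key (COUNT(*) becomes a sum of ones)."""
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)
```

With Hive, the one-line query replaces all three hand-written phases.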

#big-data #hive #hadoop
