Create a Scale-Out Hive Cluster With a Distributed, MySQL-Compatible Database

This post shows how to deploy a Hive cluster with TiDB to achieve horizontal scalability of Hive Metastore.

Hive Metastore supports various backend databases, of which MySQL is the most commonly used. In real-world scenarios, however, MySQL's shortcoming is clear: as Hive metadata grows, a standalone MySQL instance becomes the performance bottleneck. Combining individual MySQL instances into a cluster drastically increases complexity. In scenarios with huge amounts of metadata (for example, a single table with more than 10 million or even 100 million rows), MySQL is not a good choice.

We had this problem, and our migration story shows that TiDB, an open-source distributed Hybrid Transactional/Analytical Processing (HTAP) database, is a good fit for these scenarios.

In this post, I'll share with you how to create a Hive cluster with TiDB as the Metastore database at the backend so that you can use TiDB to horizontally scale Hive Metastore without worrying about database capacity.

Why Use TiDB in Hive as the Metastore Database?

TiDB is a distributed SQL database built by PingCAP and its open-source community. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. It's a one-stop solution for both Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) workloads.

Because of TiDB's distributed architecture, query performance on enormous amounts of data is not limited by the capability of a single machine. When the data volume approaches the cluster's capacity, you can add nodes to expand TiDB's storage.

Because TiDB is compatible with the MySQL protocol, it's easy to switch Hive's Metastore database to TiDB. You can use TiDB as if you were using MySQL, with almost no changes:

  • For an existing Hive cluster, you can use the mysqldump tool to migrate all data from MySQL to TiDB.
  • For a new Hive cluster, you can use the metadata initialization tool that comes with Hive.
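
As a rough illustration of the two paths above, the following Python sketch only builds the command lines involved; it does not run them. The host names, user names, the database name "hive", and the dump file name are assumptions made for the example, not values from this post. It also assumes TiDB is listening on its default SQL port, 4000, and that the target database already exists in TiDB.

```python
import shlex

TIDB_DEFAULT_PORT = 4000  # TiDB's default SQL port; adjust if yours differs


def dump_command(mysql_host, user, database, out_file, port=3306):
    """Build the argv that exports one database from the existing MySQL Metastore."""
    return [
        "mysqldump",
        "-h", mysql_host,
        "-P", str(port),
        "-u", user,
        "-p",                            # prompt for the password interactively
        "--single-transaction",          # consistent snapshot for InnoDB tables
        f"--result-file={out_file}",
        database,
    ]


def load_command(tidb_host, user, database, port=TIDB_DEFAULT_PORT):
    """Build the argv that connects to TiDB with the stock mysql client.

    The dump file is expected on stdin, e.g. `mysql ... < hive_metastore.sql`.
    """
    return ["mysql", "-h", tidb_host, "-P", str(port), "-u", user, "-p", database]


def init_schema_command():
    """Build the argv for Hive's bundled schema tool (new clusters only).

    TiDB speaks the MySQL protocol, so the MySQL schema type is used here.
    """
    return ["schematool", "-dbType", "mysql", "-initSchema"]


if __name__ == "__main__":
    # Hypothetical hosts and names, shown only to make the commands concrete.
    print(shlex.join(dump_command("mysql.example.internal", "hive", "hive", "hive_metastore.sql")))
    print(shlex.join(load_command("tidb.example.internal", "root", "hive")) + " < hive_metastore.sql")
    print(shlex.join(init_schema_command()))
```

Only one of the two paths applies to a given cluster: dump and load when migrating existing metadata, or the schema tool when starting from an empty Metastore.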

How to Create a Hive Cluster With TiDB

Creating a Hive cluster with TiDB involves the following steps:

  • Meet component requirements
  • Install a Hive cluster
  • Deploy a TiDB cluster
  • Configure Hive
  • Initialize metadata
  • Launch Metastore and test
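
The "Configure Hive" step in the list above largely comes down to pointing the Metastore's JDBC connection at TiDB in hive-site.xml. The sketch below is a minimal, hedged example of generating those properties in Python. The host, user, password, and database name are hypothetical, port 4000 is TiDB's default, and the driver class shown is the one used by MySQL Connector/J 5.x (Connector/J 8.x uses com.mysql.cj.jdbc.Driver instead).

```python
import xml.etree.ElementTree as ET


def metastore_properties(host, user, password, database="hive", port=4000):
    """Return the standard Hive JDBC properties, pointed at a TiDB endpoint.

    Because TiDB is MySQL compatible, a plain jdbc:mysql:// URL is assumed here.
    """
    return {
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{host}:{port}/{database}",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


def render_hive_site(properties):
    """Render a dict of properties as a hive-site.xml <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    props = metastore_properties("tidb.example.internal", "hive", "change-me")
    print(render_hive_site(props))
```

In a real deployment these properties would be merged into the existing hive-site.xml rather than replacing it, and the MySQL JDBC driver jar needs to be on Hive's classpath.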
