In my previous blog on the HBase Tutorial, I explained what HBase is and its features. I also walked through the Facebook Messenger case study to help you connect the concepts to a real system. Moving ahead, I will explain the HBase data model and the HBase architecture. Before you move on, you should also know that HBase is an important concept and an integral part of the course curriculum for the Big Data Hadoop Certification.
The important topics that I will be taking you through in this HBase architecture blog are:
Let us first understand the data model of HBase; it is what enables HBase's fast reads, writes, and searches.
As we know, HBase is a column-oriented NoSQL database. Although it looks similar to a relational database, which contains rows and columns, it is not one: relational databases are row-oriented while HBase is column-oriented. So, let us first understand the difference between column-oriented and row-oriented databases:
Row-oriented vs. Column-oriented Databases:
To better understand it, let us take an example and consider the table below.
If this table is stored in a row-oriented database, the records will be stored as shown below:
1, Paul Walker, US, 231, Gallardo
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases, data is stored row by row, as tuples, as you can see above.
A column-oriented database, on the other hand, stores the same data as:
1, 2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
In a column-oriented database, all the values of a column are stored together: the values of the first column are stored contiguously, then the values of the second column, and so on for the remaining columns.
An HBase table has the following components, shown in the image below. Put more simply, we can say an HBase table consists of:
- Tables: data is stored in a table format with rows and columns.
- Row keys: every row is identified by a unique row key, and rows are kept sorted by it.
- Column families: columns are grouped into column families, and the data of a column family is stored together.
- Column qualifiers: the name of an individual column within a column family.
- Cells: a cell is addressed by the combination of row key, column family, and column qualifier, and it holds a value.
- Versions: every cell value carries a timestamp, so multiple versions of a value can coexist.
A minimal code sketch of this model follows the list below.
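To make the model concrete, here is a hedged sketch using the HBase 2.x Java client API. The table name `drivers`, column family `personal`, qualifier `name`, and the row key are hypothetical names chosen for illustration, not anything defined in this blog:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("drivers"))) {

            // A cell is addressed by (row key, column family, column qualifier)
            // and versioned by a timestamp.
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("Paul Walker"));
            table.put(put);

            // Read the cell back, asking for up to 3 stored versions.
            Get get = new Get(Bytes.toBytes("1")).readVersions(3);
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
        }
    }
}
```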
Now that you know about the HBase data model, let us see how this data model fits into the HBase architecture and makes HBase suitable for large-scale storage and fast processing.
HBase has three major components: the HMaster Server, the HBase Region Servers (which host the Regions), and ZooKeeper.
The below figure shows the hierarchy of the HBase architecture; we will talk about each component individually.
Before going to the HMaster, let us understand Regions, since all of these servers (HMaster, Region Servers, ZooKeeper) exist to coordinate and manage Regions and to perform various operations inside them. So, you might be curious to know what Regions are and why they are so important.
A region contains all the rows between the start key and the end key assigned to it. An HBase table is divided into a number of regions in such a way that all the columns of a column family are stored together in one region. Each region holds its rows in sorted order.
Many regions are assigned to a Region Server, which is responsible for handling, managing, and executing read and write operations on that set of regions.
So, to conclude in a simpler way: a table is divided into regions by row-key range, each region holds a sorted, contiguous range of rows, and each region is served by one Region Server.
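Because a region is just a row-key range, a table can even be created pre-split into regions. Here is a hedged sketch using the HBase 2.x Admin API; the table name, column family, and split keys are hypothetical, and imports follow the earlier sketch:

```java
// Additional imports assumed:
// import org.apache.hadoop.hbase.client.Admin;
// import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
// import org.apache.hadoop.hbase.client.TableDescriptor;
// import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Admin admin = connection.getAdmin()) {
    TableDescriptor desc = TableDescriptorBuilder
        .newBuilder(TableName.valueOf("drivers"))
        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
        .build();
    // Three split keys yield four regions: (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf).
    byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t") };
    admin.createTable(desc, splitKeys);
}
```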
Now, starting from the top of the hierarchy, I would first like to explain the HMaster Server, which acts much like the NameNode in HDFS. Then, moving down the hierarchy, I will take you through ZooKeeper and the Region Server.
As you can see in the below image, the HMaster handles a collection of Region Servers, which reside on DataNodes. Let us understand how the HMaster does that.
HBase runs in a huge, distributed environment, where the HMaster alone is not sufficient to manage everything. So, you might wonder what helps the HMaster manage this environment. That's where ZooKeeper comes into the picture. Now that we have seen how the HMaster manages the HBase environment, let us understand how ZooKeeper helps it do so.
The below image explains ZooKeeper's coordination mechanism.
Since I mentioned the .META server while discussing ZooKeeper, let me first explain what it is; this will make it easy to relate the roles of ZooKeeper and the .META server. Later in this blog, when I explain the HBase search mechanism, I will show how the two work together.
I already discussed the Region Server and its functions while explaining Regions, so now we are moving down the hierarchy and I will focus on the Region Server's components and their functions. Later, I will discuss the mechanisms of searching, reading, and writing, and we will see how all these components work together.
The below image shows the components of a Region Server; I will discuss them one by one.
A Region Server maintains various regions running on top of HDFS. The components of a Region Server are:
- WAL (Write Ahead Log): a file on the distributed file system that records new data before it is permanently stored, so it can be replayed to recover data after a failure.
- Block Cache: the read cache, which keeps frequently read data in memory; the least recently used data is evicted when the cache is full.
- MemStore: the write cache, which holds new data that has not yet been written to disk, kept in sorted order; there is one MemStore per column family per region.
- HFiles: the actual files on HDFS that store the sorted key-values on disk.
Now that we know the major and minor components of the HBase architecture, I will explain the mechanisms and how the components collaborate in them. Whether it's a read or a write, we first need to find where to read from or where to write to. So, let's understand this search process, as it is one of the mechanisms that makes HBase so popular.
As you know, ZooKeeper stores the location of the META table. Whenever a client comes to HBase with a read or write request, the following operations occur:
Step 1: The client asks ZooKeeper for the location of the META table and learns which Region Server hosts it.
Step 2: The client queries the META table to find the Region Server responsible for the row key it wants to access.
Step 3: The client caches this information and then sends its read or write request directly to that Region Server.
For future references, the client caches the location of the META table along with the Region Server of every row key it has already read. The client will not consult the META table again unless there is a miss because a region has been shifted or moved; in that case, it queries the META server again and updates its cache.
Since the client does not have to retrieve the Region Server location from the META server every time, this saves time and makes the search process faster. Now, let me tell you how writing takes place in HBase: which components are involved, and how.
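This lookup and caching is handled transparently by the client library, but it can be observed. A small hedged sketch, again assuming the hypothetical `drivers` table and the imports from the earlier sketches:

```java
// Additional imports assumed:
// import org.apache.hadoop.hbase.HRegionLocation;
// import org.apache.hadoop.hbase.client.RegionLocator;
try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
     RegionLocator locator = connection.getRegionLocator(TableName.valueOf("drivers"))) {
    // Resolving a row key to its region triggers the META lookup (or a cache hit).
    HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("1"));
    System.out.println("Region: " + location.getRegion().getRegionNameAsString());
    System.out.println("Served by: " + location.getHostname() + ":" + location.getPort());
}
```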
The below image explains the write mechanism in HBase.
The write mechanism goes through the following process sequentially (refer to the above image):
Step 1: Whenever the client has a write request, the client writes the data to the WAL (Write Ahead Log).
Step 2: Once data is written to the WAL, then it is copied to the MemStore.
Step 3: Once the data is placed in MemStore, then the client receives the acknowledgment.
Step 4: When the MemStore reaches its threshold, it flushes (commits) the data into an HFile.
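In client code, all four steps hide behind a single `table.put(...)` call. A hedged sketch mapping the steps onto the API (same hypothetical table, with `table` and `admin` obtained from a Connection as in the earlier sketches); `hbase.hregion.memstore.flush.size` is a real server-side setting, 128 MB by default in recent versions:

```java
// Additional import assumed: import org.apache.hadoop.hbase.client.Durability;
Put put = new Put(Bytes.toBytes("2"));
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Vin Diesel"));
put.setDurability(Durability.SYNC_WAL); // Step 1: sync the edit to the WAL before acknowledging
table.put(put);                         // Steps 2-3: MemStore is updated, then the client is acked
// Step 4 normally fires when the MemStore crosses hbase.hregion.memstore.flush.size;
// for testing, the flush to an HFile can also be forced explicitly:
admin.flush(TableName.valueOf("drivers"));
```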
Now, let us take a deeper dive and understand how the MemStore contributes to the write process and what its functions are.
As I have mentioned several times, the HFile is the main persistent storage in the HBase architecture: ultimately, all data is committed to HFiles, HBase's permanent storage. Let us look at the properties of HFiles that make searching fast during reads and writes: data is stored as sorted key-value pairs, a multi-layered index lets HBase seek to the right block without reading the whole file, and bloom filters help skip files that cannot contain a given row.
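These HFile search aids are configured per column family. A hedged sketch with the HBase 2.x Admin API, on the same hypothetical table as before:

```java
// Additional imports assumed:
// import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
// import org.apache.hadoop.hbase.regionserver.BloomType;
ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
    .newBuilder(Bytes.toBytes("personal"))
    .setBloomFilterType(BloomType.ROW) // lets reads skip HFiles that cannot contain the row key
    .setBlocksize(64 * 1024)           // HFile block size; the block index points at these blocks
    .build();
admin.modifyColumnFamily(TableName.valueOf("drivers"), cf);
```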
Now that we know the write mechanism and the role the various components play in making writes and searches faster, I will explain how the read mechanism works inside the HBase architecture. Then we will move on to the mechanisms that increase HBase performance: compaction, region split, and recovery.
As discussed in the search mechanism, the client first retrieves the location of the Region Server from the .META server if it does not already have it in its cache. The read then goes through the following sequential steps:
Step 1: The Region Server checks the Block Cache (the read cache) for the requested cells.
Step 2: If they are not found there, it checks the MemStore for data that has been written recently but not yet flushed.
Step 3: If the data is still not found, the Region Server loads it from the HFiles, using bloom filters and block indexes to limit the number of files and blocks it has to scan.
Step 4: Data read from HFiles is placed in the Block Cache, so subsequent reads are served from memory.
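On the client side, this whole path again sits behind one call. A hedged Scan sketch over the hypothetical `drivers` table, reusing the earlier `table` handle; `setCaching` controls how many rows each RPC round trip fetches:

```java
// Additional imports assumed:
// import org.apache.hadoop.hbase.client.ResultScanner;
// import org.apache.hadoop.hbase.client.Scan;
Scan scan = new Scan()
    .withStartRow(Bytes.toBytes("1"))
    .withStopRow(Bytes.toBytes("9"))
    .addFamily(Bytes.toBytes("personal"))
    .setCaching(100); // rows transferred per RPC round trip
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()) + " => " + Bytes.toString(
            r.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
    }
}
```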
So far, I have discussed the search, read, and write mechanisms of HBase. Now we will look at the mechanisms that make search, read, and write fast in HBase, starting with compaction.
HBase combines HFiles to reduce storage and to reduce the number of disk seeks needed for a read. This process is called compaction: it picks some HFiles from a region and merges them. There are two types, as you can see in the above image: minor compaction merges a few smaller HFiles into one larger HFile, while major compaction merges all the HFiles of a store into a single HFile, discarding deleted and expired cells in the process.
During this process, disk I/O and network traffic might get congested, because existing data is re-read and re-written; this rewriting of data is known as write amplification. So, compaction is generally scheduled during low-peak-load timings.
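Minor compactions are triggered automatically by the Region Server; a major compaction can also be requested explicitly, for example from an off-peak maintenance job. A hedged Admin sketch on the hypothetical table, reusing the `admin` handle from earlier:

```java
// Additional import assumed: import org.apache.hadoop.hbase.client.CompactionState;
admin.majorCompact(TableName.valueOf("drivers")); // the request is asynchronous
// Poll the state (MAJOR, MINOR, MAJOR_AND_MINOR, or NONE) to see when it finishes:
CompactionState state = admin.getCompactionState(TableName.valueOf("drivers"));
System.out.println("Compaction state: " + state);
```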
Another performance-optimization process I will discuss is region split, which is very important for load balancing.
The below figure illustrates the Region Split mechanism.
Whenever a region becomes too large, it is divided into two child regions, as shown in the above figure, each holding roughly half of the parent region's data. The split is then reported to the HMaster. The child regions are handled by the same Region Server until the HMaster reassigns them to other Region Servers for load balancing.
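Under the default size-based split policies, a split triggers automatically once a region grows beyond `hbase.hregion.max.filesize` (a real setting, 10 GB by default in recent versions), but a split can also be requested manually. A hedged sketch, with the split key purely hypothetical:

```java
admin.split(TableName.valueOf("drivers"));                     // let HBase choose the split point
admin.split(TableName.valueOf("drivers"), Bytes.toBytes("m")); // or supply an explicit split key
```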
Last but not least, I will explain how HBase recovers data after a failure. As we know, failure recovery is a very important feature of HBase, so let us see how it works: when a Region Server fails, ZooKeeper notices the lost heartbeat and notifies the HMaster; the HMaster reassigns the dead server's regions to other Region Servers; and the failed server's WAL is split and replayed, so that each region's lost MemStore contents are rebuilt before the region comes back online.
I hope this blog has helped you understand the HBase Data Model and HBase Architecture. Now you can relate the features of HBase (which I explained in my previous HBase Tutorial blog) to the HBase architecture and understand how it works internally. Now that you know the theoretical part of HBase, you should move to the practical part. Keeping this in mind, the next blog of our Hadoop Tutorial Series will explain a sample HBase POC.
Now that you have understood the HBase architecture, check out the Hadoop training in Bangalore by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become experts in HDFS, YARN, MapReduce, Pig, Hive, HBase, Oozie, Flume, and Sqoop, using real-time use cases from the Retail, Social Media, Aviation, Tourism, and Finance domains.
Got a question for us? Please mention it in the comments section and we will get back to you.
Original article source at: https://www.edureka.co/
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.
The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.
This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.
As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).
Lots of people have increasing volumes of data and are trying to run data management programs to better sort it. Interestingly, people’s problems are pretty much the same throughout different sectors of any industry, and data management helps them configure solutions.
The fundamentals of enterprise data management (EDM), which one uses to tackle these kinds of initiatives, are the same whether one is in the health sector, a telco, a travel company, or a government agency. Therefore, the fundamental practices one needs to follow to manage data are similar from one industry to another.
For example, suppose you're about to set off and design a program of work. It may be an integration platform project or a big data warehouse project; either way, the principles for designing that program are pretty much the same regardless of the actual details of the project.
The COVID-19 pandemic disrupted supply chains and brought economies around the world to a standstill. In turn, businesses need access to accurate, timely data more than ever before. As a result, the demand for data analytics is skyrocketing as businesses try to navigate an uncertain future. However, the sudden surge in demand comes with its own set of challenges.
Here is how the COVID-19 pandemic is affecting the data industry and how enterprises can prepare for the data challenges to come in 2021 and beyond.