Hadoop

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.
Oral  Brekke

1675807800

Difference between: MapReduce vs Spark

What is MapReduce in big data:

MapReduce is a programming model for processing large data sets in parallel across a cluster of computers. It is a key technology for handling big data. The model consists of two key functions: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from the Map as input and aggregates the tuples into a smaller set of tuples. The combination of these two functions allows for the efficient processing of large amounts of data by dividing the work into smaller, more manageable chunks.
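To make the two functions concrete, here is a minimal single-machine sketch in plain Java. It is not Hadoop code: the class name MiniMapReduce, the "city,temperature" record format and the sample values are all assumptions invented for this illustration. The map step turns each record into a (key, value) tuple, and the reduce step collapses all values that share a key into one value (here, the maximum).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine sketch of the MapReduce model; not Hadoop code.
final class MiniMapReduce {

    // Map: turn one input record ("city,temperature") into a (key, value) tuple.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Reduce: collapse all values that share a key into one value (the maximum).
    static int reduce(String key, List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Driver: map every record, group the tuples by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> tuple = map(line);
            grouped.computeIfAbsent(tuple.getKey(), k -> new ArrayList<>()).add(tuple.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, invented for this illustration.
        List<String> lines = List.of("Oslo,3", "Cairo,31", "Oslo,7", "Cairo,28");
        System.out.println(run(lines)); // prints {Cairo=31, Oslo=7}
    }
}
```

In a real cluster the map calls run in parallel on different machines and the grouping step is done by the framework, but the shape of the two functions stays the same.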

Is there any point of learning MapReduce, then?

Definitely, learning MapReduce is worth it if you’re interested in big data processing or work in data-intensive fields. MapReduce is a fundamental concept that gives you a basic understanding of how to process and analyze large data sets in a distributed environment. The principles of MapReduce still play a crucial role in modern big data processing frameworks, such as Apache Hadoop and Apache Spark. Understanding MapReduce provides a solid foundation for learning these technologies. Also, many organizations still use MapReduce for processing large data sets, making it a valuable skill to have in the job market.

Example:

Let’s understand this with a simple example:

Imagine we have a large dataset of words and we want to count the frequency of each word. Here’s how we could do it in MapReduce:

Map:

  • The map function takes each line of the input dataset and splits it into words.
  • For each word, the map function outputs a tuple (word, 1) indicating that the word has been found once.

Reduce:

  • The reduce function takes all the tuples with the same word and adds up the values (counts) for each word.
  • The reduce function outputs a tuple (word, count) for each unique word in the input dataset.

import java.util.StringTokenizer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  val one = new IntWritable(1)
  val word = new Text()

  override def map(key: Object, value: Text, context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    val itr = new StringTokenizer(value.toString)
    while (itr.hasMoreTokens) {
      word.set(itr.nextToken)
      context.write(word, one)
    }
  }
}

class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  val result = new IntWritable

  override def reduce(key: Text, values: java.lang.Iterable[IntWritable], context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val valuesIter = values.iterator
    while (valuesIter.hasNext) {
      sum += valuesIter.next.get
    }
    result.set(sum)
    context.write(key, result)
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration
    val job = Job.getInstance(conf, "word count")
    job.setJarByClass(this.getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[IntSumReducer])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

This code defines a MapReduce job that splits each line of the input into words using the TokenizerMapper class, maps each word to a tuple (word, 1) and then reduces the tuples to count the frequency of each word using the IntSumReducer class. The job is configured using a Job object and the input and output paths are specified using FileInputFormat and FileOutputFormat. The job is then executed by calling waitForCompletion.

And here’s how you could perform the same operation in Apache Spark:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Read the input file as an RDD of lines
    val textFile = sc.textFile("<input_file>.txt")
    val counts = textFile.flatMap(line => line.split(" ")) // split each line into words
      .map(word => (word, 1))                               // pair each word with a count of 1
      .reduceByKey(_ + _)                                   // sum the counts per word
    // collect() brings the (small) result to the driver so it prints there, not on the executors
    counts.collect().foreach(println)
    sc.stop()
  }
}

This code sets up a SparkConf and SparkContext, reads in the input data using textFile, splits each line into words using flatMap, maps each word to a tuple (word, 1) using map, and reduces the tuples to count the frequency of each word using reduceByKey. The result is then printed using foreach.
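If you do not have a Spark installation at hand, the shape of this pipeline can be tried locally. The sketch below is only an analogy built on java.util.stream, not the Spark API; the class name LocalWordCount and the sample lines are assumptions made for illustration. The comments point out which stream call plays the role of which Spark transformation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Local analogy of the Spark pipeline using java.util.stream; this is not the Spark API.
final class LocalWordCount {

    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // plays the role of flatMap(line => line.split(" ")): one stream of words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // plays the role of map(word => (word, 1)) followed by reduceByKey(_ + _):
                // every word starts at 1 and counts for the same word are summed
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Hypothetical input lines, invented for this illustration.
        System.out.println(count(List.of("to be or not", "to be"))); // prints {be=2, not=1, or=1, to=2}
    }
}
```

The difference that matters in practice is that Spark distributes each of these steps across a cluster and keeps intermediate data in memory, while the stream version runs in a single JVM.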

Conclusion:

MapReduce is a programming paradigm for processing large datasets in a distributed environment. The MapReduce process consists of two main phases: the map phase and the reduce phase. In the map phase, data is transformed into intermediate key-value pairs. In the reduce phase, the intermediate results are aggregated to produce the final output. Spark is a popular alternative to MapReduce. It provides a high-level API and in-memory processing that can make big data processing faster and easier. Whether to choose MapReduce or Spark depends on the specific needs of the task and the resources available.

Original article source at: https://blog.knoldus.com/

#hadoop #spark 

Riley Lambert

1675669635

Spark Tutorial for Beginners

Explore Spark in depth and get a strong foundation in Spark. You'll learn: Why do we need Spark when we have Hadoop? What is the need for RDD? How is Spark faster than Hadoop? How does Spark achieve the speed and efficiency it claims? How does memory get managed in Spark? How does fault tolerance work in Spark? And more.

Most courses and other online resources, including Spark's documentation, do a poor job of helping students understand the foundational concepts. They explain what Spark is, what an RDD is, what "this" is and what "that" is, but students are most interested in understanding the core fundamentals and, more importantly, in answering questions like:

  • Why do we need Spark when we have Hadoop?
  • What is the need for RDD?
  • How is Spark faster than Hadoop?
  • How does Spark achieve the speed and efficiency it claims?
  • How does memory get managed in Spark?
  • How does fault tolerance work in Spark?

and that is exactly what you will learn in this Spark Starter Kit course. The aim of this course is to give you a strong foundation in Spark.

What you’ll learn

  • Learn about the similarities and differences between Spark and Hadoop.
  • Explore the challenges Spark tries to address, which will give you a good idea of the need for Spark.
  • Learn “How is Spark faster than Hadoop?” so that you understand the reasons behind Spark’s performance and efficiency.
  • Before we talk about what an RDD is, we explain in detail the need for something like RDD.
  • You will get a strong foundation in understanding RDDs in depth, and then we take a step further to point out and clarify some of the common misconceptions about RDDs among new Spark learners.
  • You will understand the types of dependencies between RDDs and, more importantly, why dependencies are important.
  • We will walk you through, step by step, how the program we write gets translated into actual execution behind the scenes in a Spark cluster.
  • You will get a very good understanding of some of the key concepts behind Spark’s execution engine and the reasons why it is efficient.
  • Master fault tolerance by simulating a fault situation and examining how Spark recovers from it.
  • You will learn how memory and the contents in memory are managed by Spark.
  • Understand the need for a new programming language like Scala.
  • Examine object oriented programming vs. functional programming.
  • Explore Scala's features and functions.

Are there any course requirements or prerequisites?

  • Basic Hadoop concepts.

Who this course is for:

  • Anyone who is interested in distributed systems and computing and big data related technologies.

#spark #hadoop #bigdata

Justen  Hintz

1674093222

Big Data & Hadoop for Beginners - Full Course in 12 Hours

This Big Data & Hadoop full course will help you understand and learn Hadoop concepts in detail. You'll learn: Introduction to Big Data, Hadoop Fundamentals, HDFS, MapReduce, Sqoop, Flume, Pig, Hive, NoSQL-HBase, Oozie, Hadoop Projects, Career in Big Data Domain, Big Data Hadoop Interview Q and A

Big Data & Hadoop Full Course In 12 Hours | BigData Hadoop Tutorial For Beginners 

This Edureka Big Data & Hadoop Full Course video will help you understand and learn Hadoop concepts in detail. This Big Data & Hadoop Tutorial is ideal for both beginners and professionals who want to master the Hadoop Ecosystem. Below are the topics covered in this Big Data Full Course:

  • Introduction to Big Data 
  • Hadoop Fundamentals
  • HDFS
  • MapReduce
  • Sqoop
  • Flume
  • Pig
  • Hive 
  • NoSQL-HBase
  • Oozie
  • Hadoop Projects
  • Career in Big Data Domain 
  • Big Data Hadoop Interview Q and A

#bigdata #hadoop #pig #hive #nosql 

Monty  Boehm

1669626660

Reduce Side Join in Hadoop MapReduce

MapReduce Example: Reduce Side Join in Hadoop MapReduce

Introduction:

In this blog, I am going to explain how a reduce side join is performed in Hadoop MapReduce using a MapReduce example. Here, I am assuming that you are already familiar with the MapReduce framework and know how to write a basic MapReduce program. In case you don’t, I would suggest you go through my previous blog on MapReduce Tutorial so that you can grasp the concepts discussed here without difficulty. The topics discussed in this blog are as follows:

  • What is a Join?
  • Joins in MapReduce
  • What is a Reduce side join?
  • MapReduce Example on Reduce side join
  • Conclusion

What is a Join?

The join operation is used to combine two or more database tables based on foreign keys. In general, companies maintain separate tables for the customer and the transaction records in their database. And, many times these companies need to generate analytic reports using the data present in such separate tables. Therefore, they perform a join operation on these separate tables using a common column (foreign key), like customer id, etc., to generate a combined table. Then, they analyze this combined table to get the desired analytic reports.

Joins in MapReduce

Just like SQL join, we can also perform join operations in MapReduce on different data sets. There are two types of join operations in MapReduce:

  • Map Side Join: As the name implies, the join operation is performed in the map phase itself. Therefore, in the map side join, the mapper performs the join and it is mandatory that the input to each map is partitioned and sorted according to the keys.

The map side join is covered in a separate blog with an example, which explains how the map side join works and what its advantages are.

  • Reduce Side Join: As the name suggests, in the reduce side join, the reducer is responsible for performing the join operation. It is comparatively simpler and easier to implement than the map side join, because the sorting and shuffling phase sends the values having identical keys to the same reducer and therefore, by default, the data is organized for us.
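To see why the map side join insists on partitioned and sorted input, consider the following plain-Java sketch of a sorted-merge join. It is an illustration only and not Hadoop's own map side join machinery; the class name SortedMergeJoin, the two-column row layout and the assumption that each key appears once on the left side are choices made for this example. Because both inputs are sorted by key, a single forward pass over each is enough and nothing has to be shuffled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of a sorted-merge join; not Hadoop's map side join implementation.
final class SortedMergeJoin {

    // Joins two lists of {key, value} rows that are BOTH already sorted by key.
    // Assumes keys on the left side are unique and that all keys have the same
    // length, so that string comparison matches numeric order.
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++; // the left key has no further matches on the right
            } else if (cmp > 0) {
                j++; // the right key has no partner on the left
            } else {
                out.add(left.get(i)[0] + "," + left.get(i)[1] + "," + right.get(j)[1]);
                j++; // stay on the same left row: more right rows may match it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical rows, invented for this illustration.
        List<String[]> customers = List.of(
                new String[] {"4000001", "Kristina"},
                new String[] {"4000002", "Paige"});
        List<String[]> transactions = List.of(
                new String[] {"4000001", "40.33"},
                new String[] {"4000001", "47.05"},
                new String[] {"4000002", "198.44"});
        join(customers, transactions).forEach(System.out::println);
    }
}
```

If either input were unsorted, the single pass would miss matches, which is the reason the reduce side join, where the framework does the sorting for us, is the easier one to write.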

Now, let us understand the reduce side join in detail.

What is Reduce Side Join?

Reduce Side Join - MapReduce Example: Reduce Side Join - Edureka

As discussed earlier, the reduce side join is a process where the join operation is performed in the reducer phase. Basically, the reduce side join takes place in the following manner:

  • The mapper reads the input data that is to be combined based on a common column or join key.
  • The mapper processes the input and adds a tag to it to distinguish the input belonging to different sources, data sets or databases.
  • The mapper outputs the intermediate key-value pair, where the key is nothing but the join key.
  • After the sorting and shuffling phase, a key and its list of values are generated for the reducer.
  • Now, the reducer joins the values present in the list with the key to give the final aggregated output.

Meanwhile, you may go through this MapReduce Tutorial video, where various MapReduce use cases have been clearly explained and practically demonstrated:

MapReduce Example | MapReduce Programming | Hadoop MapReduce Tutorial | Edureka

 

Now, let us take a MapReduce example to understand the above steps in the reduce side join.

MapReduce Example of Reduce Side Join

Suppose that I have two separate datasets of a sports complex:

  • cust_details: It contains the details of the customer.
  • transaction_details: It contains the transaction record of the customer.

Using these two datasets, I want to know the lifetime value of each customer. In doing so, I will be needing the following things:

  • The person’s name along with the frequency of the visits by that person.
  • The total amount spent by him/her for purchasing the equipment.

Input Database - Reduce Side Join - Edureka

The above figure shows the schema of the two datasets on which we will perform the reduce side join operation. The whole project, containing the source code and the input files for this MapReduce example, can be downloaded from the original article.

Keep the following things in mind while importing the MapReduce example project on reduce side join into Eclipse:

  • The input files are in the input_files directory of the project. Load these into your HDFS.
  • Don’t forget to build the path of the Hadoop Reference Jars (present in the lib directory of the reduce side join project) according to your system or VM.

Now, let us understand what happens inside the map and reduce phases in this MapReduce example on reduce side join:

1. Map Phase:

I will have a separate mapper for each of the two datasets, i.e., one mapper for the cust_details input and another for the transaction_details input.

Mapper for cust_details:

public static class CustsMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is one comma-separated customer record
        String record = value.toString();
        String[] parts = record.split(",");
        // Key: cust ID. Value: the "cust" tag, a tab, then the customer name
        context.write(new Text(parts[0]), new Text("cust\t" + parts[1]));
    }
}
  • I will read the input one tuple at a time.
  • Then, I will split the tuple on commas and fetch the cust ID along with the name of the person.
  • The cust ID will be the key of the key-value pair that my mapper will generate eventually.
  • I will also add a tag “cust” to indicate that this input tuple is of cust_details type.
  • Therefore, my mapper for cust_details will produce the following intermediate key-value pair:

Key – Value pair: [cust ID, cust    name]

Example: [4000001, cust    Kristina], [4000002, cust    Paige], etc.

Mapper for transaction_details:

public static class TxnsMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is one comma-separated transaction record
        String record = value.toString();
        String[] parts = record.split(",");
        // Key: cust ID (third column). Value: the "tnxn" tag, a tab, then the amount
        context.write(new Text(parts[2]), new Text("tnxn\t" + parts[3]));
    }
}
  • The mapper for transaction_details follows steps similar to those of the mapper for cust_details, with a few differences:
    • I will fetch the amount value instead of the name of the person.
    • In this case, I will use “tnxn” as the tag.
  • As before, the cust ID will be the key of the key-value pair that the mapper will generate eventually.
  • Finally, the output of my mapper for transaction_details will be of the following format:

Key, Value Pair: [cust ID, tnxn   amount]

Example: [4000001, tnxn   40.33], [4000002, tnxn   198.44], etc.

2. Sorting and Shuffling Phase

The sorting and shuffling phase will generate an array list of values corresponding to each key. In other words, it will put together all the values corresponding to each unique key in the intermediate key-value pair. The output of sorting and shuffling phase will be of the following format:

Key – list of Values:

  • {cust ID1 – [(cust    name1), (tnxn    amount1), (tnxn    amount2), (tnxn    amount3),…..]}
  • {cust ID2 – [(cust    name2), (tnxn    amount1), (tnxn    amount2), (tnxn    amount3),…..]}
  • ……

Example:

  • {4000001 – [(cust    kristina), (tnxn    40.33), (tnxn    47.05),…]};
  • {4000002 – [(cust    paige), (tnxn    198.44), (tnxn     5.58),…]};
  • ……

Now, the framework will call the reduce() method (reduce(Text key, Iterable<Text> values, Context context)) for each unique join key (cust ID) and the corresponding list of values. Then, the reducer will perform the join operation on the values in that list to calculate the desired output. Therefore, the number of reduce() calls will be equal to the number of unique cust IDs; the number of reducer tasks is a separate setting of the job.
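The grouping performed by the sorting and shuffling phase can be imitated on a single machine. The following plain-Java sketch is not Hadoop code: the class name ShuffleSketch is hypothetical, the tab between tag and payload is an assumption, and the sample pairs reuse the IDs and amounts from the example above. It takes the tagged pairs emitted by both mappers and builds one list of values per cust ID, with the keys in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine imitation of the sorting and shuffling phase; not Hadoop code.
final class ShuffleSketch {

    // Groups {key, taggedValue} pairs by key, keeping the keys in sorted order.
    static Map<String, List<String>> shuffle(List<String[]> mapperOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapperOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Pairs as the two mappers would emit them, interleaved in arbitrary order.
        List<String[]> mapperOutput = List.of(
                new String[] {"4000001", "cust\tKristina"},
                new String[] {"4000002", "cust\tPaige"},
                new String[] {"4000001", "tnxn\t40.33"},
                new String[] {"4000002", "tnxn\t198.44"},
                new String[] {"4000001", "tnxn\t47.05"});
        shuffle(mapperOutput).forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

In Hadoop the keys arrive at the reducer sorted, but the order of the values within one key is not guaranteed unless a secondary sort is configured. That is why the reducer inspects the tag of every value instead of assuming that the "cust" entry comes first.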

Let us now understand how the reducer performs the join operation in this MapReduce example.

3. Reducer Phase

If you remember, the primary goal of this reduce-side join operation was to find out how many times a particular customer has visited the sports complex and the total amount spent by that customer on different sports. Therefore, my final output should be of the following format:

       Key – Value pair: [Name of the customer] (Key) – [total amount, frequency of the visit] (Value)

Reducer Code:

public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = "";
        double total = 0.0;
        int count = 0;
        for (Text t : values) {
            // Split on the tab that the mappers put between tag and payload
            String[] parts = t.toString().split("\t");
            if (parts[0].equals("tnxn")) {
                // Transaction record: one more visit, add the amount
                count++;
                total += Double.parseDouble(parts[1]);
            } else if (parts[0].equals("cust")) {
                // Customer record: remember the name for the output key
                name = parts[1];
            }
        }
        // Value format: total amount, then frequency of visits
        String str = String.format("%.2f %d", total, count);
        context.write(new Text(name), new Text(str));
    }
}

So, the following steps will be taken in each of the reducers to achieve the desired output:

  • In each reducer, I will have a key and a list of values, where the key is nothing but the cust ID. The list of values will have the input from both datasets, i.e., the amount from transaction_details and the name from cust_details.
  • Now, I will loop through the values present in the list of values in the reducer.
  • Then, I will split each value and check whether it is of transaction_details type or cust_details type.
  • If it is of the transaction_details type, I will perform the following steps:
    • I will increase the counter value by one to calculate the frequency of visits by that person.
    • I will cumulatively update the amount value to calculate the total amount spent by that person.
  • On the other hand, if the value is of cust_details type, I will store it in a string variable. Later, I will assign the name as the key in my output key-value pair.
  • Finally, I will write the output key-value pair to the output folder in my HDFS.

Hence, the final output that my reducer will generate is given below:

Kristina, 651.05 8

Paige, 706.97  6

…..

And, this whole process that we did above is called Reduce Side Join in MapReduce. 
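The reducer's logic can also be checked without a cluster. Below is a minimal plain-Java sketch of the reduce step for a single key. The class name JoinReduceSketch is hypothetical, the tab between tag and payload is an assumption, and the two amounts are the ones listed for customer 4000001 in the shuffle example above, so the printed total covers only those two transactions.

```java
import java.util.List;
import java.util.Locale;

// Plain-Java version of the reduce step for one key; not Hadoop code.
final class JoinReduceSketch {

    // Takes the tagged values of ONE cust ID and returns "name<TAB>total count".
    static String reduce(List<String> values) {
        String name = "";
        double total = 0.0;
        int count = 0;
        for (String value : values) {
            String[] parts = value.split("\t");
            if (parts[0].equals("tnxn")) {
                count++;                                // one more visit
                total += Double.parseDouble(parts[1]);  // add the amount spent
            } else if (parts[0].equals("cust")) {
                name = parts[1];                        // remember the customer's name
            }
        }
        // Locale.ROOT keeps the decimal point independent of the machine's locale.
        return String.format(Locale.ROOT, "%s\t%.2f %d", name, total, count);
    }

    public static void main(String[] args) {
        // Prints the name, a tab, then "87.38 2"
        System.out.println(reduce(List.of("cust\tKristina", "tnxn\t40.33", "tnxn\t47.05")));
    }
}
```

Keeping the join logic in a small static method like this makes it easy to unit test before wiring it into a Reducer class.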

Source Code:

The source code for the above MapReduce example of the reduce side join is given below:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 
public class ReduceJoin {

    // Mapper for cust_details: emits (cust ID, "cust<TAB>name")
    public static class CustsMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String[] parts = record.split(",");
            context.write(new Text(parts[0]), new Text("cust\t" + parts[1]));
        }
    }

    // Mapper for transaction_details: emits (cust ID, "tnxn<TAB>amount")
    public static class TxnsMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String[] parts = record.split(",");
            context.write(new Text(parts[2]), new Text("tnxn\t" + parts[3]));
        }
    }

    // Reducer: joins the tagged values of one cust ID into (name, "total count")
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String name = "";
            double total = 0.0;
            int count = 0;
            for (Text t : values) {
                // Split on the tab that the mappers put between tag and payload
                String[] parts = t.toString().split("\t");
                if (parts[0].equals("tnxn")) {
                    count++;
                    total += Double.parseDouble(parts[1]);
                } else if (parts[0].equals("cust")) {
                    name = parts[1];
                }
            }
            // Value format: total amount, then frequency of visits
            String str = String.format("%.2f %d", total, count);
            context.write(new Text(name), new Text(str));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // One mapper class per input dataset
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustsMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, TxnsMapper.class);
        Path outputPath = new Path(args[2]);

        FileOutputFormat.setOutputPath(job, outputPath);
        // Remove any previous output directory so the job does not fail on startup
        outputPath.getFileSystem(conf).delete(outputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run this Program

Finally, the command to run the above MapReduce example program on reduce side join is given below:

hadoop jar reducejoin.jar ReduceJoin /sample/input/cust_details /sample/input/transaction_details /sample/output

Conclusion:

The reduce side join procedure generates a huge amount of network I/O traffic in the sorting and reducer phases, where the values of the same key are brought together. So, if you have a large number of different data sets with millions of values, there is a high chance that you will encounter an OutOfMemory exception, i.e., your RAM fills up and overflows. In my opinion, the advantages of using the reduce side join are:

  • It is very easy to implement, as we are taking advantage of the inbuilt sorting and shuffling algorithm in the MapReduce framework, which combines values of the same key and sends them to the same reducer.
  • In the reduce side join, your input does not need to follow any strict format and therefore, you can perform the join operation on unstructured data as well.

In general, people prefer Apache Hive, which is a part of the Hadoop ecosystem, to perform the join operation. So, if you come from a SQL background, you don’t need to worry about writing the MapReduce Java code for performing a join operation. You can use Hive as an alternative.

Now that you have understood the Reduce Side Join with a MapReduce example, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become expert in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume and Sqoop using real-time use cases on Retail, Social Media, Aviation, Tourism, Finance domain.

Original article source at: https://www.edureka.co/

#hadoop #map 

Best 10 Reasons to Learn Hadoop

The Big Data Hadoop market is undergoing gigantic evolution and is showing no signs of deceleration. Big Data & Hadoop skills could be the difference between your current career & your dream career. I would say, now is the right time to learn Hadoop.

The key reasons to learn Hadoop, covered below in descending order, are:

  1. Hadoop as a Gateway to Big Data Technologies
  2. Hadoop as a Disruptive Technology
  3. Various Big Data and Hadoop Job Profiles
  4. Caters to Different Professional Backgrounds
  5. Big Data & Hadoop Trend
  6. Hadoop for Fat Paycheck
  7. Scarcity of Big Data Hadoop Professionals
  8. Increasing Demand for Hadoop Professionals
  9. Big Data Revolutionizing Various Domains
  10. Big Data Analytics: A Top Priority in a Lot of Organizations

10. Big Data Analytics: A Top Priority in a Lot of Organizations

Big Data has been a big game changer in most industries over the last few years. In fact, Big Data has been adopted by a vast number of organisations belonging to various domains. By examining large data sets using Big Data tools like Hadoop and Spark, they are able to identify hidden patterns and find unknown correlations, market trends, customer preferences and other useful business information.

  • Big Data adoption reached 53% in 2017, up from 17% in 2015, with telecom and financial services leading early adopters. (Forbes)

The primary goal of Big Data Analytics is to help companies make better and more effective business strategies by analysing large data volumes. The data sources include web server logs, Internet click-stream data, social media content and activity reports, text from customer emails, mobile phone call details, and machine data captured by sensors connected to the Internet of Things (IoT).

Big Data analytics can lead to more effective marketing, new revenue opportunities, better customer services, improved operational efficiency, competitive advantages over rival organizations and other business benefits.

  • Commercial purchases of Big Data and Business Analytics (BDA) related hardware, software, and services are expected to maintain a compound annual growth rate (CAGR) of 11.9% through 2020, when revenues will be more than $210 billion. (IDC)

Growth of Unstructured Data - Learn Hadoop - Edureka

 

The above image clearly shows the tremendous increase in unstructured data (images, mail, audio, etc.), which can only be analysed by adopting Big Data technologies such as Hadoop, Spark, Hive, etc. This has led to a serious skill gap with respect to available Big Data professionals in the current IT market. Hence, it is not at all surprising to see a lot of buzz in the market around learning Hadoop.

9. Big Data Revolutionizing Various Domains

Big Data is leaving no stone unturned nowadays. What I mean by this is that Big Data is present in each and every domain, allowing organisations to leverage its capability for improving their business value. The most common domains that are rigorously using Big Data and Hadoop are healthcare, retail, government, banking, media & entertainment, transportation, natural resources and so on, as shown in the image below:

Big Data in Different Domains - Learn Hadoop - Edureka

Hence, you can build your career in any of these domains by learning Hadoop. Further, you can go through this Big Data Applications blog to understand how Big Data is revolutionizing different domains.

8. Increasing Demand for Hadoop Professionals

“Hadoop Market is expected to reach $99.31B by 2022 at a CAGR of 42.1%” – Forbes

The demand for Hadoop can be directly attributed to the fact that it is one of the most prominent technologies that can handle Big Data, and is quite cost effective & scalable. With the swift increase in Big Data sources & amount of data, Hadoop has become more of a foundation for other Big Data technologies evolving around it, such as Spark, Hive, etc. This is generating a large number of Hadoop jobs at a very steep rate.

Rise in Demand of Hadoop - Learn Hadoop - Edureka

Increase in demand of YARN (Source: Forbes)

7. Scarcity of Big Data Hadoop Professionals

As we discussed, Hadoop job opportunities are growing at a high pace. But most of these job roles are still vacant due to a huge skill gap still persisting in the market. Such scarcity of the proper skill set for Big Data and Hadoop technology has created a vast gap between supply and demand.

Hence, now is the right time for you to step ahead and start your journey towards building a bright career in Big Data & Hadoop. In fact, the famous saying – “Now or Never” is an apt description that explains the current opportunities in the Big Data and Hadoop market. 

6. Learn Hadoop for Fat Paycheck


Salary of Hadoop Professionals - Learn Hadoop - Edureka

One of the captivating reasons to learn Hadoop is the fat paycheck. The scarcity of Hadoop professionals is one of the major reasons behind their high salary. According to payscale.com, the salary of Hadoop professionals varies from $93K to $127K annually, based on different job roles.

5. Big Data & Hadoop Trend

According to Google Trends, interest in Hadoop has been stable over the past 5 years. It is also worth noticing that the trends for Big Data & Hadoop are tightly coupled with each other. Big Data refers to the problems associated with the storage, curation, processing and analytics of data. Hence, it is quite evident that all companies need to tackle the big data problem one way or another to make better business decisions.

Big Data Hadoop Trend - Learn Hadoop - Edureka

 

Hence, one can clearly deduce that Big Data & Hadoop have a promising future and are not going to vanish into thin air, at least in the next 20 years.

4. Caters to Different Professional Backgrounds

The Hadoop Ecosystem has various tools which can be leveraged by professionals from different backgrounds. If you are from a programming background, you can write MapReduce code in different languages like Java, Python, etc. If you are used to scripting languages, Apache Pig is the best fit for you. Alternatively, if you are comfortable with SQL, then you can go ahead with Apache Hive or Apache Drill.

The market for Big Data analytics is growing across the world and this strong growth pattern translates into a great opportunity for all the IT Professionals. It is best suited for:

  • Software Developers, Project Managers
  • Software Architects
  • ETL and Data Warehousing Professionals
  • Analytics & Business Intelligence Professionals
  • DBAs and DB professionals
  • Senior IT Professionals
  • Testing professionals
  • Mainframe professionals
  • Graduates looking to build a career in Big Data Field

3. Various Big Data and Hadoop Job Profiles

There are various job profiles in Big Data & Hadoop. You can pursue any one of them based on your professional background. Some of the well-known job roles in Hadoop are:

  • Hadoop Developer
  • Hadoop Admin
  • Data Analyst
  • Big Data Architect
  • Software Engineer
  • Senior Software Engineer
  • Data Engineer
  • Data Scientist

2. Hadoop as a Disruptive Technology

Hadoop has proven itself as a better alternative to traditional data warehousing systems in terms of cost, scalability, storage and performance over a variety of data sources. In fact, Hadoop has revolutionized the way data is processed nowadays and has brought a drastic change in the field of data analytics. Besides this, the Hadoop ecosystem is going through continuous experimentation & enhancements. In a nutshell, Big Data & Hadoop are taking the world by storm & if you don’t want to be left behind, you have to ride with the tide.

1. Hadoop as a Gateway to Big Data Technologies

Hadoop Ecosystem - Learn Hadoop - Edureka

Hadoop has become a de facto standard for Big Data analytics and has been adopted by a large number of companies. Typically, besides Hadoop, a Big Data solution strategy involves multiple technologies in a tailored manner. So, it is essential not only to learn Hadoop but also to become an expert in other Big Data technologies falling under the Hadoop ecosystem. This will help you further boost your Big Data career and grab elite roles like Big Data Architect, Data Scientist, etc. But for all of this you need to learn Hadoop, as it is the stepping stone for moving into the Big Data domain.

I hope that you would have found this blog informative. If you want to learn Hadoop, you can start with this Big Data & Hadoop blog series. 

Now that you have understood the reasons to learn Hadoop, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become expert in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume and Sqoop using real-time use cases on Retail, Social Media, Aviation, Tourism, Finance domain.

Got a question for us? Please mention it in the comments section and we will get back to you.

Top 10 Reasons To Learn Hadoop | Hadoop Certification | Edureka

Let us cover them in descending order.

Original article source at: https://www.edureka.co/

#hadoop #bigdata 

Monty Boehm

1669431840

Best 4 Ways To Use R And Hadoop Together

Hadoop is a disruptive Java-based programming framework that supports the processing of large data sets in a distributed computing environment, while R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. In the areas of interactive data analysis, general purpose statistics and predictive modelling, R has gained massive popularity due to its classification, clustering and ranking capabilities.

Hadoop and R complement each other quite well in terms of visualization and analytics of big data.

Using R and Hadoop

There are four different ways of using Hadoop and R together:

1. RHadoop

RHadoop is a collection of three R packages: rmr, rhdfs and rhbase. rmr package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file management in R and rhbase provides HBase database management from within R. Each of these primary packages can be used to analyze and manage Hadoop framework data better.

2. ORCH

ORCH stands for Oracle R Connector for Hadoop. It is a collection of R packages that provide the relevant interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables. Additionally, ORCH also provides predictive analytic techniques that can be applied to data in HDFS files.


3. RHIPE

RHIPE is an R package which provides an API to use Hadoop. RHIPE stands for R and Hadoop Integrated Programming Environment, and is essentially RHadoop with a different API.

4. Hadoop streaming

Hadoop Streaming is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. Using the streaming system, one can develop working Hadoop jobs with little or no Java, for example as two scripts (a mapper and a reducer) that work in tandem.
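To make the mapper/reducer pairing concrete, here is a minimal sketch in Python of a word count written in the streaming style. It is an illustration under assumptions, not the article's own code: in a real streaming job the mapper and reducer would be separate executables reading standard input and writing standard output, whereas here they are plain functions, and the hypothetical helper `run_locally` stands in for Hadoop's sort-and-shuffle step.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines sharing a key (input must be sorted)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(text):
    """Hypothetical helper: simulate map -> sort (shuffle) -> reduce on one machine."""
    mapped = mapper(io.StringIO(text))
    return list(reducer(sorted(mapped)))


if __name__ == "__main__":
    for line in run_locally("R and Hadoop\nHadoop and R\n"):
        print(line)
```

The reducer relies on its input being sorted by key, which is why the local simulation sorts the mapper output before reducing.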

The combination of R and Hadoop is emerging as a must-have toolkit for people working with statistics and large data sets. However, certain Hadoop enthusiasts have raised a red flag while dealing with extremely large Big Data fragments. They claim that the advantage of R is not its syntax but the exhaustive library of primitives for visualization and statistics. These libraries are fundamentally non-distributed, making data retrieval a time-consuming affair. This is an inherent flaw with R, and if you choose to overlook it, R and Hadoop in tandem can still work wonders.


Original article source at: https://www.edureka.co/

#r #hadoop 

Gordon Taylor

1669285021

How to Transfer files from Windows to Cloudera Demo VM

This blog describes a step-by-step procedure to transfer files from Windows to the Cloudera Demo VM. To achieve this task, you need FTP (File Transfer Protocol) client software such as FileZilla or WinSCP. In this blog, we will use FileZilla to demonstrate the whole procedure.

Tutorial for Transferring files from Windows to Cloudera Demo VM

Step 1: Download and Install FileZilla

  • Download and install FileZilla for Windows from this link.
  • Open FileZilla. The following screen will appear:

 

Step 2: Establish Connection with Cloudera

To establish the connection we need  four parameters:

  • Hostname: In this field, give the IP address of the Cloudera VM.
  • Username: This is the username of the Cloudera Demo VM. By default it is ‘cloudera’.
  • Password: By default, the password for the Cloudera Demo VM is ‘cloudera’.
  • Port Number: We need to mention the port number to access the file transfer service on the Cloudera Demo VM. As it is an SSH connection, use the port number ‘22’.

Find the IP address of the host in Cloudera Demo VM. Open a terminal in Cloudera and execute the following command: ifconfig

It will display the host IP address as shown in the following image:


The circled number in the image is the IP address of your Cloudera Host.

Now, we have all the four values that need to be specified for the Windows and FileZilla connection.
The values are:

Host: 192.168.126.174
Username: cloudera
Password: cloudera
Port Number: 22

Update these parameters in the appropriate fields of FileZilla and click on Quick Connect as shown in the above image.

Once you click on Quick connect a message will pop up as shown in the below image.


Click OK.

You will receive a message informing the successful connection.

Filezilla-3

You can observe in the above figure that the left panel lists the directories and files on the Windows machine, and the right panel of FileZilla lists the directories and files on the Cloudera VM.

Step 3: Transferring the File to Cloudera.

Select the directory on your local system that contains the file(s) you would like to transfer to Cloudera. We will transfer the file “input.txt” present in location ‘D:\sample’ to the Cloudera VM host.

Similarly, select the location/directory of Cloudera to which you would like to transfer the “input.txt” file. We will transfer the file to Desktop of Cloudera host.

Right-click the input.txt file and select the “Upload” option.

Observe that the file input.txt now appears under the Cloudera host Desktop, as shown in the following image. You will also receive the success status, as highlighted in the image.

Congratulations! You have successfully transferred the files from your Windows PC to Cloudera Demo VM host.

Got a question for us? Please mention them in the comments section.

Original article source at: https://www.edureka.co/

#windows #files #hadoop 

Monty Boehm

1669203060

How to Create Your First Apache Pig Script

Pig Programming: Create Your First Apache Pig Script

In our Hadoop Tutorial Series, we will now learn how to create an Apache Pig script. Apache Pig scripts are used to execute a set of Apache Pig commands collectively. This helps in reducing the time and effort invested in writing and executing each command manually while doing this in Pig programming. It is also an integral part of the Hadoop course curriculum. This blog is a step by step guide to help you create your first Apache Pig script.

Apache Pig script Execution Modes

Local Mode: In ‘local mode’, you can execute the Pig script on the local file system. In this case, you don’t need to store the data in HDFS; instead, you can work with the data stored in the local file system itself.

MapReduce Mode: In ‘MapReduce mode’, the data needs to be stored in HDFS file system and you can process the data with the help of pig script.

Apache Pig Script in MapReduce Mode

Let us say our task is to read data from a data file and to display the required contents on the terminal as output.

The sample data file contains following data:

Information txt file - Apache Pig Script - Edureka

Save the text file with the name ‘information.txt’

The sample data file contains five columns, FirstName, LastName, MobileNo, City, and Profession, separated by tabs. Our task is to read the content of this file from HDFS and display selected columns of these records.

To process this data using Pig, this file should be present in Apache Hadoop HDFS.

Command: hadoop fs -copyFromLocal /home/edureka/information.txt /edureka

Copy Data into HDFS - Apache Pig Script - Edureka

Step 1: Writing a Pig script

Create and open an Apache Pig script file in an editor (e.g. gedit).

Command: sudo gedit /home/edureka/output.pig

This command will create an ‘output.pig’ file inside the home directory of the edureka user.

Create Pig Latin script - Apache Pig Script - Edureka

Let’s write a few Pig commands in the output.pig file.

A = LOAD '/edureka/information.txt' USING PigStorage('\t') AS (FName:chararray, LName:chararray, MobileNo:chararray, City:chararray, Profession:chararray);
 
B = FOREACH A GENERATE FName, MobileNo, Profession;
 
DUMP B;

Save and close the file.

  • The first command loads the file ‘information.txt’ into relation A with the declared schema (FName, LName, MobileNo, City, Profession).
  • The second command projects the required fields (FName, MobileNo, Profession) from A into B.
  • The third command displays the content of B on the terminal/console.
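For readers more familiar with general-purpose languages, the three Pig commands can be approximated in plain Python. This is only a rough sketch of the same load, project, and dump flow on a single machine; the functions `load` and `project` and the two sample records are hypothetical and are not taken from the article's data file.

```python
import csv
import io

# Column names from the schema declared in the Pig LOAD statement.
FIELDS = ["FName", "LName", "MobileNo", "City", "Profession"]

# Hypothetical tab-separated records standing in for information.txt.
SAMPLE = (
    "Asha\tRao\t5550100\tPune\tEngineer\n"
    "Ravi\tIyer\t5550101\tChennai\tAnalyst\n"
)


def load(text, delimiter="\t"):
    """Rough analogue of LOAD with a tab delimiter: one dict per record."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(FIELDS, row)) for row in reader if row]


def project(records, columns):
    """Rough analogue of FOREACH ... GENERATE: keep only the named columns."""
    return [tuple(record[c] for c in columns) for record in records]


if __name__ == "__main__":
    a = load(SAMPLE)
    b = project(a, ["FName", "MobileNo", "Profession"])
    for row in b:  # analogue of DUMP B
        print(row)
```

Unlike this sketch, Pig compiles the script into MapReduce jobs, so the same three steps scale to files spread across HDFS.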

Step 2: Execute the Apache Pig Script

To execute the Pig script in MapReduce mode, run the following command:

Command: pig /home/edureka/output.pig

Executing Pig Script - Apache Pig Script - Edureka

After the execution finishes, review the result. The images below show the results and the intermediate map and reduce functions.

The image below shows that the script executed successfully.

Result 1 - Apache Pig Script - Edureka

The image below shows the result of our script.

Result 3 - Apache Pig Script - Edureka

Congratulations on executing your first Apache Pig script successfully!

Now you know how to create and execute an Apache Pig script. Our next blog in the Hadoop Tutorial Series will cover how to create UDFs (User Defined Functions) in Apache Pig and execute them in MapReduce/HDFS mode.

Now that you have created and executed Apache Pig Script, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become expert in HDFS, Yarn, MapReduce, Pig, Hive, HBase, Oozie, Flume and Sqoop using real-time use cases on Retail, Social Media, Aviation, Tourism, Finance domain.

Got a question for us? Please mention it in the comments section and we will get back to you.

Original article source at: https://www.edureka.co/

#hadoop #script #programming 

Monty Boehm

1669194360

How to Create your First HIVE Script

Apache Hadoop : Create your First HIVE Script

As is the case with scripts in other languages such as SQL, Unix Shell, etc., Hive scripts are used to execute a set of Hive commands collectively. This helps in reducing the time and effort invested in writing and executing each command manually. This blog is a step-by-step guide to writing and executing your first Hive script. Check out this Big Data Course to learn more about Hive scripts and commands in real projects.

Hive supports scripting in version 0.10.0 and above. The Cloudera Distribution for Hadoop (CDH4) quick VM comes with Hive 0.10.0 pre-installed (the CDH3 Demo VM uses Hive 0.9.0 and hence cannot run Hive scripts).

Execute the following steps to create your first Hive Script:

Step 1: Writing a script

Open a terminal in your Cloudera CDH4 distribution and give the below command to create a Hive Script.

command: gedit sample.sql

The Hive script file should be saved with the .sql extension to enable execution.

Edit the file and write a few Hive commands that will be executed using this script.

In this sample script, we will create a table, describe it, load the data into the table and retrieve the data from this table.

Create a table ‘product’ in Hive:

command: create table product (productid int, productname string, price float, category string) row format delimited fields terminated by ',';

Here { productid, productname, price, category} are the columns in the ‘product’ table.

“Fields terminated by ‘,’” indicates that the columns in the input file are separated by the ‘,’ delimiter. You can use other delimiters as well. For example, the records in an input file can be separated by a newline (‘\n’) character.
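To illustrate what the field delimiter means for the data, here is a small Python sketch that splits one comma-delimited record into the four columns of the ‘product’ table and converts each value to its declared type. It is only an analogy for how a delimited row maps onto a schema: the function `parse_record` and the sample records are hypothetical, and the choice to return None for a value that cannot be converted mirrors Hive showing NULL for such fields.

```python
# Column names and types taken from the 'product' table definition.
SCHEMA = [
    ("productid", int),
    ("productname", str),
    ("price", float),
    ("category", str),
]


def parse_record(line, delimiter=","):
    """Split one delimited record and cast each field to its column type."""
    values = line.rstrip("\n").split(delimiter)
    row = {}
    for (name, cast), raw in zip(SCHEMA, values):
        try:
            row[name] = cast(raw)
        except ValueError:
            # A value that does not fit the column type becomes None.
            row[name] = None
    return row


if __name__ == "__main__":
    # Hypothetical records, one per line, fields separated by ','.
    for record in ["1,Laptop,45000.0,Electronics", "x,Pen,10,Stationery"]:
        print(parse_record(record))
```

Changing the `delimiter` argument corresponds to choosing a different character in the “fields terminated by” clause.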

Describe the Table :

command: describe product;

Load the data into the Table:

To load the data into the table, create an input file which contains the records that need to be inserted into the table.

command: sudo gedit input.txt

Create a few records in the input text file as shown in the figure.

Command: load data local inpath '/home/cloudera/input.txt' into table product;

Retrieving the data:

To retrieve the data use select command.

command: select * from product;

The above command will retrieve all the records from the table ‘product’.

The script should look as shown in the following image:

SQL Query - Apache Hadoop Hive Script - Edureka

Save the sample.sql file and close the editor. You are now ready to execute your first Hive script.

Step 2: Execute the Hive Script

Execute the hive script using the following command:

Command: hive -f /home/cloudera/sample.sql

While executing the script, make sure that you give the entire path of the script location. If the script is present in the current directory, the file name alone is enough.

The following image shows that all the commands were executed successfully.

example

Congratulations on executing your first Hive script successfully! This Hive script knowledge is necessary to clear Big Data certifications.

Original article source at: https://www.edureka.co/

#hive #hadoop #script 

Monty Boehm

1668586200

Top Hadoop Interview Questions and Answers

Top Answers to Hadoop Interview Questions

Big Data Hadoop professionals are among the highest-paid IT professionals in the world today. In this blog, you will come across a compiled list of the most probable Big Data Hadoop questions that are asked by recruiters during the recruitment process. Check out these popular Big Data Hadoop interview questions.

Basic Interview Questions

1. What do you mean by the term or concept of Big Data?

Big Data means a set or collection of large datasets that keeps on growing exponentially. It is difficult to manage Big Data with traditional data management tools. Examples of Big Data include the amount of data generated by Facebook or Stock Exchange Board of India on a daily basis. There are three types of Big Data:

  • Structured Big Data
  • Unstructured Big Data
  • Semi-structured Big Data

2. What are the characteristics of Big Data?

The characteristics of Big Data are as follows:

  • Volume
  • Variety
  • Velocity
  • Variability

Where,

Volume means the size of the data, as this feature is of utmost importance while handling Big Data solutions. The volume of Big Data is usually high and complex.

Variety refers to the various sources from which data is collected. Basically, it refers to the types of Big Data (structured, unstructured, and semi-structured) and its heterogeneity.

Velocity means how fast or slow the data is getting generated. Basically, Big Data velocity deals with the speed at which the data is generated from business processes, operations, application logs, etc.

Variability, as the name suggests, means how differently the data behaves in different situations or scenarios in a given period of time.

3. What are the various steps involved in deploying a Big Data solution?

Deploying a Big Data solution includes the following steps:

  • Data Ingestion: As a first step, the data is drawn out or extracted from various sources so as to feed it to the system.
  • Data Storage: Once data ingestion is completed, the data is stored in either HDFS or a NoSQL database.
  • Data Processing: In the final step, the data is processed through frameworks and tools such as Spark, MapReduce, Pig, etc.

4. What is the reason behind using Hadoop in Big Data analytics?

Businesses generate a lot of data in a single day, and much of it is unstructured. Analyzing unstructured data is difficult because it renders traditional data management solutions ineffective. Hadoop comes into the picture when the data is complex, large, and especially unstructured. Hadoop is important in Big Data analytics because of its characteristics:

  • Data storage
  • Data processing
  • Collection plus extraction of data

5. What do you understand by fsck in Hadoop?

fsck stands for file system check. In Hadoop, it is an HDFS command that checks for inconsistencies such as missing, corrupt, or under-replicated blocks and reports any that it finds. Unlike the fsck utility of a native file system, it does not correct the errors it detects.

6. Can you explain some of the important features of Hadoop?

Some of the important features of Hadoop are:

Fault Tolerance: Hadoop has a high-level of fault tolerance. To tackle faults, Hadoop, by default, creates three replicas for each block at different nodes. This number can be modified as per the requirements. This helps to recover the data from another node if one node has failed. Hadoop also facilitates automatic recovery of data and node detection.

Open Source: One of the best features of Hadoop is that it is an open-source framework and is available free of cost. Hadoop also allows its users to change the source code as per their requirements.

Distributed Processing: Hadoop stores the data in a distributed manner in HDFS. Distributed processing implies fast data processing. Hadoop also uses MapReduce for the parallel processing of the data.

Reliability: One of the benefits of Hadoop is that the data stored in Hadoop is not affected by any kind of machine failure, which makes Hadoop a reliable tool.

Scalability: Scalability is another important feature of Hadoop. Hadoop’s compatibility with other hardware makes it a preferred tool. You can also easily add new hardware to the nodes in Hadoop.

High Availability: Easy access to the data stored in Hadoop makes it a highly preferred Big Data management solution. Not only this, the data stored in Hadoop can be accessed even if there is a hardware failure as it can be accessed from a different path.
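The fault tolerance feature described above can be illustrated with a toy model in Python. This is a simplified sketch, not HDFS's actual placement policy: the round-robin assignment, the node names, and the helper functions `place_replicas` and `readable_blocks` are all assumptions made for illustration. It shows why three replicas on different nodes keep a block readable after a node fails.

```python
REPLICATION_FACTOR = 3  # the default number of replicas mentioned above


def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]  # hypothetical DataNodes
    placement = place_replicas(["b1", "b2", "b3"], nodes)
    print(placement)
    print(sorted(readable_blocks(placement, failed_nodes={"n1", "n2"})))
```

With a replication factor of 1, the same failure would make some blocks unreadable, which is the trade-off the configurable replica count controls.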

7. What is Hadoop and what are its components?

Apache Hadoop is the solution for dealing with Big Data. Hadoop is an open-source framework that offers several tools and services to store, manage, process, and analyze Big Data. This allows organizations to make significant business decisions in an effective and efficient manner, which was not possible with traditional methods and systems.
There are three main components of Hadoop:

  • HDFS
  • YARN
  • MapReduce

HDFS

It is a system that allows you to distribute the storage of big data across a cluster of computers. It also maintains redundant copies of the data. So, if one of your computers happens to randomly burst into flames or if some technical issues occur, HDFS can recover by using a copy of the data that it had saved automatically, and you won’t even know that anything happened.

YARN

Next in the Hadoop ecosystem is YARN (Yet Another Resource Negotiator). It is the place where the data processing of Hadoop comes into play. YARN is a system that manages the resources on your computing cluster. It is the one that decides who gets to run the tasks, when and what nodes are available for extra work, and which nodes are not available to do so.

MapReduce

MapReduce, the next component of the Hadoop ecosystem, is just a programming model that allows you to process your data across an entire cluster. It basically consists of Mappers and Reducers that are different scripts, which you might write, or different functions you might use when writing a MapReduce program.

8. Explain Hadoop Architecture.

The Hadoop architecture comprises the following:

  • Hadoop Common
  • HDFS
  • MapReduce
  • YARN

Hadoop Common

Hadoop Common is a set of utilities that offers support to the other three components of Hadoop. It is a set of Java libraries and scripts that are required by MapReduce, YARN, and HDFS to run the Hadoop cluster.

HDFS

HDFS stands for Hadoop Distributed File System. It splits data into blocks and distributes them across the cluster. Each block is replicated multiple times to ensure data availability. It has two daemons: NameNode for the master node and DataNode for the slave nodes.

NameNode and DataNode : The NameNode runs on the master server. It manages the Namespace and regulates file access by the client. The DataNode runs on slave nodes. It stores the business data.

MapReduce

It executes tasks in a parallel fashion by distributing the data as small blocks. The two most important tasks that the Hadoop MapReduce carries out are Mapping the tasks and Reducing the tasks.

YARN

It allocates resources which in turn allow different users to execute various applications without worrying about the increased workloads.

9. In what all modes can Hadoop be run?

Hadoop can be run in three modes:


  • Standalone Mode: The default mode of Hadoop, standalone mode uses a local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, there is no custom configuration required for mapred-site.xml, core-site.xml, and hdfs-site.xml files. This mode works much faster when compared to other modes.
  • Pseudo-distributed Mode (Single-node Cluster): In the case of pseudo-distributed mode, you need the configuration for all the three files mentioned above. All daemons are running on one node; thus, both master and slave nodes are the same.
  • Fully distributed mode (Multi-node Cluster): This is the production phase of Hadoop, what it is known for, where data is used and distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as master and slave nodes.

10. Name some of the major organizations globally that use Hadoop?

Some of the major organizations globally that are using Hadoop as a Big Data tool are as follows:

  • Netflix
  • Uber
  • The National Security Agency (NSA) of the United States
  • The Bank of Scotland
  • Twitter

11. What are the real-time industry applications of Hadoop?

Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and distributed computing of large volumes of data. It provides rapid, high-performance, and cost-effective analysis of structured and unstructured data generated on digital platforms and within the organizations. It is used across all departments and sectors today.

Here are some of the instances where Hadoop is used:

  • Managing traffic on streets
  • Streaming processing
  • Content management and archiving emails
  • Processing rat brain neuronal signals using a Hadoop computing cluster
  • Fraud detection and prevention
  • Advertisements targeting platforms are using Hadoop to capture and analyze clickstream, transaction, video, and social media data
  • Managing content, posts, images, and videos on social media platforms
  • Analyzing customer data in real-time for improving business performance
  • Public sector fields such as intelligence, defense, cyber security, and scientific research
  • Getting access to unstructured data such as output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data

12. What is HBase?

Apache HBase is a distributed, open-source, scalable, and multidimensional NoSQL database. HBase is based on Java; it runs on HDFS and offers Google-Bigtable-like abilities and functionalities to Hadoop. Moreover, HBase’s fault-tolerant nature helps in storing large volumes of sparse datasets. HBase achieves low latency and high throughput by offering fast read and write access to large datasets.

13. What is a Combiner?

A combiner is a mini version of a reducer that performs local reduction. It runs on the node where a mapper ran, takes that mapper’s output as its input, and sends its own output to the reducer. This reduces the quantum of data that needs to be sent to the reducers, which improves the efficiency of MapReduce.
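The effect of a combiner is easy to see in a small word-count sketch in Python. The functions `map_words`, `combine`, and `reduce_all` and the two input splits are hypothetical and model the idea on one machine; they are not Hadoop APIs. The sketch counts how many key-value pairs would travel to the reducers with and without local reduction.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum counts per word on the mapper's node, before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_all(pair_lists):
    """Reducer: sum counts per word across the output of all mappers."""
    totals = Counter()
    for pairs in pair_lists:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


# Two hypothetical input splits, each handled by its own mapper.
SPLITS = ["big data big data big", "data big hadoop"]

if __name__ == "__main__":
    mapped = [map_words(split) for split in SPLITS]
    combined = [combine(pairs) for pairs in mapped]
    print("pairs shuffled without combiner:", sum(len(p) for p in mapped))
    print("pairs shuffled with combiner:", sum(len(p) for p in combined))
    print(reduce_all(combined))
```

Both paths give the same final counts because addition is associative and commutative; a combiner is only safe for reductions with that property.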

14. Is it okay to optimize algorithms or codes to make them run faster? If yes, why?

Yes, it is generally recommended to optimize algorithms or code to make them run faster. Optimized code does less redundant work on the same input, and on Big Data workloads even small savings per record add up across the cluster.

15. What is the difference between RDBMS and Hadoop?

Following are some of the differences between RDBMS (Relational Database Management System) and Hadoop based on various factors:

Factor | RDBMS | Hadoop
Data Types | It relies on structured data and the data schema is always known. | Hadoop can store structured, unstructured, and semi-structured data.
Cost | Since it is licensed, it is paid software. | It is a free open-source framework.
Processing | It offers little to no capabilities for processing. | It supports data processing for data distributed in a parallel manner across the cluster.
Read vs Write Schema | It follows ‘schema on write’, allowing the validation of schema to be done before data loading. | It supports the policy of schema on read.
Read/Write Speed | Reads are faster since the data schema is known. | Writes are faster since schema validation does not take place during HDFS write.
Best Use Case | It is used for Online Transactional Processing (OLTP) systems. | It is used for data analytics, data discovery, and OLAP systems.

16. What is Apache Spark?

Apache Spark is an open-source processing engine known for its speed and ease of use in Big Data processing and analysis. It also provides built-in modules for graph processing, machine learning, streaming, SQL, etc. The execution engine of Apache Spark supports in-memory computation and cyclic data flow. It can also access diverse data sources such as HBase, HDFS, Cassandra, etc.

17. Can you list the components of Apache Spark?

 

The components of the Apache Spark framework are as follows:

  • Spark Core Engine
  • Spark Streaming
  • MLlib
  • GraphX
  • Spark SQL
  • SparkR

One thing that needs to be noted here is that it is not necessary to use all Spark components together. But yes, the Spark Core Engine can be used with any of the other components listed above.

 

18. What are the differences between Hadoop and Spark?

 

Criteria | Hadoop | Spark
Dedicated storage | HDFS | None
Speed of processing | Average | Excellent
Libraries | Separate tools available | Spark Core, SQL, Streaming, MLlib, and GraphX

 

19. What is Apache Hive?

Apache Hive is an open-source tool or system in Hadoop; it is used for processing structured data stored in Hadoop. Apache Hive is the system responsible for facilitating analysis and queries in Hadoop. One of the benefits of using Apache Hive is that it helps SQL developers to write Hive queries almost similar to the SQL statements that are given for analysis and querying data.

20. Does Hive support multiline comments?

No. Hive does not support multiline comments. It only supports single-line comments as of now.

21. Explain the major difference between HDFS block and InputSplit

In simple terms, HDFS block is the physical representation of data, while InputSplit is the logical representation of the data present in the block. InputSplit acts as an intermediary between the block and the mapper.

Suppose there are two blocks:

Block 1: ii nntteell

Block 2: Ii ppaatt

Now considering the map, it will read Block 1 from ii to ll but does not know how to process Block 2 at the same time. InputSplit comes into play here, which will form a logical group of Block 1 and Block 2 as a single block.

It then forms key-value pairs using InputFormat and RecordReader and sends them to the map for further processing. If you have limited resources, you can increase the split size to limit the number of maps. For instance, if a 640 MB file is stored as 10 blocks of 64 MB each and resources are limited, you can set the split size to 128 MB. This forms logical groups of 128 MB, with only five maps executing at a time.

However, if splitting is disabled (for example, when the InputFormat’s isSplitable method returns false), the whole file forms one InputSplit and is processed by a single map, which takes more time when the file is bigger.
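The arithmetic in this example can be checked with a few lines of Python. This is a simplified sketch: the function `number_of_splits` is hypothetical and ignores the finer details of how Hadoop actually computes split boundaries; it only shows how the split size and the splittable setting determine the number of map tasks.

```python
MB = 1024 * 1024


def number_of_splits(file_size, split_size, splittable=True):
    """Return how many InputSplits (and therefore maps) a file produces."""
    if not splittable:
        return 1  # the whole file becomes a single InputSplit
    # Ceiling division: a final partial split still needs its own map.
    return -(-file_size // split_size)


if __name__ == "__main__":
    file_size = 640 * MB
    print(number_of_splits(file_size, 64 * MB))    # split size equal to block size
    print(number_of_splits(file_size, 128 * MB))   # larger splits, fewer maps
    print(number_of_splits(file_size, 128 * MB, splittable=False))
```

Doubling the split size halves the number of maps, which is the lever the answer above suggests for clusters with limited resources.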


Intermediate Interview Questions

22. What is the Hadoop Ecosystem?

 

The Hadoop Ecosystem is a bundle or suite of services related to solving Big Data problems. Precisely speaking, it is a platform consisting of various components and tools that function jointly to execute Big Data projects and solve the issues therein. It consists of Apache projects and various other components.

 

23. What is Hadoop Streaming?

 

Hadoop Streaming is one of the ways that Hadoop offers for non-Java development. It helps you write MapReduce programs in any language that can read standard input and write to standard output. The primary mechanisms are Hadoop Pipes, which gives a native C++ interface to Hadoop, and Hadoop Streaming, which permits any program that uses standard input and output to be used for map tasks and reduce tasks. With Hadoop Streaming, one can create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.

 

24. How is Hadoop different from other parallel computing systems?

 

Hadoop provides a distributed file system that lets you store and handle large amounts of data on a cluster of machines, handling data redundancy.

The primary benefit of this is that since the data is stored in several nodes, it is better to process it in a distributed manner. Each node can process the data stored on it, instead of spending time moving the data over the network.

On the contrary, in the relational database computing system, you can query the data in real-time, but it is not efficient to store the data in tables, records, and columns, when the data is large.

Hadoop also provides a scheme to build a column database with Hadoop HBase for runtime queries on rows.

Listed below are the main components of Hadoop:

  • HDFS: HDFS is Hadoop’s storage unit.
  • MapReduce: MapReduce is Hadoop’s processing unit.
  • YARN: YARN is the resource management unit of Apache Hadoop.


25. Can you list the limitations of Hadoop?

Hadoop is considered a very important Big Data management tool. However, like other tools, it also has some limitations of its own. They are as below:

  • In Hadoop, you can configure only one NameNode, so each cluster supports only one NameNode and one namespace.
  • Hadoop does not facilitate horizontal scalability of the NameNode.
  • Hourly backup of metadata from the NameNode needs to be given to the Secondary NameNode.
  • Hadoop is suitable only for the batch processing of a large amount of data; real-time data processing is not possible.
  • Only map or reduce jobs can be run by Hadoop.
  • Hadoop can support only up to 4000 nodes per cluster.
  • A single component, the JobTracker, performs the majority of the activities, such as managing Hadoop resources, scheduling jobs, monitoring jobs, and rescheduling jobs.
  • For that reason, the JobTracker is a single point of failure in Hadoop.


26. What is distributed cache? What are its benefits?

Distributed cache in Hadoop is a service provided by the MapReduce framework to cache files that a job needs.

Once a file is cached for a specific job, Hadoop makes it available on each DataNode where map and reduce tasks are executed, both on the local file system and in memory. You can then read the cached files and populate any collection, such as an array or hashmap, in your code.


The benefits of using distributed cache are as follows:

  • It distributes simple, read-only text/data files as well as more complex files such as jars and archives. The archives are then un-archived on the slave nodes.
  • Distributed cache tracks the modification timestamps of cache files, which signals that the files should not be modified while a job is executing.
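The pattern of reading a cached file into a hashmap and using it inside a map function can be sketched in plain Python. This is only an illustration of the idea, not Hadoop's API: the file name countries.tsv, its tab-separated layout, and the functions load_lookup, map_record, and demo are all assumptions made for the example, and a temporary directory stands in for the local copy that the framework would place on each node.

```python
import os
import tempfile


def load_lookup(path):
    """Read a small, read-only 'code<TAB>name' file into a dict, once per task."""
    lookup = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            code, name = line.rstrip("\n").split("\t", 1)
            lookup[code] = name
    return lookup


def map_record(record, lookup):
    """Replace the country code in 'user,code' with the cached country name."""
    user, code = record.split(",", 1)
    return user, lookup.get(code, "UNKNOWN")


def demo():
    # Stand-in for the node-local copy of the cached file.
    with tempfile.TemporaryDirectory() as workdir:
        cache_path = os.path.join(workdir, "countries.tsv")
        with open(cache_path, "w", encoding="utf-8") as handle:
            handle.write("NO\tNorway\nIN\tIndia\n")
        lookup = load_lookup(cache_path)
        return [map_record(r, lookup) for r in ["ann,NO", "raj,IN", "li,CN"]]
```

Loading the file once and then doing dictionary lookups per record is what makes the cache worthwhile: the small dataset never has to travel through the shuffle.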


27. Name the different configuration files in Hadoop

The different configuration files in Hadoop are:

  • mapred-site.xml
  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml

28. Can you skip the bad records in Hadoop? How?

Hadoop offers an option to skip certain sets of input records while processing map inputs. Applications control this feature through the SkipBadRecords class.

The SkipBadRecords class is commonly used when map tasks fail on particular input records, for example because of faults in the map function. By using this class, such bad records can be skipped.

29. What are the various components of Apache HBase?

There are three main components of Apache HBase that are mentioned below:

  • HMaster: It manages and coordinates the region servers, just as the NameNode manages DataNodes in HDFS.
  • Region Server: A table can be divided into multiple regions, and a region server serves a group of regions to the clients.
  • ZooKeeper: ZooKeeper is a coordinator in the distributed environment of HBase. ZooKeeper communicates through the sessions to maintain the state of the server in the cluster.

30. What is the syntax to run a MapReduce program?

The syntax used to run a MapReduce program is hadoop jar hadoop_jar_file.jar /input_path /output_path.

31. Which command will you give to copy data from the local system onto HDFS?

hadoop fs -copyFromLocal [source] [destination]

32. What are the components of Apache HBase’s Region Server?

The following are the components of HBase’s region server:

  • BlockCache: It resides on the region server and keeps frequently read data in memory.
  • WAL: The write-ahead log (WAL) is a file that is attached to each region server in the distributed environment.
  • MemStore: MemStore is the write cache that holds incoming data before it is written to disk or permanent storage.
  • HFile: The HFile stores the cells on disk and is itself stored in HDFS.

33. What are the various schedulers in YARN?

Mentioned below are the numerous schedulers that are available in YARN:

  • FIFO Scheduler: The first-in-first-out (FIFO) scheduler places all the applications in a single queue and executes them in the order of their submission. Because long-running applications can block short ones, the FIFO scheduler is less efficient and less desirable.
  • Capacity Scheduler: A separate queue makes it possible to start executing short jobs as soon as they are submitted. Unlike with the FIFO scheduler, long-running jobs are completed later under the capacity scheduler.
  • Fair Scheduler: The fair scheduler, as the name suggests, works fairly. It balances the resources dynamically between all the running jobs and does not need to reserve a specific capacity for them.
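The effect of FIFO ordering on short jobs can be seen with a toy calculation. The sketch below is not YARN's scheduling code; it assumes that all jobs are submitted at time 0, that the cluster completes one unit of work per unit of time, and that fair sharing splits the cluster equally among unfinished jobs. The function names are hypothetical.

```python
def fifo_finish_times(jobs):
    """jobs: list of (name, work) in submission order; one job runs at a time."""
    finish, clock = {}, 0.0
    for name, work in jobs:
        clock += work
        finish[name] = clock
    return finish


def fair_finish_times(jobs):
    """All unfinished jobs share the cluster equally (idealized fair sharing)."""
    finish, clock, done_work = {}, 0.0, 0.0
    remaining = len(jobs)
    for name, work in sorted(jobs, key=lambda job: job[1]):
        # Each of the `remaining` jobs advances at rate 1/remaining.
        clock += (work - done_work) * remaining
        done_work = work
        finish[name] = clock
        remaining -= 1
    return finish
```

With a long job of 10 units submitted just ahead of a short job of 1 unit, the short job finishes at time 11 under FIFO but at time 2 under fair sharing, while the long job finishes at time 10 and 11 respectively.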

34. What are the main components of YARN? Can you explain them?

The main components of YARN are explained below:

  • Resource Manager: It is the master daemon and is responsible for controlling resource allocation in the cluster.
  • Node Manager: It runs as a slave daemon on every data node and is responsible for executing tasks on that node.
  • Application Master: It controls the life cycle of a user job and the resource demands of a single application. The application master works with the node manager to monitor task execution.
  • Container: It is a bundle of Hadoop resources, which may include RAM, network, CPU, HDD, etc., on a single node.

35. Explain the difference among NameNode, Checkpoint NameNode, and Backup Node

  • NameNode is the core of HDFS. It manages the metadata, that is, the data about the data being stored. It maintains a directory tree of all the files present in HDFS on a Hadoop cluster. NameNode uses the following files for the namespace:
    • fsimage file: It keeps track of the latest checkpoint of the namespace.
    • edits file: It is a log of changes that have been made to the namespace since the checkpoint.


  • Checkpoint NameNode has the same directory structure as NameNode. It creates checkpoints of the namespace at regular intervals by downloading the fsimage and edits files and merging them in its local directory. The new image after merging is then uploaded to NameNode. There is a similar node, commonly known as the Secondary NameNode, but it does not support the upload-to-NameNode functionality.

Backup Node

  • Backup Node receives an online stream of the file system edit transactions from the primary NameNode. It also implements the checkpoint functionality and acts as a dynamic backup of the file system namespace (metadata) in the Hadoop system.
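The checkpoint idea, merging the last fsimage with the edits logged since then, can be illustrated with a small Python sketch. The real fsimage and edits files are binary formats; here the namespace is modeled as a plain dictionary and the operation names ("create", "set_replication", "delete") are hypothetical, chosen only for the example.

```python
def apply_edits(fsimage, edits):
    """Replay logged namespace changes on top of the last checkpoint."""
    # Copy each entry so the old checkpoint is left untouched.
    namespace = {path: dict(meta) for path, meta in fsimage.items()}
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = {"replication": args[0]}
        elif op == "set_replication":
            namespace[path]["replication"] = args[0]
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(fsimage, edits):
    """Merge fsimage and edits into a new image and start an empty edit log."""
    return apply_edits(fsimage, edits), []
```

After a checkpoint, the new image already contains every logged change, so the edit log can start again from empty. This is why regular checkpoints keep NameNode restarts short.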


36. What are the most common input formats in Hadoop?

The three most common input formats in Hadoop are:

  • Text Input Format: The default input format in Hadoop
  • Key-value Input Format: Used for plain text files where each line is split into a key and a value
  • Sequence File Input Format: Used for reading sequence files

37. What are the most common output formats in Hadoop?

The following are the commonly used output formats in Hadoop:

  • TextOutputFormat: It is the default output format in Hadoop.
  • MapFileOutputFormat: It writes the output as map files in Hadoop.
  • DBOutputFormat: It writes the output to relational databases and HBase.
  • SequenceFileOutputFormat: It is used for writing sequence files.
  • SequenceFileAsBinaryOutputFormat: It is used for writing keys and values to a sequence file in binary format.

38. How to execute a Pig script?

The three methods listed below enable users to execute a Pig script:

  • Grunt shell
  • Embedded script
  • Script file

39. What is Apache Pig and why is it preferred over MapReduce?

Apache Pig is a Hadoop-based platform that allows professionals to analyze large sets of data and represent them as data flows. Pig reduces the complexity of writing a MapReduce program, which gives it an edge over MapReduce.

The following are some of the reasons why Pig is preferred over MapReduce:

  • While Pig is a language for high-level data flow, MapReduce is a paradigm for low-level data processing.
  • A result that would require complex Java code in MapReduce can easily be achieved in Pig.
  • Pig reduces the code length by roughly 20 times and the development time by about 16 times compared with MapReduce.
  • Pig offers built-in functionalities to perform numerous operations, including sorting, filters, joins, ordering, etc., which are extremely difficult to perform in MapReduce.
  • Unlike MapReduce, Pig provides various nested data types such as bags, maps, and tuples.

40. What are the components of the Apache Pig architecture?

The components of the Apache Pig architecture are as follows:

  • Parser: It is responsible for handling Pig scripts and checking the syntax of the script.
  • Optimizer: It receives the logical plan (DAG) and carries out logical optimizations such as projection pushdown.
  • Compiler: It is responsible for the conversion of the logical plan into a series of MapReduce jobs.
  • Execution Engine: In the execution engine, the MapReduce jobs are submitted to Hadoop in sorted order.
  • Execution Mode: Apache Pig has two execution modes, local and MapReduce. The choice depends entirely on where the data is stored and where you want to run the Pig script.

41. Mention some commands in YARN to check application status and to kill an application.

 

The YARN commands are mentioned below as per their functionalities:

1. yarn application -status ApplicationID

This command checks the status of an application.

2. yarn application -kill ApplicationID

The command mentioned above enables users to kill or terminate a particular application.

42. What are the different components of Hive query processors?

There are numerous components that are used in Hive query processors. They are mentioned below:

  • User-defined functions
  • Semantic analyzer
  • Optimizer
  • Physical plan generation
  • Logical plan generation
  • Type checking
  • Execution engine
  • Parser
  • Operators

43. What are the commands to restart NameNode and all the daemons in Hadoop?

The following commands can be used to restart NameNode and all the daemons:

  • NameNode can be stopped with the ./sbin/hadoop-daemon.sh stop namenode command and started with the ./sbin/hadoop-daemon.sh start namenode command.
  • All the daemons can be stopped with the ./sbin/stop-all.sh command and started with the ./sbin/start-all.sh command.

44. Define DataNode. How does NameNode tackle DataNode failures?

DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each DataNode sends a heartbeat message to the NameNode to signal that it is alive, and a BlockReport that contains a list of all the blocks on that DataNode. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode to be dead or out of service and starts replicating the blocks that were hosted on it so that they are hosted on other DataNodes.

The NameNode manages the replication of the data blocks from one DataNode to another. In this process, the replication data is transferred directly between DataNodes, so the data never passes through the NameNode.
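The bookkeeping described above can be sketched in a few lines of Python. This is a simplified model rather than NameNode code: heartbeat times are plain numbers of seconds, block reports are lists of block IDs per node, and the function names are hypothetical.

```python
HEARTBEAT_TIMEOUT_SECONDS = 10 * 60  # the 10-minute limit mentioned above


def dead_datanodes(last_heartbeat, now):
    """Return DataNodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > HEARTBEAT_TIMEOUT_SECONDS
    )


def blocks_to_replicate(block_reports, dead, replication=3):
    """Map each under-replicated block to the live nodes that still hold it."""
    live_holders = {}
    for node, blocks in block_reports.items():
        for block in blocks:
            holders = live_holders.setdefault(block, [])
            if node not in dead:
                holders.append(node)
    return {
        block: holders
        for block, holders in live_holders.items()
        if len(holders) < replication
    }
```

The result tells the coordinator which live nodes can act as the source for each missing replica. The copy itself would then go from one DataNode to another, as the text notes.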


45. What is the significance of Sqoop’s eval tool?

The eval tool in Sqoop enables users to carry out user-defined queries on the corresponding database servers and check the outcome in the console.

46. Can you name the default file formats for importing data using Apache Sqoop?

Sqoop commonly uses two file formats to import data. They are:

  • Delimited Text File Format
  • Sequence File Format

47. What is the difference between relational database and HBase?

The differences between a relational database and HBase are mentioned below:

Relational Database | HBase
It is schema-based. | It has no schema.
It is row-oriented. | It is column-oriented.
It stores normalized data. | It stores denormalized data.
It consists of thin tables. | It consists of sparsely populated tables.
There is no built-in support or provision for automatic partitioning. | It supports automated partitioning.

48. What is the jps command used for?

The jps command is used to check whether the Hadoop daemons are running. It displays the Hadoop daemons that are currently running, such as namenode, datanode, resourcemanager, and nodemanager.


49. What are the core methods of a reducer?

The three core methods of a reducer are as follows:

  • setup(): This method is used for configuring various parameters such as input data size and distributed cache.
    public void setup(Context context)
  • reduce(): This method is the heart of the reducer and is called once per key with the list of values associated with that key.
    public void reduce(Key key, Iterable<Value> values, Context context)
  • cleanup(): This method is called only once, at the end of the task, to clean up the temporary files.
    public void cleanup(Context context)
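The calling order of the three methods can be mimicked in plain Python. The class and the run_reduce_task driver below are hypothetical stand-ins and not the Hadoop Reducer API; the sketch only shows that setup runs once, reduce runs once per key, and cleanup runs once at the end.

```python
from itertools import groupby
from operator import itemgetter


class SumReducer:
    """Toy reducer with the same three-phase life cycle."""

    def setup(self):
        self.keys_seen = 0  # per-task state, prepared once

    def reduce(self, key, values):
        self.keys_seen += 1
        return key, sum(values)

    def cleanup(self):
        return f"processed {self.keys_seen} keys"


def run_reduce_task(reducer, sorted_pairs):
    """Call setup once, reduce once per key, and cleanup once at the end."""
    reducer.setup()
    results = [
        reducer.reduce(key, [value for _, value in group])
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    ]
    return results, reducer.cleanup()
```

The input pairs must already be sorted by key, which is what the shuffle and sort phase guarantees before the reducer runs.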

50. What is Apache Flume? List the components of Apache Flume

Apache Flume is a tool in the Hadoop ecosystem that is used for collecting, aggregating, and moving large amounts of streaming data, such as log files and events. The main function of Apache Flume is to carry this streaming data from various web servers to HDFS.

The components of Apache Flume are as below:

  • Flume Channel
  • Flume Source
  • Flume Agent
  • Flume Sink
  • Flume Event

Advanced Interview Questions

51. What are the differences between MapReduce and Pig?

The differences between MapReduce and Pig are mentioned below:

MapReduce | Pig
It has more lines of code as compared to Pig. | It has fewer lines of code.
It is a low-level language that makes it difficult to perform operations such as join. | It is a high-level language that makes it easy to perform join and other similar operations.
Its compiling process is time-consuming. | During execution, all the Pig operators are internally converted into a MapReduce job.
A MapReduce program that is written in a particular version of Hadoop may not work in others. | It works in all Hadoop versions.

52. List the configuration parameters in a MapReduce program

The configuration parameters in MapReduce are given below:

  • Input locations of Jobs in the distributed file system
  • Output location of Jobs in the distributed file system
  • The input format of data
  • The output format of data
  • The class containing the map function
  • The class containing the reduce function
  • JAR file containing the classes—mapper, reducer, and driver

53. What is the default size of an HDFS data block?

Hadoop keeps the default size of an HDFS data block as 128 MB.

54. Why are the data blocks in HDFS so huge?

HDFS data blocks are large so that the time spent locating a block is small compared with the time spent transferring it; with large blocks, the transfer happens at the disk transfer rate. If the blocks were small, there would be a large number of blocks to transfer, which would force HDFS to store too much metadata and increase traffic.
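A quick calculation shows how block size drives the number of blocks, and therefore the amount of metadata, for a single file. The helper names below are hypothetical, and sizes are given in whole megabytes for simplicity.

```python
import math

BLOCK_SIZE_MB = 128  # default block size stated above


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def last_block_size_mb(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """The final block only occupies what is left of the file."""
    remainder = file_size_mb % block_size_mb
    return remainder or min(file_size_mb, block_size_mb)
```

A 1 GB file (1024 MB) needs 8 blocks of 128 MB, whereas with a hypothetical 4 MB block size the same file would need 256 blocks, each of which has to be tracked in the NameNode's metadata.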

55. What is a SequenceFile in Hadoop?

Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key-value pairs. The map outputs are stored as SequenceFile internally. It provides reader, writer, and sorter classes. The three SequenceFile formats are as follows:

  • Uncompressed key-value records
  • Record compressed key-value records—only values are compressed here
  • Block compressed key-value records—both keys and values are collected in blocks separately and compressed. The size of the block is configurable


56. What do you mean by WAL in HBase?

WAL stands for write-ahead log. This file is attached to each Region Server in the distributed environment. The WAL stores new data that has not yet been written to permanent storage, and it is used to recover that data in case of a failure.

57. List the two types of metadata that are stored by the NameNode server

The NameNode server stores metadata on disk and in RAM. The two types of metadata that the NameNode server stores are:

  • EditLogs
  • FsImage

58. Explain the architecture of YARN and how it allocates various resources to applications?

A client, application, or API communicates with the ResourceManager, which deals with allocating resources in the cluster. The ResourceManager is aware of the resources available on each NodeManager. It has two internal components, the application manager and the scheduler. The scheduler is responsible for allocating resources to the numerous applications running in parallel based on their requirements. However, the scheduler does not track the application status.

The application manager accepts job submissions and monitors and restarts the application master if there is a failure. The application master manages its application's demands for resources and communicates with the scheduler to get the needed resources. It interacts with the NodeManager to execute and monitor the tasks. The NodeManager, in turn, monitors the resources utilized by each container.

A container consists of a set of resources, including CPU, RAM, and network bandwidth. It allows an application to use a predefined amount of resources.

As soon as a job is submitted, the ResourceManager asks a NodeManager to reserve resources for processing. The NodeManager then assigns an available container. The ResourceManager starts the application master in one of these containers to manage the execution, and the remaining containers are used to run the tasks. This is the overall process by which YARN allocates resources to applications.

59. What are the differences between Sqoop and Flume?

The following are the various differences between Sqoop and Flume:

 

Sqoop | Flume
It works with NoSQL databases and RDBMS for importing and exporting data. | It works with streaming data, which is regularly generated in the Hadoop environment.
In Sqoop, loading data is not event-driven. | In Flume, loading data is event-driven.
It deals with data sources that are structured, and Sqoop connectors help in extracting data from them. | It extracts streaming data from application or web servers.
It takes data from RDBMS, imports it to HDFS, and exports it back to RDBMS. | Data from multiple sources flows into HDFS.

60. What is the role of a JobTracker in Hadoop?

A JobTracker's primary roles are resource management (managing the TaskTrackers and tracking resource availability) and task life cycle management (tracking the tasks' progress and providing fault tolerance).

 

  • JobTracker is a process that runs on a separate node, often not on a DataNode.
  • JobTracker communicates with the NameNode to identify the data location.
  • JobTracker finds the best TaskTracker nodes on which to execute the tasks.
  • JobTracker monitors individual TaskTrackers and reports the overall job status back to the client.
  • JobTracker tracks the execution of MapReduce workloads local to the slave node.


61. Can you name the port numbers for JobTracker, NameNode, and TaskTracker?

JobTracker: The default port number for JobTracker is 50030

NameNode: The default port number for NameNode is 50070

TaskTracker: The default port number for TaskTracker is 50060

62. What are the components of the architecture of Hive?

  • User Interface: It calls the execute interface of the driver, which builds a session for the query. The query is then sent to the compiler to create an execution plan.
  • Metastore: It stores the metadata and sends it to the compiler when a query is executed.
  • Compiler: It creates the execution plan, which is a DAG of stages where each stage is a map/reduce job, a metadata operation, or an operation on HDFS.
  • Execution Engine: It bridges the gap between Hadoop and Hive and helps in processing the query. It communicates with the metastore bidirectionally in order to perform various tasks.

63. Is it possible to import or export tables in HBase?

Yes, tables can be imported and exported in HBase clusters by using the commands listed below:

For export:

hbase org.apache.hadoop.hbase.mapreduce.Export "table name" "target export location"

For import:

create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}

hbase org.apache.hadoop.hbase.mapreduce.Import "table name" "target import location"

64. Why does Hive not store metadata in HDFS?

Hive stores table data in HDFS, while the metadata is stored in an RDBMS or locally. The metadata is not kept in HDFS because read and write operations in HDFS take a lot of time. This is why Hive uses an RDBMS to store this metadata in the metastore rather than HDFS. This makes the process faster and enables you to achieve low latency.

65. What are the significant components in the execution environment of Pig?

The main components of a Pig execution environment are as follows:

  • Pig Scripts: They are written in Pig Latin with the help of UDFs and built-in operators and are then sent to the execution environment.
  • Parser: It checks the script syntax and completes type checking. The parser's output is a directed acyclic graph (DAG).
  • Optimizer: It conducts optimization with operations such as transforms and merges to minimize the data in the pipeline.
  • Compiler: It automatically converts the optimized code into MapReduce jobs.
  • Execution Engine: The MapReduce jobs are sent to the execution engine in order to get the required output.

66. What is the command used to open a connection in HBase?

The command mentioned below can be used to open a connection in HBase:

Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");

67. What is the use of RecordReader in Hadoop?

Though InputSplit defines a slice of work, it does not describe how to access it. This is where the RecordReader class comes into the picture; it takes the byte-oriented data from its source and converts it into record-oriented key-value pairs that the Mapper task can read. The InputFormat, in turn, creates the RecordReader instance.
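To make the byte-to-record conversion concrete, here is a small Python sketch of a line-oriented reader. It is an illustration under stated assumptions and not Hadoop's implementation: the function name is hypothetical, each record is keyed by the byte offset of its line, and the split-boundary convention (a split that starts mid-file skips its first line, and the previous split reads one line past its end) is assumed for the example.

```python
def line_record_reader(raw, split_start, split_length, encoding="utf-8"):
    """Yield (byte_offset, line_text) pairs for the lines of one split."""
    end = split_start + split_length
    pos = split_start
    if split_start > 0:
        # A split that starts mid-file skips its first (possibly partial) line;
        # the reader of the previous split is responsible for it.
        newline = raw.find(b"\n", split_start)
        pos = len(raw) if newline == -1 else newline + 1
    while pos < len(raw) and pos <= end:
        newline = raw.find(b"\n", pos)
        line_end = len(raw) if newline == -1 else newline
        yield pos, raw[pos:line_end].decode(encoding)
        pos = line_end + 1
```

With this convention, two readers working on adjacent splits of the same bytes produce every line exactly once, even when a split boundary falls in the middle of a line.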

68. How does Sqoop import or export data between HDFS and RDBMS?

The steps followed by Sqoop to import and export data, using its architecture, between HDFS and RDBMS are listed below:

Import:

  • Sqoop queries the database to collect metadata.
  • Sqoop splits the input dataset and uses the respective map jobs to push these splits to HDFS.

Export:

  • Sqoop queries the database to collect metadata.
  • Sqoop splits the input dataset and uses the respective map jobs to push these splits to the RDBMS, exporting the Hadoop files back to the RDBMS tables.

69. What is speculative execution in Hadoop?

One limitation of Hadoop is that, because tasks are distributed over several nodes, a few slow nodes can hold back the rest of the program. Tasks can be slow for various reasons, which are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is speculative execution.

Speculative execution creates a duplicate task on another node, so the same input can be processed multiple times in parallel. When most tasks in a job have completed, the speculative execution mechanism schedules duplicate copies of the remaining, slower tasks across the nodes that are currently free. When a task finishes, this is reported to the JobTracker. If other copies of that task are still executing speculatively, Hadoop notifies the TaskTrackers to quit those tasks and reject their output.

Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.


70. What is Apache Oozie?

Apache Oozie is a scheduler that helps schedule jobs in Hadoop and bundle them as a single logical unit of work. Oozie jobs can largely be divided into the following two categories:

  • Oozie Workflow: These jobs are a set of sequential actions that need to be executed.
  • Oozie Coordinator: These jobs are triggered when the data they need becomes available; until then, they wait.

71. What happens if you try to run a Hadoop job with an output directory that is already present?

It will throw an exception saying that the output file directory already exists.

To run the MapReduce job, make sure that the output directory does not already exist in HDFS.

To delete the directory before running the job, you can use the shell:

hadoop fs -rm -r /path/to/your/output/

Or the Java API:

FileSystem.get(conf).delete(outputDir, true);

72. How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, make sure that there are no orphaned jobs running; if there are, determine the location of the ResourceManager (RM) logs.

  • Run:
ps -ef | grep -i ResourceManager

Look for the log directory in the displayed result. Find the job ID in the displayed list and check whether there is an error message associated with that job.

  • On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
  • Now, log in to that node and run the following command:
ps -ef | grep -i NodeManager
  • Then, examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.

73. How to configure the replication factor in HDFS?

The hdfs-site.xml file is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all the files placed in HDFS.

The replication factor can also be modified on a per-file basis by using the Hadoop FS shell:

[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file

Similarly, the replication factor of all the files under a directory can be changed:

[training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir


74. How to compress a mapper output not touching reducer output?

To achieve this compression, the following should be set:

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);

75. What are the basic parameters of a mapper?

Given below are the basic parameters of a mapper:

  • LongWritable and Text: the input key and value types
  • Text and IntWritable: the output key and value types

76. What is the difference between map-side join and reduce-side join?

Map-side join is performed in the map phase, when the data reaches the mapper. The input datasets must follow a strict structure for a map-side join to be defined.

On the other hand, reduce-side join, or repartitioned join, is simpler than map-side join since the input datasets in a reduce-side join need not be structured. However, it is less efficient because the data has to go through the sort and shuffle phases, which add network overhead.
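A reduce-side join can be sketched in plain Python to show where the sort and shuffle cost comes from. The function and the sample record layouts, (user_id, name) and (user_id, item), are hypothetical; the grouping dictionary stands in for the shuffle phase of a real job.

```python
from collections import defaultdict


def reduce_side_join(users, orders):
    """Join (user_id, name) with (user_id, item) records on user_id."""
    # Map phase: tag every record with its source so the reducer can tell them apart.
    tagged = [(uid, ("user", name)) for uid, name in users]
    tagged += [(uid, ("order", item)) for uid, item in orders]

    # Shuffle and sort: bring all records with the same key together.
    grouped = defaultdict(list)
    for key, value in tagged:
        grouped[key].append(value)

    # Reduce phase: pair every user record with every order record per key.
    joined = []
    for key in sorted(grouped):
        names = [v for tag, v in grouped[key] if tag == "user"]
        items = [v for tag, v in grouped[key] if tag == "order"]
        joined.extend((key, name, item) for name in names for item in items)
    return joined
```

Every record from both inputs passes through the grouping step, which is the part that travels over the network in a real cluster. A map-side join avoids that step, at the price of stricter requirements on the inputs.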

77. How can you transfer data from Hive to HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

This query selects the data to be transferred from Hive to HDFS. The output is stored in part files in the specified HDFS path.

78. Which companies use Hadoop?

  • Yahoo! – It is the biggest contributor to the creation of Hadoop; its search engine uses Hadoop
  • Facebook – developed Hive for analysis
  • Amazon
  • Netflix
  • Adobe
  • eBay
  • Spotify
  • Twitter



Original article source at: https://intellipaat.com/

#bigdata #hadoop #interview #question #answers 

Top Hadoop Interview Questions and Answers
Saul  Alaniz

Saul Alaniz

1651739418

How to Use Google Dataproc: An Example with PySpark and Jupyter Notebook

In this article, I'll explain what Dataproc is and how it works.

Dataproc is a managed Google Cloud Platform service for Spark and Hadoop that helps you with Big Data processing, ETL, and machine learning. It provides a Hadoop cluster and supports Hadoop ecosystem tools such as Flink, Hive, Presto, Pig, and Spark.

Dataproc is an auto-scaling cluster that manages logging, monitoring, creation of the clusters you choose, and job orchestration. You'll need to provision the cluster manually, but once it is provisioned you can submit jobs to Spark, Flink, Presto, and Hadoop.

Dataproc has implicit integration with other GCP products such as Compute Engine, Cloud Storage, Bigtable, BigQuery, Cloud Monitoring, and so on. The jobs supported by Dataproc are MapReduce, Spark, PySpark, SparkSQL, SparkR, Hive, and Pig.

Apart from that, Dataproc also allows native integration with Jupyter Notebooks, which we'll cover later in this article.

In this article, we're going to cover:

  1. Dataproc cluster types and how to set up Dataproc
  2. How to submit a PySpark job to Dataproc
  3. How to create a Notebook instance and run PySpark jobs through Jupyter Notebook.

How to Create a Dataproc Cluster

Dataproc has three cluster types:

  1. Standard
  2. Single Node
  3. High Availability

The Standard cluster can consist of 1 master and N worker nodes. The Single Node cluster has only 1 master node and 0 worker nodes. For production purposes, you should use the High Availability cluster, which has 3 master nodes and N workers.

For our learning purposes, a single-node cluster with only 1 master node is sufficient.

Creating Dataproc clusters in GCP is straightforward. First, we'll need to enable Dataproc, and then we'll be able to create the cluster.

Starting Dataproc cluster creation

When you click "Create Cluster", GCP gives you the option to select the cluster type, the cluster name, the location, auto-scaling options, and more.

Parameters required for the cluster

Since we selected the Single Node cluster option, auto-scaling is disabled, because the cluster consists of only 1 master node.

The Configure nodes option lets us select the machine family type, such as Compute Optimized, GPU, and General-Purpose.

In this tutorial, we'll use the General-Purpose machine option. Through it, you can select the Machine Type, Primary Disk Size, and Disk Type options.

The machine type we're going to select is n1-standard-2, which has 2 CPUs and 7.5 GB of memory. The primary disk size is 100 GB, which is sufficient for our demo purposes here.

Master node configuration

We have selected the Single Node cluster type, so the configuration consists of a master node only. If you select any other cluster type, you'll also need to configure the master node and the worker nodes.

In the Customize cluster option, select the default network configuration:

Use the "Scheduled deletion" option in case the cluster is no longer required at a specific future time (or, say, after a few hours, days, or minutes).

Scheduled deletion settings

Aquí, configuramos el "Tiempo de espera" en 2 horas, por lo que el clúster se eliminará automáticamente después de 2 horas.

Usaremos la opción de seguridad predeterminada, que es una clave de cifrado administrada por Google. Cuando haga clic en "Crear", comenzará a crear el clúster.  

You can also create the cluster with the 'gcloud' command, which you will find under the 'EQUIVALENT COMMAND LINE' option in the console.

Alternatively, you can create the cluster with a POST request, which you will find under the 'EQUIVALENT REST' option.

gcloud and REST options for cluster creation
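
As a rough idea of what the equivalent command line might look like for the settings chosen in this walkthrough, here is a minimal Python sketch that only assembles the gcloud arguments and prints them; it does not call GCP. The region "us-central1" is a placeholder, and mapping the 2-hour "Timeout" to --max-age is an assumption (the console may generate --max-idle or a different flag set for your configuration).

```python
import shlex


def build_create_cluster_command(cluster_name, region):
    """Return a gcloud argument list for a single-node Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",                   # placeholder region, adjust to yours
        "--single-node",                        # 1 master node, 0 worker nodes
        "--master-machine-type=n1-standard-2",  # machine type chosen in the tutorial
        "--master-boot-disk-size=100GB",        # primary disk size chosen in the tutorial
        "--max-age=2h",                         # assumption: stands in for the 2-hour timeout
    ]


command = build_create_cluster_command("first-data-proc-cluster", "us-central1")
print(shlex.join(command))
```

Compare the printed command with the one the console shows before running anything, since the console output reflects your actual project and settings.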

After a few minutes, the cluster with 1 master node is ready to use.

Cluster up and running

You can find details about the VM instances by clicking the cluster name.

How to submit a PySpark job

Before submitting a PySpark job to Dataproc, let's briefly look at how one works. This is a simple job that finds the distinct elements of a list containing duplicates.

#!/usr/bin/python

import pyspark

# Create the input list (contains duplicates)
numbers = [1, 2, 1, 2, 3, 4, 4, 6]

# Create the SparkContext, the entry point for RDD operations
sc = pyspark.SparkContext()

# Create an RDD from the list with SparkContext.parallelize
rdd = sc.parallelize(numbers)

# Collect the distinct elements back to the driver (order is not guaranteed)
distinct_numbers = rdd.distinct().collect()

# Print the result
print('Distinct Numbers:', distinct_numbers)

Code to find the distinct elements of a list
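
If you want to check the expected result without a cluster, the following pure-Python sketch mimics what distinct() computes. It is not Spark code: distinct_local is a hypothetical helper that follows the map, reduce-by-key, map pattern on which RDD.distinct is conceptually based, run on a single machine.

```python
def distinct_local(values):
    """Pure-Python stand-in for rdd.distinct().collect(); no Spark involved."""
    pairs = [(value, None) for value in values]  # map: value -> (value, None)
    reduced = {}
    for key, placeholder in pairs:               # reduce by key: one entry per key
        reduced[key] = placeholder
    return list(reduced)                         # map: keep only the keys


numbers = [1, 2, 1, 2, 3, 4, 4, 6]
distinct_numbers = distinct_local(numbers)
print("Distinct Numbers:", sorted(distinct_numbers))  # Distinct Numbers: [1, 2, 3, 4, 6]
```

The result is sorted here only for readability; Spark itself makes no promise about the order in which distinct elements are returned.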

Upload the .py file to a GCS bucket; we will need its location when configuring the PySpark job.

GCS location of the job

Submitting jobs in Dataproc is straightforward. Just select the "Submit Job" option.

Job submission

To submit a job, you need to provide the Job ID (the name of the job), the region, the cluster name (here, "first-data-proc-cluster"), and the job type, which is PySpark.

Parameters required for job submission

You can get the location of the Python file from the GCS bucket where it was uploaded; it is shown as the gsutil URI.
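
The same submission can also be done from the command line. The sketch below only builds and prints a gcloud command from the parameters described above; the bucket path "gs://example-bucket/distinct.py" and the region "us-central1" are placeholders, not values from this tutorial.

```python
import shlex


def build_submit_command(script_uri, cluster_name, region):
    """Return a gcloud argument list that submits a PySpark job to Dataproc."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        script_uri,                   # gsutil URI of the uploaded .py file
        f"--cluster={cluster_name}",  # cluster that runs the job
        f"--region={region}",         # region the cluster lives in
    ]


command = build_submit_command(
    "gs://example-bucket/distinct.py",  # placeholder bucket and file name
    "first-data-proc-cluster",
    "us-central1",                      # placeholder region
)
print(shlex.join(command))
```

Replace the placeholders with the gsutil URI and region of your own project before running the printed command.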

No other parameters are required, so we can now submit the job.

After execution, you should find the distinct numbers in the logs.

Logs

How to create a Jupyter Notebook instance

You can associate a notebook instance with Dataproc Hub. To do that, GCP provisions a cluster for each notebook instance. PySpark and SparkR job types can be run from the notebook.

To create a notebook, use the "Workbench" option.

Go through the usual configuration: notebook name, region, environment (Dataproc Hub), and machine configuration (we use 2 vCPUs with 7.5 GB of RAM). We use the default network settings and, in the Permission section, select the "Service account" option.

Parameters required for notebook cluster creation

Click Create.

Notebook cluster up and running

The "OPEN JUPYTERLAB" option lets users specify the cluster options and zone for their notebook.

Once provisioning completes, the notebook offers a few kernel options.

Click PySpark, which lets you run jobs from the notebook.

A SparkContext instance is already available, so you do not need to create one explicitly. Apart from that, the program stays the same.

Code snapshot in the notebook

Conclusion

Working with Spark and Hadoop becomes much easier when you use GCP Dataproc. The best part is that you can create a notebook cluster, which makes development simpler.

Source: https://www.freecodecamp.org/news/what-is-google-dataproc/

#dataproc #apache-spark #hadoop #pyspark #jupyter 

How to Use Google Dataproc: Example with PySpark and Jupyter Notebook

This article explains what Dataproc is and how it works.

Dataproc is a Google Cloud Platform managed service for Spark and Hadoop that helps with big data processing, ETL, and machine learning. It provides a Hadoop cluster and supports Hadoop ecosystem tools such as Flink, Hive, Presto, Pig, and Spark.

Dataproc provides autoscaling clusters and manages logging, monitoring, cluster creation, and job orchestration. You need to provision the cluster manually, but once it is provisioned you can submit jobs to Spark, Flink, Presto, and Hadoop.

Dataproc integrates implicitly with other GCP products such as Compute Engine, Cloud Storage, Bigtable, BigQuery, and Cloud Monitoring. The job types supported by Dataproc are MapReduce, Spark, PySpark, SparkSQL, SparkR, Hive, and Pig.

In addition, Dataproc offers native integration with Jupyter Notebooks, which we cover later in this article.

This article covers:

  1. Dataproc cluster types and how to set up Dataproc
  2. How to submit a PySpark job to Dataproc
  3. How to create a Notebook instance and run PySpark jobs through Jupyter Notebook

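How to create a Dataproc cluster

Dataproc has three cluster types:

  1. Standard
  2. Single Node
  3. High Availability

A Standard cluster consists of 1 master node and N worker nodes. A Single Node cluster has only 1 master node and 0 worker nodes. For production, use a High Availability cluster, which has 3 master nodes and N worker nodes.

For learning purposes, a Single Node cluster with just 1 master node is enough.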
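
To keep the three shapes straight, here is a small illustrative sketch that encodes them as plain data. ClusterType and describe are hypothetical names made up for this summary; they are not part of any Dataproc API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterType:
    name: str
    masters: int       # number of master nodes
    has_workers: bool  # True means "N worker nodes", False means 0


CLUSTER_TYPES = [
    ClusterType("Standard", masters=1, has_workers=True),
    ClusterType("Single Node", masters=1, has_workers=False),
    ClusterType("High Availability", masters=3, has_workers=True),
]


def describe(cluster_type):
    """Render one cluster type as a short human-readable line."""
    workers = "N workers" if cluster_type.has_workers else "0 workers"
    return f"{cluster_type.name}: {cluster_type.masters} master(s), {workers}"


for cluster_type in CLUSTER_TYPES:
    print(describe(cluster_type))
```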

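Creating a Dataproc cluster on GCP is straightforward. First enable Dataproc; after that you can create the cluster.

Starting Dataproc cluster creation

When you click "Create Cluster", GCP lets you choose the cluster type, cluster name, location, autoscaling options, and more.

Parameters required for the cluster

Because we selected the Single Node cluster option, autoscaling is disabled: the cluster consists of only 1 master node.

The Configure nodes option lets you select the machine family, such as Compute Optimized, GPU, or General-Purpose.

In this tutorial we use the General-Purpose machine option. Under it you can select the machine type, primary disk size, and disk type.

The machine type we select is n1-standard-2, which has 2 CPUs and 7.5 GB of memory. The primary disk size is 100 GB, which is enough for this demo.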

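Master node configuration

We selected the Single Node cluster type, so the configuration consists of a master node only. If you select any other cluster type, you will need to configure both the master node and the worker nodes.

In the Customize cluster option, select the default network configuration.

Use the "Scheduled deletion" option if the cluster will no longer be needed after a specific future time (for example, after a few minutes, hours, or days).

Scheduled deletion settings

Here we set the "Timeout" to 2 hours, so the cluster is deleted automatically after 2 hours.

We use the default security option, which is a Google-managed encryption key. When you click "Create", cluster creation begins.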

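You can also create the cluster with the 'gcloud' command, which you will find under the 'EQUIVALENT COMMAND LINE' option in the console.

Alternatively, you can create the cluster with a POST request, which you will find under the 'EQUIVALENT REST' option.

gcloud and REST options for cluster creation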
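
For a sense of what the REST request might contain, the sketch below assembles a request URL and JSON body for a single-node cluster with the settings used in this walkthrough. It only builds and prints the payload; nothing is sent. The project ID "example-project" and region "us-central1" are placeholders, and expressing the 2-hour timeout as autoDeleteTtl is an assumption, so treat the request shown in the console as the reference.

```python
import json


def build_cluster_request(project_id, region, cluster_name):
    """Return (url, body) for a Dataproc clusters.create REST call."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/regions/{region}/clusters"
    )
    body = {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "masterConfig": {
                "numInstances": 1,
                "machineTypeUri": "n1-standard-2",
                "diskConfig": {"bootDiskSizeGb": 100},
            },
            # Single-node clusters are expressed as "zero workers allowed"
            "softwareConfig": {
                "properties": {"dataproc:dataproc.allow.zero.workers": "true"}
            },
            # Assumption: the 2-hour timeout maps to a time-to-live of 7200 seconds
            "lifecycleConfig": {"autoDeleteTtl": "7200s"},
        },
    }
    return url, body


url, body = build_cluster_request(
    "example-project", "us-central1", "first-data-proc-cluster"
)
print("POST", url)
print(json.dumps(body, indent=2))
```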

After a few minutes, the cluster with 1 master node is ready to use.

Cluster up and running

You can find details about the VM instances by clicking the cluster name.

How to submit a PySpark job

Before submitting a PySpark job to Dataproc, let's briefly look at how one works. This is a simple job that finds the distinct elements of a list containing duplicates.

#!/usr/bin/python

import pyspark

# Create the input list (contains duplicates)
numbers = [1, 2, 1, 2, 3, 4, 4, 6]

# Create the SparkContext, the entry point for RDD operations
sc = pyspark.SparkContext()

# Create an RDD from the list with SparkContext.parallelize
rdd = sc.parallelize(numbers)

# Collect the distinct elements back to the driver (order is not guaranteed)
distinct_numbers = rdd.distinct().collect()

# Print the result
print('Distinct Numbers:', distinct_numbers)

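Code to find the distinct elements of a list

Upload the .py file to a GCS bucket; we will need its location when configuring the PySpark job.

GCS location of the job

Submitting jobs in Dataproc is straightforward. Just select the "Submit Job" option.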

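Job submission

To submit a job, you need to provide the Job ID (the name of the job), the region, the cluster name (here, "first-data-proc-cluster"), and the job type, which is PySpark.

Parameters required for job submission

You can get the location of the Python file from the GCS bucket where it was uploaded; it is shown as the gsutil URI.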

No other parameters are required, so we can now submit the job.

After execution, you should find the distinct numbers in the logs.

Logs

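How to create a Jupyter Notebook instance

You can associate a notebook instance with Dataproc Hub. To do that, GCP provisions a cluster for each notebook instance. PySpark and SparkR job types can be run from the notebook.

To create a notebook, use the "Workbench" option.

Go through the usual configuration: notebook name, region, environment (Dataproc Hub), and machine configuration (we use 2 vCPUs with 7.5 GB of RAM). We use the default network settings and, in the Permission section, select the "Service account" option.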

Parameters required for notebook cluster creation

Click Create.

Notebook cluster up and running

The "OPEN JUPYTERLAB" option lets users specify the cluster options and zone for their notebook.

Once provisioning completes, the notebook offers a few kernel options.

Click PySpark, which lets you run jobs from the notebook.

A SparkContext instance is already available, so you do not need to create one explicitly. Apart from that, the program stays the same.

Code snapshot in the notebook
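
Since the screenshot is not reproduced here, the following is a minimal sketch of what the notebook version of the job could look like. It assumes the PySpark kernel exposes the ready-made context under the name sc (a common convention, but an assumption here); distinct_numbers is a hypothetical helper that accepts whatever context it is given instead of creating one.

```python
def distinct_numbers(spark_context, numbers):
    """Return the distinct elements of `numbers` using an existing SparkContext."""
    rdd = spark_context.parallelize(numbers)  # distribute the list as an RDD
    return rdd.distinct().collect()           # deduplicate, then bring results back


# In a PySpark notebook kernel, where the context is assumed to exist as `sc`:
# print('Distinct Numbers:', distinct_numbers(sc, [1, 2, 1, 2, 3, 4, 4, 6]))
```

Passing the context in as an argument means the same function works both in a notebook cell and in a standalone script that creates its own SparkContext.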

Conclusion

Working with Spark and Hadoop becomes much easier when you use GCP Dataproc. The best part is that you can create a notebook cluster, which makes development simpler.

Source: https://www.freecodecamp.org/news/what-is-google-dataproc/

#dataproc #apache-spark #hadoop #pyspark #jupyter 

Gunjan  Khaitan

1649489633

Hadoop Tutorial for Beginners - Full Course

Hadoop Tutorial For Beginners 2022 | Hadoop Full Course In 10 Hours | Big Data Tutorial

This full course video on Hadoop introduces you to the world of big data, its applications, its significant challenges, and how Hadoop solves them. You will get an overview of the essential tools in the Hadoop ecosystem. You will learn how Hadoop stores vast volumes of data using HDFS and processes that data using MapReduce, and how cluster resource management works using YARN. You will also learn how to query and analyze big data using tools and frameworks like Hive, Pig, Sqoop, and HBase. Working with these tools gives you hands-on experience that will help you understand them better. Finally, you will see how to become a big data engineer and go through a few important interview questions to help build your career in Hadoop. Now, let's get started and learn Hadoop.

The following topics are covered in this Hadoop full course tutorial:

  • Evolution of Big Data
  • Why Big Data
  • What is Big Data
  • 5V's of Big Data
  • Big Data Case Study
  • Challenges of Big Data
  • Hadoop as a Solution
  • History of Hadoop
  • Cloudera Hadoop Installation
  • Hadoop Installation on Ubuntu
  • Hadoop Ecosystem
  • HDFS Tutorial
  • Why HDFS?
  • What is HDFS? 
  • HDFS Cluster Architecture
  • HDFS Data Blocks
  • DataNode Failure and Replication
  • Rack Awareness in HDFS
  • HDFS Architecture
  • HDFS Read Mechanism
  • HDFS Write Mechanism
  • HDFS Write Mechanism with example
  • Advantages of HDFS
  • HDFS Tutorial
  • MapReduce Analogy
  • What is MapReduce?
  • Parallel Processing MapReduce
  • MapReduce Workflow
  • MapReduce Architecture
  • MapReduce Example
  • Hadoop 1.0 (MR 1)
  • Limitations of Hadoop 1.0 (MR 1)
  • Need for YARN - 4:02:25
  • Solution - Hadoop 2.0 (YARN) - 4:04:15
  • What is YARN? - 4:05:13
  • Workloads running on YARN - 4:06:33
  • YARN Components
  • YARN Components - Resource Manager
  • YARN Components - Node Manager
  • YARN Architecture
  • Running an application in YARN
  • Need for Sqoop
  • What is Sqoop
  • Sqoop Features
  • Sqoop Architecture
  • Sqoop Import
  • Sqoop Export
  • Sqoop Processing
  • Demo on Sqoop
  • Flume
  • Hadoop Ecosystem
  • History of Hive
  • Big Data Analytics
  • Big Data Applications
  • How to become a Big Data Engineer
  • Hadoop Interview Questions

#hadoop #bigdata 

Madyson  Moore

1647802800

How to install Hadoop on Windows 10 In 9 Minutes

In this video tutorial, we show how to install Hadoop on Windows 10 by setting up the Cloudera QuickStart VM in VirtualBox.

The Hadoop installation takes about 10 minutes, and the video shows how Hadoop is set up on a Windows 10 machine.

The video also covers configuring Hadoop on Windows: JAVA_HOME for Hadoop is set automatically by the Cloudera QuickStart VM in VirtualBox.

We discuss the following points:
- How to install Hadoop on Windows 10
- How to install Hadoop on VirtualBox
- How to download the Cloudera QuickStart VM for VirtualBox
- Big data tutorial

Links to download Hadoop:
Cloudera QuickStart VM 5.4.2: https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip
Cloudera QuickStart VM 5.13.0: https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip

#hadoop 

Felix Kling

1646807855

Getting Started with Hadoop & Apache Spark

1 - Installing Debian

In this video we are installing Debian which we will use as an operating system to run a Hadoop and Apache Spark pseudo cluster.
This video covers creating a Virtual Machine in Windows, Downloading & Installing Debian, and the absolute basics of working with Linux.

2 - Downloading Hadoop
Here we will download Hadoop to our newly configured Virtual Machine. We will extract it and check whether it just works out of the box.

3 - Configuring Hadoop
After downloading and installing Hadoop we are going to configure it. After all configurations are done, we will have a working pseudo cluster for HDFS.

4 - Configuring YARN
After configuring our HDFS, we now want to configure a resource manager (YARN) to manage our pseudo cluster. For this we will adjust quite a few configurations. 
You can download my config file via the following link: https://drive.google.com/file/d/11FL12RHSAug_aQtvaG4r2KJP1RhMw3Pk/view

5 - Interacting with HDFS
After making all the configurations we can finally fire up our Hadoop cluster and start interacting with it. We will learn how to interact with HDFS such as listing the content and uploading data to it.
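
As a rough illustration of the two interactions mentioned here, listing content and uploading data, the sketch below builds the corresponding hdfs dfs commands and prints them without running anything. The file name "data.csv" and the target path "/data/data.csv" are placeholders, not paths from the video.

```python
import shlex


def hdfs_ls(path):
    """Command that lists the contents of an HDFS directory."""
    return ["hdfs", "dfs", "-ls", path]


def hdfs_put(local_path, hdfs_path):
    """Command that uploads a local file to HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


commands = [
    hdfs_ls("/"),                            # list the HDFS root
    hdfs_put("data.csv", "/data/data.csv"),  # placeholder file and target path
]
for command in commands:
    print(shlex.join(command))
```

On a machine where the cluster is running, each argument list could be passed to subprocess.run to execute it.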

6 - Installing & Configuring Spark
After we are done configuring our HDFS, it is now time to get a good computation engine. For this we will download and configure Apache Spark.

7 - Loading Data into Spark
With a running Spark pseudo cluster, we now want to load data from HDFS into a Spark data frame.

8 - Running SQL Queries in Spark
Let us learn how to run typical SQL queries in Apache Spark. This includes selecting columns, filtering rows, joining tables, and creating new columns from existing ones.
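
To show the shape of such a query without a Spark installation, the sketch below uses Python's built-in sqlite3 module purely as a stand-in SQL engine; it is not Spark. The tables, columns, and rows are invented for illustration. In Spark, a similar SELECT could be passed to spark.sql after registering the data frames as temporary views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                     quantity INTEGER, unit_price REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 2, 5.0), (11, 2, 1, 20.0), (12, 1, 4, 2.5);
""")

# Select columns, filter rows, join two tables, and derive a new column
query = """
SELECT c.name, o.order_id, o.quantity * o.unit_price AS total
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
WHERE o.quantity >= 2
ORDER BY o.order_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Ada', 10, 10.0), ('Ada', 12, 10.0)]
```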

9 - Save Data from Spark to HDFS
In the last video of this series we will save our Spark data frame into a Parquet file on HDFS.

#hadoop #apachespark #bigdata
