As the third largest e-commerce site in China, Vipshop processes large amounts of data collected daily to generate targeted advertisements for its consumers. In this article, guest author Gang Deng from Vipshop describes how Vipshop meets its SLAs by speeding up struggling Spark jobs on HDFS by up to 30x and by optimizing hot data access with Alluxio, creating a reliable and stable computation pipeline for targeted e-commerce advertising.

I. Overview

Targeted advertising is key for many e-commerce companies to acquire new customers and increase their Gross Merchandise Volume (GMV). In modern advertising systems at e-commerce sites like VIP.com, large-scale data analysis is what makes targeting effective.

Every day at Vipshop, we run tens of thousands of queries to derive insights for targeted ads from a dozen Hive tables stored in HDFS. Our platform is based on Hadoop clusters, which provide persistent and scalable storage through HDFS and efficient, reliable computation through Hive MapReduce or Spark orchestrated by YARN. In particular, YARN and HDFS are deployed together, so each instance hosts both an HDFS DataNode process and a YARN NodeManager process.
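To make the setup concrete, below is a minimal sketch of the kind of Spark-on-YARN job the pipeline runs: read one daily partition of a Hive table from HDFS and derive simple per-user aggregates. The table, column, and path names here are hypothetical placeholders for illustration, not our production schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DailyAdFeatures {
  def main(args: Array[String]): Unit = {
    val dt = args(0) // partition date, e.g. "2021-06-01" (yesterday's data)

    val spark = SparkSession.builder()
      .appName("daily-ad-features")
      .enableHiveSupport() // lets Spark read the Hive tables backed by HDFS
      .getOrCreate()       // submitted with spark-submit --master yarn

    // Hypothetical table and columns; one partition of click logs per day.
    val clicks = spark.table("ads.user_click_log").filter(col("dt") === dt)

    // Simple per-user aggregate as a stand-in for the real feature derivation.
    val features = clicks.groupBy("user_id").count()

    // Write results back to HDFS for downstream targeting jobs.
    features.write.mode("overwrite").parquet(s"/warehouse/ads/user_features/dt=$dt")

    spark.stop()
  }
}
```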

In our pipeline, the input is historical data from the previous day. Typically, this daily data is available at 8 AM, and the pipeline must complete within 12 hours due to the time-sensitivity of targeted ads. Among these computation tasks, Spark jobs account for roughly 90% and Hive MapReduce jobs make up the remainder.

With Alluxio, we separate storage and compute by moving HDFS to an isolated cluster. Compute resources can now be scaled independently of storage capacity, while Alluxio restores data locality by keeping additional copies of frequently accessed data on the compute cluster, as sketched below.
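The following is a hedged sketch of how a job can switch from reading the remote HDFS cluster directly to reading through Alluxio: only the filesystem scheme and path prefix change, provided the Alluxio client jar is on the Spark classpath. The host names, ports, and paths are placeholders, not our production values.

```scala
import org.apache.spark.sql.SparkSession

object ReadViaAlluxio {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-via-alluxio")
      .getOrCreate()

    // Before: read cold data straight from the remote HDFS cluster.
    // val df = spark.read.parquet(
    //   "hdfs://hdfs-nameservice/warehouse/ads/user_features/dt=2021-06-01")

    // After: read the same files through Alluxio, which caches them on the
    // compute cluster and can keep extra replicas of hot data for locality.
    val df = spark.read.parquet(
      "alluxio://alluxio-master:19998/warehouse/ads/user_features/dt=2021-06-01")

    df.groupBy("user_id").count().show(20)
    spark.stop()
  }
}
```

Frequently accessed paths can additionally be pinned in Alluxio or given a higher in-Alluxio replication factor through the Alluxio shell (for example, `alluxio fs pin` or `alluxio fs setReplication` in Alluxio 2.x); the exact commands and options depend on the Alluxio version in use.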
