Wilford Pagac

1602302400

Migrating Apache Spark workloads from AWS EMR to Kubernetes

Introduction

I will focus on AWS Elastic MapReduce (EMR) since we run our Spark workloads on AWS. We use Apache Airflow for workflow orchestration.

Data Flow

The data comes from different sources that are spread across different geographic regions and do not necessarily run on the AWS cloud. For example, some of the data sources are web apps running in browsers, others are mobile applications, some are external data pipelines, etc. Here and here you can see how we implemented our data ingestion steps. All input data is collected in S3 buckets and indexed by creation date in AWS DynamoDB. Doing so allows us to process data batches for any given time interval. We process roughly 2 TB of data per day, with ‘special event’ days when the amount of data can be much bigger.
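To make the "index by creation date" idea concrete, here is a minimal sketch of selecting a batch for a time interval. It is not our production code: the in-memory `INDEX` list stands in for the DynamoDB table, and the bucket name, object keys, and the `keys_in_interval` helper are all hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timezone

# Hypothetical stand-in for the DynamoDB index: (creation time, S3 key) pairs,
# kept sorted by creation time. Bucket and keys are made up for illustration.
INDEX = [
    (datetime(2020, 10, 1, 0, 5, tzinfo=timezone.utc),
     "s3://example-bucket/events/2020-10-01/part-0001.json"),
    (datetime(2020, 10, 1, 13, 30, tzinfo=timezone.utc),
     "s3://example-bucket/events/2020-10-01/part-0002.json"),
    (datetime(2020, 10, 2, 8, 15, tzinfo=timezone.utc),
     "s3://example-bucket/events/2020-10-02/part-0001.json"),
]


def keys_in_interval(index, start, end):
    """Return the S3 keys whose creation time falls in [start, end)."""
    times = [created for created, _ in index]
    lo = bisect_left(times, start)  # first entry created at or after start
    hi = bisect_left(times, end)    # first entry created at or after end
    return [key for _, key in index[lo:hi]]


# Select everything created on 2020-10-01 (UTC) as one batch.
day_start = datetime(2020, 10, 1, tzinfo=timezone.utc)
day_end = datetime(2020, 10, 2, tzinfo=timezone.utc)
batch = keys_in_interval(INDEX, day_start, day_end)
print(batch)
```

Using a half-open interval means consecutive batches never pick up the same object twice.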

Problem Statement

Overall, AWS EMR does a great job. It is a reliable, scalable, and flexible tool for managing Apache Spark clusters. AWS EMR comes with out-of-the-box monitoring in the form of AWS CloudWatch, provides a rich toolbox that includes Zeppelin, Livy, Hue, etc., and has very good security features. But AWS EMR has its own downsides as well.

Portability: if you are building a multi-cloud or hybrid (cloud/on-prem) solution, be aware that migrating Spark applications off AWS EMR can be a major effort. After running on AWS EMR for a while, you can find yourself tightly coupled to AWS-specific features. These range from something simple, like logging and monitoring, to something more complicated, like an auto-scaling mechanism, custom master/worker AMIs, AWS security features, etc.

Cost overhead: the Amazon EMR price is charged in addition to the Amazon EC2 price of the underlying instances. Take a look at the pricing example here.
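As a rough illustration of that overhead, the arithmetic can be sketched as follows. The hourly rates below are placeholders chosen for easy math, not real AWS prices, and `hourly_cluster_cost` is a hypothetical helper.

```python
def hourly_cluster_cost(instance_count, ec2_rate, emr_rate):
    """Hourly cluster cost when every instance pays the EC2 rate plus an EMR surcharge."""
    return instance_count * (ec2_rate + emr_rate)


# Placeholder hourly rates per instance (NOT real AWS prices).
EC2_RATE = 0.20
EMR_RATE = 0.05

with_emr = hourly_cluster_cost(10, EC2_RATE, EMR_RATE)
ec2_only = hourly_cluster_cost(10, EC2_RATE, 0.0)

# Share of the bill that is pure EMR surcharge.
overhead_share = (with_emr - ec2_only) / with_emr
print(f"EMR surcharge share of the hourly bill: {overhead_share:.0%}")
```

With these made-up numbers, a fifth of the bill goes to the EMR surcharge alone. Plug in the current rates for your instance types to get your own figure.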

#kubernetes #aws #apache-spark #aws-eks #aws-emr

Christa Stehr

1602964260

50+ Useful Kubernetes Tools for 2020 - Part 2

Introduction

Last year, we provided a list of Kubernetes tools that proved so popular we have decided to curate another list of some useful additions for working with the platform—among which are many tools that we personally use here at Caylent. Check out the original tools list here in case you missed it.

According to a recent survey done by Stackrox, the dominance Kubernetes enjoys in the market continues to be reinforced, with 86% of respondents using it for container orchestration.

(State of Kubernetes and Container Security, 2020)

And as you can see below, more and more companies are jumping into containerization for their apps. If you’re among them, here are some tools to aid you going forward as Kubernetes continues its rapid growth.

(State of Kubernetes and Container Security, 2020)

#blog #tools #amazon elastic kubernetes service #application security #aws kms #botkube #caylent #cli #container monitoring #container orchestration tools #container security #containers #continuous delivery #continuous deployment #continuous integration #contour #developers #development #developments #draft #eksctl #firewall #gcp #github #harbor #helm #helm charts #helm-2to3 #helm-aws-secret-plugin #helm-docs #helm-operator-get-started #helm-secrets #iam #json #k-rail #k3s #k3sup #k8s #keel.sh #keycloak #kiali #kiam #klum #knative #krew #ksniff #kube #kube-prod-runtime #kube-ps1 #kube-scan #kube-state-metrics #kube2iam #kubeapps #kubebuilder #kubeconfig #kubectl #kubectl-aws-secrets #kubefwd #kubernetes #kubernetes command line tool #kubernetes configuration #kubernetes deployment #kubernetes in development #kubernetes in production #kubernetes ingress #kubernetes interfaces #kubernetes monitoring #kubernetes networking #kubernetes observability #kubernetes plugins #kubernetes secrets #kubernetes security #kubernetes security best practices #kubernetes security vendors #kubernetes service discovery #kubernetic #kubesec #kubeterminal #kubeval #kudo #kuma #microsoft azure key vault #mozilla sops #octant #octarine #open source #palo alto kubernetes security #permission-manager #pgp #rafay #rakess #rancher #rook #secrets operations #serverless function #service mesh #shell-operator #snyk #snyk container #sonobuoy #strongdm #tcpdump #tenkai #testing #tigera #tilt #vert.x #wireshark #yaml

Roberta Ward

1595344320

Wondering how to upgrade your skills in the pandemic? Here's a simple way you can do it.

The coronavirus pandemic has brought the world to a standstill.

Countries are under major lockdowns. Schools, colleges, theatres, gyms, clubs, and all other public places are shut down, economies are suffering, human health is at stake, people are losing their jobs, and nobody knows how much worse it can get.

Since most places are on lockdown and you are working from home, or otherwise have time to work on your skills, you should use this time wisely! We always complain that we want some ‘time’ to learn and upgrade our knowledge but can’t find it due to our ‘busy schedules’. So now is the time to make a ‘list of skills’ and learn and upgrade your skills at home!

And for technology-loving people like us, Knoldus Techhub has already helped a lot with doing that in a short span of time!

If you are still not aware of it, don’t worry. As Georgia Byng put it so well,

“No time is better than the present”

– Georgia Byng, a British children’s writer, illustrator, actress and film producer.

No matter if you are a developer (be it front-end or back-end), a data scientist, a tester, a DevOps person, or a learner with a keen interest in technology, Knoldus Techhub has brought it all together for you under one roof.

From technologies like Scala, Spark, and Elasticsearch to Angular, Go, and machine learning, it covers a total of 20 technologies, including some recently added ones, i.e. DAML, test automation, Snowflake, and Ionic.

How to upgrade your skills?

Every technology in Techhub has a number of templates. Once you click on a specific technology, you’ll be able to see all the templates for that technology. Since these templates are downloadable, you need to provide your email address to receive the download link by mail.

These templates help you learn the practical implementation of a topic with ease. Using them, you can learn and kick-start your development in no time.

Apart from learning, there are some out-of-the-box templates that can help solve your business problem, with all the basic dependencies and implementations already plugged in. Techhub calls these templates xlr8rs (pronounced ‘accelerators’).

xlr8rs make your development really fast: you just add your core business logic to the template.

If you are looking for a template that’s not available, you can also request one, whether for learning or as a solution to your business problem, and Techhub will get in touch with you to provide it. Isn’t this helpful 🙂

Confused about which technology to start with?

To keep you updated, Knoldus Techhub shows you the most trending technology and the most downloaded templates at present. This way you’ll stay informed and can learn what’s trending.

Since we believe:

“There’s always scope for improvement”

If you still feel like it isn’t helping you in learning and development, you can provide your feedback in the feedback section in the bottom right corner of the website.

#ai #akka #akka-http #akka-streams #amazon ec2 #angular 6 #angular 9 #angular material #apache flink #apache kafka #apache spark #api testing #artificial intelligence #aws #aws services #big data and fast data #blockchain #css #daml #devops #elasticsearch #flink #functional programming #future #grpc #html #hybrid application development #ionic framework #java #java11 #kubernetes #lagom #microservices #ml # ai and data engineering #mlflow #mlops #mobile development #mongodb #non-blocking #nosql #play #play 2.4.x #play framework #python #react #reactive application #reactive architecture #reactive programming #rust #scala #scalatest #slick #software #spark #spring boot #sql #streaming #tech blogs #testing #user interface (ui) #web #web application #web designing #angular #coronavirus #daml #development #devops #elasticsearch #golang #ionic #java #kafka #knoldus #lagom #learn #machine learning #ml #pandemic #play framework #scala #skills #snowflake #spark streaming #techhub #technology #test automation #time management #upgrade

AWS Fargate for Amazon Elastic Kubernetes Service | Caylent

On-demand cloud computing brings new ways to ensure scalability and efficiency. Rather than pre-allocating and managing certain server resources or having to go through the usual process of setting up a cloud cluster, apps and microservices can now rely on on-demand serverless computing blocks designed to be efficient and highly optimized.

Amazon Elastic Kubernetes Service (EKS) already makes running Kubernetes on AWS very easy. Support for AWS Fargate, which introduces the on-demand serverless computing element to the environment, makes deploying Kubernetes pods even easier and more efficient. AWS Fargate offers a wide range of features that make managing clusters and pods intuitive.

Utilizing Fargate

As with many other AWS services, using Fargate to manage Kubernetes clusters is very easy to do. To integrate Fargate and run a cluster on top of it, you only need to add the --fargate flag to the end of your eksctl create cluster command.
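As a minimal sketch of what that looks like, the snippet below assembles such an eksctl invocation in Python and prints it without running it. The cluster name and region are placeholders, and `eksctl_create_cluster_args` is a hypothetical helper, not part of eksctl.

```python
import shlex


def eksctl_create_cluster_args(name, region, fargate=True):
    """Build the argument list for `eksctl create cluster`, optionally on Fargate."""
    args = ["eksctl", "create", "cluster", "--name", name, "--region", region]
    if fargate:
        # The only Fargate-specific addition is this single flag.
        args.append("--fargate")
    return args


# Placeholder cluster name and region, for illustration only.
command = eksctl_create_cluster_args("demo-cluster", "us-east-1")
print(shlex.join(command))
```

To actually create the cluster, you could pass the same list to `subprocess.run(command, check=True)` on a machine where eksctl is installed and AWS credentials are configured.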

EKS automatically configures the cluster to run on Fargate. It creates a pod execution role so that pod creation and management can be automated in an on-demand environment. It also patches coredns so the cluster can run smoothly on Fargate.

A Fargate profile is automatically created by the command. You can choose to customize the profile later or configure namespaces yourself, but the default profile is suitable for a wide range of applications already, requiring no human input other than a namespace for the cluster.
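To illustrate how a profile decides which pods land on Fargate, here is a simplified model of selector matching. It is a sketch of the idea rather than the EKS implementation. The `Selector` and `FargateProfile` classes are hypothetical, and the default profile shown, named `fp-default` and covering the `default` and `kube-system` namespaces, is an assumption you should verify against your own cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Selector:
    namespace: str
    labels: dict = field(default_factory=dict)

    def matches(self, pod_namespace, pod_labels):
        # The namespace must match, and every selector label must be present on the pod.
        if pod_namespace != self.namespace:
            return False
        return all(pod_labels.get(k) == v for k, v in self.labels.items())


@dataclass
class FargateProfile:
    name: str
    selectors: list

    def selects(self, pod_namespace, pod_labels=None):
        # A pod is eligible for Fargate if any selector in the profile matches it.
        pod_labels = pod_labels or {}
        return any(s.matches(pod_namespace, pod_labels) for s in self.selectors)


# Assumed shape of the default profile (verify against your cluster).
default_profile = FargateProfile(
    "fp-default", [Selector("default"), Selector("kube-system")]
)

# Hypothetical custom profile that only picks up labeled pods in "batch".
batch_profile = FargateProfile(
    "fp-batch", [Selector("batch", {"compute": "fargate"})]
)

print(default_profile.selects("default"))
print(batch_profile.selects("batch", {"compute": "fargate", "app": "etl"}))
print(batch_profile.selects("batch", {"app": "etl"}))
```

In this model, pods that match no profile are simply not scheduled on Fargate, which is why the namespace choice is the one input you still have to think about.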

There are some prerequisites to keep in mind though. For starters, Fargate requires eksctl version 0.20.0 or later. Fargate also comes with some limitations, starting with support for only a handful of regions. In addition, Fargate doesn’t support stateful apps, DaemonSets, or privileged containers at the moment. Check the list of Fargate limitations before you commit to it.

Support for conventional load balancing is also limited, which is why ALB Ingress Controller is recommended. At the time of this writing, Classic Load Balancers and Network Load Balancers are not supported yet.

However, you can still be very meticulous in how you manage your clusters, including using different clusters to separate trusted and untrusted workloads.

Everything else is straightforward. Once the cluster is created, you can begin specifying pod execution roles for Fargate. You can use the IAM console to create a role and assign it to a Fargate cluster, or you can create IAM roles and Fargate profiles via Terraform.

#aws #blog #amazon eks #aws fargate #aws management console #aws services #kubernetes #kubernetes clusters #kubernetes deployment #kubernetes pods

The Story of a Migration from EMR to Spark on Kubernetes

In this article, the co-founder of Lingk tells the story of their migration from EMR to the Spark-on-Kubernetes platform managed by Data Mechanics: their goals, the architecture of the solution & challenges they had to address, and the results they obtained.

Goals of this migration

Lingk.io is a data loading, data pipelines, and integration platform built on top of Apache Spark, serving commercial customers, with expertise in the education sector. In a few clicks from their visual interface, their customers can load, deduplicate, and enrich data from dozens of sources.

Under the hood, Lingk used AWS EMR (Elastic MapReduce) to power their product. But they were facing a few issues:

  • EMR required too much infrastructure management from their DevOps team, which had limited Spark experience: picking the right cluster instance types, memory settings, Spark configs, etc.
  • Their total AWS costs were high. They suspected that EMR’s autoscaling policies were not very efficient and that a lot of compute resources were wasted.
  • Spark apps took 40 seconds to start on average. It’s a long time during which Lingk’s end users had to wait, particularly if they’re building a new data pipeline or integration.
  • The core Spark application was stuck at an earlier version because upgrading Spark to 3.0+ caused unexplained performance regressions.

#spark #apache-spark #data-engineering #kubernetes #emr