Bongani Ngema


How to Build Enterprise Data Lake with AWS Cloud

Data Lake

A data lake is a central repository that stores enterprise data in one common place, where data wranglers can access it for their analytical needs. A data lake differs from a normal database: a data lake can store current and historical data from different systems in raw form for analysis, whereas a database stores the current, updated data for an application. The data that organisations preserve can be of any shape (structured, unstructured or semi-structured) and can be saved in any desired format, such as CSV, Apache Parquet, XML or JSON. There is no limit on the size of this data, so we need a mechanism to ingest it, most often through batch or stream processing. Potential users of this data also expect the data lake to be secure and to ensure data governance. Hence, we need a data lake with proper security and controls governing access, and these controls should be independent of the data access method.

Data Lake Benefits

  • Make data accessible by storing it in a common place, available to everyone based on privileges set by data custodians (who manage and own this data).
  • Store raw data at scale for a low cost.
  • Unlock data from different domains in just a few clicks.
  • Provide a leading industry experience to different data personas.
  • Capture the value of each data set stored in the lake to provide a valuable experience and a competitive edge.
  • Make it more comprehensive with search, filtering and navigation capabilities, so that it works like a search engine (a Google for your organisation).

Now, to make this data lake accessible to users, we need a web-based application. A data catalog is one way to address this need: it acts as a persistent metadata store that facilitates data exploration across different data stores.

Data Lake (ELT Tool) vs. Data Warehouse (ETL Tool)

Let’s look at how a data lake differs from a data warehouse. ETL (Extract, Transform and Load) is what happens within a data warehouse, and ELT (Extract, Load and Transform) within a data lake. A DWH (data warehouse) serves as an integration platform for data from different sources. It creates structured data during ETL, which can then be used for various analytical needs, whereas a DL (data lake) can preserve data in structured, unstructured or semi-structured form without a specific purpose or need defined up front. Data in a lake gains value over time through gradual transformation and other analytical processes. The schema of this data is defined when it is processed or read, rather than when it is written, so data in a data lake is highly configurable and can adapt to changing requirements. Data lakes work well for real-time and big data needs. Hence, a business whose data needs change drastically should build a data lake, whereas one with slowly changing, structured data needs can build a data warehouse.
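The point about defining the schema at read time can be illustrated with a small sketch. It is only an illustration under stated assumptions: the sample records, the field names and the `read_with_schema` helper are hypothetical and not part of any AWS service. The raw JSON lines are kept exactly as received (the "load" step of ELT), and a schema chosen by the reader is applied only when the data is read.

```python
import json

# Raw events land in the lake exactly as received (ELT: load first).
# These sample records and field names are hypothetical.
RAW_LINES = [
    '{"order_id": "1", "amount": "19.99", "country": "ZA"}',
    '{"order_id": "2", "amount": "5", "extra_field": "ignored"}',
    '{"order_id": "3"}',
]

# A schema chosen by the reader at query time: field -> (converter, default).
ORDER_SCHEMA = {
    "order_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default; extra fields are ignored.
            row[field] = convert(value) if value is not None else default
        rows.append(row)
    return rows


orders = read_with_schema(RAW_LINES, ORDER_SCHEMA)
total = sum(row["amount"] for row in orders)
print(orders)
print(round(total, 2))
```

A different consumer could read the same raw lines with a different schema, which is what makes this approach flexible compared with fixing the structure during ETL.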

Data Lake for Big Data

In this age of big data, systems may collect several million rows of data per second in any format, and all of it can be stored and used in a data lake. A related approach is the Data Vault methodology and modelling technique, presented here as a governed data lake that addresses some of the limitations of a DWH. A vault provides durability and accelerates business value.

Deploying Data Lakes on Cloud

A data lake is considered an ideal workload to deploy in the cloud for scalability, reliability, availability, performance and analytics. Users see the cloud as beneficial for deploying a data lake because of better security, faster deployment, elasticity, a pay-as-you-go model and broader coverage across geographies.

Build a Data Lake on AWS Cloud

Now let’s turn to the final part of this discussion: how to build a data lake in the cloud using different AWS services.

Data Collection: Collect and extract data from different sources, including flat files, APIs, SQL or NoSQL databases, or cloud storage such as S3.

Data Load: Load this raw, unprocessed data into an Amazon S3 bucket for storage. This bucket acts as the landing bucket.
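As a rough sketch of the load step, the snippet below writes one raw payload into a landing bucket under a date-partitioned key. The bucket name, the `raw/<source>/year=/month=/day=` key layout and the helper names are assumptions made for illustration. The only AWS call used is `put_object` on an S3 client, and a small in-memory stand-in replaces the real client so the sketch runs without credentials.

```python
from datetime import date


def landing_key(source, filename, day):
    """Build a date-partitioned object key for the landing bucket."""
    return f"raw/{source}/year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"


def upload_raw(s3_client, bucket, source, filename, body, day):
    """Store one raw, unprocessed payload under its landing key."""
    key = landing_key(source, filename, day)
    # put_object(Bucket=..., Key=..., Body=...) is the boto3 S3 client call.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key


class FakeS3:
    """In-memory stand-in so the sketch runs without AWS credentials."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body


fake = FakeS3()
key = upload_raw(
    fake,
    "example-landing-bucket",  # hypothetical bucket name
    "crm",
    "customers.csv",
    b"id,name\n1,Ada\n",
    date(2024, 1, 15),
)
print(key)
```

Against a real account, `boto3.client("s3")` could be passed in place of `FakeS3()`, assuming boto3 is installed and credentials are configured.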

Data Transformation: Then use an ETL tool such as AWS Glue for data processing and transformation.
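To give a feel for what such a transformation does, here is a plain-Python stand-in. It is not a Glue job script (Glue jobs are usually written against Glue's own Spark-based libraries). The sample CSV, the column names and the `transform` function are hypothetical. The sketch trims whitespace, casts types and sets aside rows that cannot be parsed.

```python
import csv
import io

# Hypothetical raw file content as it might arrive in the landing bucket.
RAW_CSV = """order_id,amount,country
1, 19.99 ,za
2,not-a-number,ZA
3,5,gb
"""


def transform(raw_text):
    """Clean raw CSV rows: trim whitespace, cast types, reject invalid rows."""
    clean, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            clean.append({
                "order_id": int(row["order_id"]),
                "amount": float(row["amount"].strip()),
                "country": row["country"].strip().upper(),
            })
        except ValueError:
            # Keep bad rows aside instead of silently dropping them.
            rejected.append(row)
    return clean, rejected


clean_rows, rejected_rows = transform(RAW_CSV)
print(clean_rows)
print(len(rejected_rows))
```

Keeping the rejected rows makes it possible to report data-quality problems rather than lose them during processing.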

Data Governance: Enable security settings and access controls on the transformed, processed data to ensure data governance. A data catalog can be built to store metadata and support exploration across different data stores.
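One possible form of such an access control is an S3 bucket policy that lets a single role read only the processed data. The sketch below just builds the policy document as JSON. The bucket name, prefix, account ID and role name are placeholders, and a real setup would likely combine this with other controls.

```python
import json


def read_only_policy(bucket, prefix, role_arn):
    """Return an S3 bucket policy granting read access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAnalystsToReadProcessed",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:GetObject"],
                # Only objects under the given prefix are covered.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


policy = read_only_policy(
    "example-curated-bucket",  # hypothetical bucket name
    "processed",  # hypothetical prefix
    "arn:aws:iam::123456789012:role/ExampleAnalystRole",  # placeholder role
)
print(json.dumps(policy, indent=2))
```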

Data Curation: Curate the processed data in a second, target S3 bucket or in Amazon Redshift (as a DWH).

Data Notification & Monitoring: Amazon SNS can be used for intermediate notifications and alerting for the various jobs. Amazon CloudWatch can be used for monitoring and logging.
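A job-status alert might look like the following sketch. The topic ARN, the job name and the `notify_job_status` helper are hypothetical. The only SNS call used is `publish` with `TopicArn`, `Subject` and `Message`, and a small fake client stands in for the real one so the example runs offline.

```python
def notify_job_status(sns_client, topic_arn, job_name, status, detail=""):
    """Publish a short job-status alert to an SNS topic."""
    # SNS limits the Subject field to 100 characters.
    subject = f"[{status}] {job_name}"[:100]
    message = f"Job {job_name} finished with status {status}. {detail}".strip()
    sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=message)
    return subject, message


class FakeSNS:
    """Records published messages instead of calling AWS."""

    def __init__(self):
        self.sent = []

    def publish(self, TopicArn, Subject, Message):
        self.sent.append((TopicArn, Subject, Message))


# Placeholder topic ARN for illustration only.
TOPIC = "arn:aws:sns:eu-west-1:123456789012:example-lake-alerts"
fake_sns = FakeSNS()
subject, message = notify_job_status(
    fake_sns, TOPIC, "nightly-orders", "FAILED", "3 rows rejected"
)
print(subject)
print(message)
```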

Data Analytics: From the second S3 bucket or Redshift, where the transformed data was curated, we can query and analyse data for various business requirements via Amazon Athena and Amazon QuickSight. Data scientists can also use this data to build and train ML models.
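Submitting a query to Athena could be sketched as below. The database name, the `orders` table, the result location and the `run_query` helper are assumptions for illustration. The call shown, `start_query_execution`, is the Athena client operation for starting a query, and a fake client is used here so the sketch runs without an AWS account.

```python
def run_query(athena_client, sql, database, output_location):
    """Start an Athena query and return its execution ID."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


class FakeAthena:
    """Captures the request and returns a canned execution ID."""

    def __init__(self):
        self.requests = []

    def start_query_execution(self, **kwargs):
        self.requests.append(kwargs)
        return {"QueryExecutionId": "fake-0001"}


# Hypothetical table and columns in the curated data set.
SQL = "SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country"
fake_athena = FakeAthena()
query_id = run_query(
    fake_athena,
    SQL,
    "example_lake_db",  # hypothetical database name
    "s3://example-athena-results/",  # hypothetical result location
)
print(query_id)
```

Athena runs queries asynchronously, so a real client would then poll for completion before fetching results.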

Original article source at:

#aws #cloud #data 

Gerhard Brink


Getting Started With Data Lakes

Frameworks for Efficient Enterprise Analytics

The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.

This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.


As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).

This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.

#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management

Database Vs Data Warehouse Vs Data Lake: A Simple Explanation

Databases store data in a structured form. The structure makes it possible to find and edit data. Because of this structure, databases are used for data management, data storage, data evaluation, and targeted processing of data.
In this sense, data is all information that is to be saved and later reused in various contexts. These can be date and time values, texts, addresses, numbers, but also pictures. The data should be able to be evaluated and processed later.

The amount of data a database can store is limited, so enterprises tend to use data warehouses, which are designed for huge volumes of data.

#data-warehouse #data-lake #cloud-data-warehouse #what-is-aws-data-lake #data-science #data-analytics #database #big-data #web-monetization

Adaline Kulas


Multi-cloud Spending: 8 Tips To Lower Cost

A multi-cloud approach is nothing but leveraging two or more cloud platforms for meeting the various business requirements of an enterprise. The multi-cloud IT environment incorporates different clouds from multiple vendors and negates the dependence on a single public cloud service provider. Thus enterprises can choose specific services from multiple public clouds and reap the benefits of each.

Given its affordability and agility, most enterprises opt for a multi-cloud approach in cloud computing now. A 2018 survey on the public cloud services market points out that 81% of the respondents use services from two or more providers. Subsequently, the cloud computing services market has reported incredible growth in recent times. The worldwide public cloud services market is all set to reach $500 billion in the next four years, according to IDC.

By choosing multi-cloud solutions strategically, enterprises can optimize the benefits of cloud computing and aim for some key competitive advantages. They can avoid the lengthy and cumbersome processes involved in buying, installing and testing high-priced systems. IaaS and PaaS solutions have become a windfall for the enterprise’s budget, as they do not incur huge up-front capital expenditure.

However, cost optimization is still a challenge while facilitating a multi-cloud environment and a large number of enterprises end up overpaying with or without realizing it. The below-mentioned tips would help you ensure the money is spent wisely on cloud computing services.

  • Deactivate underused or unattached resources

Most organizations tend to get simple things wrong, and this turns out to be the root cause of needless spending and resource wastage. The first step to cost optimization in your cloud strategy is to identify underutilized resources that you have been paying for.

Enterprises often continue to pay for resources that were purchased earlier but are no longer useful. Identifying such unused and unattached resources and deactivating them on a regular basis brings you one step closer to cost optimization. If needed, you can deploy automated cloud management tools that provide the analytics needed to optimize cloud spending and cut costs on an ongoing basis.

  • Figure out idle instances

Another key cost optimization strategy is to identify idle computing instances and consolidate them into fewer instances. An idle computing instance may use only 1-5% of its CPU, but the service provider may still bill you for 100% of that instance.

Every enterprise will have such non-production instances that take up unnecessary storage space and lead to overpaying. Re-evaluating your resource allocations regularly and removing unnecessary storage may help you save money significantly. Resource allocation is not only a matter of CPU and memory but is also linked to storage, network, and various other factors.
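Flagging idle instances can be sketched in a few lines. The instance names and CPU samples below are made up, and the 5% threshold simply mirrors the 1-5% range mentioned above. In practice the samples would come from a monitoring service.

```python
from statistics import mean

# Hypothetical CPU utilisation samples (percent) per instance.
CPU_SAMPLES = {
    "web-1": [42.0, 55.5, 61.2, 48.9],
    "batch-test": [1.2, 2.5, 0.8, 3.1],
    "old-demo": [4.9, 4.0, 3.5, 4.4],
}


def idle_instances(samples, threshold=5.0):
    """Return names of instances whose average CPU is below the threshold."""
    return sorted(
        name
        for name, values in samples.items()
        if values and mean(values) < threshold
    )


candidates = idle_instances(CPU_SAMPLES)
print(candidates)
```

Instances on this list are candidates for consolidation or shutdown, subject to a check of memory, storage and network use.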

  • Deploy monitoring mechanisms

The key to efficient cost reduction in cloud computing technology lies in proactive monitoring. A comprehensive view of the cloud usage helps enterprises to monitor and minimize unnecessary spending. You can make use of various mechanisms for monitoring computing demand.

For instance, you can use a heatmap to visualize the highs and lows in computing demand. The heatmap indicates suitable start and stop times, which in turn leads to reduced costs. You can also deploy automated tools that schedule instances to start and stop. By following a heatmap, you can understand whether it is safe to shut down servers on holidays or weekends.
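A very small text heatmap shows the idea. The usage figures, the four shading levels and the 10% "quiet" threshold are all assumptions chosen for illustration. Each day is split into six four-hour blocks.

```python
# Hypothetical average CPU % per 4-hour block (6 blocks per day).
USAGE = {
    "Mon": [4, 3, 62, 78, 70, 15],
    "Tue": [5, 4, 65, 80, 72, 18],
    "Sat": [2, 1, 3, 4, 2, 1],
    "Sun": [1, 1, 2, 3, 2, 1],
}
SHADES = " .:#"  # low to high utilisation, one character per 25%


def shade(value):
    """Map a utilisation percentage to one of four shading characters."""
    index = min(int(value // 25), len(SHADES) - 1)
    return SHADES[index]


def render(usage):
    """Return one shaded row of characters per day."""
    return {day: "".join(shade(v) for v in blocks) for day, blocks in usage.items()}


def quiet_days(usage, threshold=10):
    """Days whose peak utilisation stays below the threshold."""
    return [day for day, blocks in usage.items() if max(blocks) < threshold]


for day, row in render(USAGE).items():
    print(f"{day} |{row}|")
print(quiet_days(USAGE))
```

Days that stay below the threshold all day are the ones where scheduled shutdowns are most likely to be safe.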

#cloud computing services #all #hybrid cloud #cloud #multi-cloud strategy #cloud spend #multi-cloud spending #multi cloud adoption #why multi cloud #multi cloud trends #multi cloud companies #multi cloud research #multi cloud market

iOS App Dev


Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Data Lakes Are Not Just For Big Data - DZone Big Data

We recently wrote an article debunking common myths about data lake architectures, data lake definitions, and data lake analytics. It is called "What is a Data Lake? Get A Leg Up Avoiding The Biggest Myths." In that article, we framed the current conversation about data lakes and how they fit within enterprise data strategies. This topic has historically been confusing and opaque for those wanting to get value from a data lake due to conflicting advice from consultants and vendors.

One area that can be particularly confusing is the perception that lakes are only for “big data.” If you spend any time reading materials on lakes, you would think there is only one type and it would look like the Caspian Sea (it’s a lake despite “sea” in the name). People describe data lakes as massive, all-encompassing entities, designed to hold all knowledge. The good news is that lakes are not just for “big data” and you have more opportunities than ever to have them be part of your data stack.

Yes, There Are Different Types of Data Lakes

Just as they do in nature, lakes come in all different shapes and sizes. Each has a natural state, often reflecting ecosystems of data, just like those in nature reflect ecosystems of fish, birds, or other organisms.

Unfortunately, the “big data” angle gives the impression that lakes are only for “Caspian” scale data endeavors. This certainly makes the use of data lakes intimidating. As a result, describing things in such massive terms makes the concept of a lake inaccessible to those who can benefit from one on a smaller scale. Here are a few data lake examples:

  • The Great “Caspian”: Just like the Caspian is a large body of water, this type of lake is a large, broad repository of diverse data. This broad collection of diverse data reflects information from across the enterprise. This is how most data lake efforts are framed.
  • Temporary “Ephemeral”: Just like deserts can have small, temporary lakes, an Ephemeral lake exists for a short period of time. It may be used for a project, pilot, PoC or a point solution, and it is turned off as quickly as it was turned on.
  • Domain “Project”: These lakes, like Ephemeral data lakes, are often focused on specific knowledge domains. However, unlike the Ephemeral lake, this lake will persist over time. These may also be “shallow,” meaning they may be focused on a narrow domain of data such as media, social, web analytics, email, or similar data sources.

We recently worked with a customer to create a “Domain” type lake. This lake would hold Adobe event data in AWS to support an enterprise Oracle Cloud environment. Why AWS to Oracle? It was an efficient and cost-effective data consumption pattern for the customer’s Oracle BI environment, especially considering the agility and economics of using an AWS lake and Athena as the on-demand query service for lake content.

By design, all types of lakes should embrace an abstraction that minimizes risk and affords you greater flexibility. Also, they should be structured for easy consumption independent of their size. This ensures that data scientists, business users and analysts all have an environment structured for easy data consumption.

Getting Started With Data Lakes

Being a successful early adopter means taking a business value approach rather than a technology one. Here are a few tips as you think about how to get started:

  • Focus: Seek opportunities where you can deploy an “Ephemeral” or “Project” solution. This will ensure you reduce risk and overcome technical and organizational challenges so your team can build confidence with lakes.
  • Passion: Make sure you have an “evangelist” or “advocate” internally, someone who is passionate about the solution and adoption within the company.
  • Simple: Embrace simplicity and agility, and put people, processes, and technology choices through this lens. The lack of complexity should not be seen as a deficiency but as a byproduct of thoughtful design.
  • Narrow: Keep the scope narrow and well defined by limiting your lake to well-understood data, say exports from ERP, CRM, Point-of-Sales, Marketing, or Advertising systems. Data literacy at this stage will help you understand the workflow around data structure, ingest, governance, quality, and testing.
  • Experiment: Pair your lake with modern BI and analytics tools like Tableau, Power BI, Amazon QuickSight, or Looker. This gives non-technical users an opportunity to experiment and explore data access via a lake. It also lets you engage a different user base that can assess performance bottlenecks and discover opportunities for improvement, possible linkages to any existing EDW systems (or other data systems), and additional candidate data sources.

#big data #data lake #data lakes #data lake architecture #data lake solutions #data analysis