The opportunities big data offers also come with very real challenges that many organizations face today. Often, the first challenge is finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats coming from a growing number of sources. Organizations then need the analytical capabilities and flexibility to turn this data into insights that meet their specific business objectives.
This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.
As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).
#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management
In this article, we will show how to use Python to automate Excel with openpyxl, a useful Python library for Excel automation.
Openpyxl is a Python library used to read from and write to Excel files. Data scientists use openpyxl for data analysis, data copying, data mining, drawing charts, styling sheets, adding formulas, and more.
Workbook: A spreadsheet is represented as a workbook in openpyxl. A workbook consists of one or more sheets.
Sheet: A sheet is a single page composed of cells for organizing data.
Cell: The intersection of a row and a column is called a cell. Usually represented by A1, B5, etc.
Row: A row is a horizontal line represented by a number (1,2, etc.).
Column: A column is a vertical line represented by a capital letter (A, B, etc.).
Openpyxl can be installed using the pip command, and it is recommended to install it in a virtual environment.
pip install openpyxl
We start by creating a new spreadsheet, which is called a workbook in openpyxl. We import the Workbook class from openpyxl and call Workbook() to create a new workbook.
from openpyxl import Workbook

# create a new workbook
wb = Workbook()

# get the first active worksheet
ws = wb.active

# create new worksheets using the create_sheet method
ws1 = wb.create_sheet("sheet1", 0)   # inserts at first position
ws2 = wb.create_sheet("sheet2")      # inserts at last position
ws3 = wb.create_sheet("sheet3", -1)  # inserts at penultimate position

# rename the active sheet
ws.title = "Example"

# save the workbook
wb.save(filename="example.xlsx")
We load an existing file using the function load_workbook(), which takes the filename as an argument. The file must be in the current working directory.
import openpyxl

# load a workbook
wb = openpyxl.load_workbook("example.xlsx")

# get the sheet names
wb.sheetnames
# ['sheet1', 'Sheet', 'sheet3', 'sheet2']

# get a particular sheet
sheet1 = wb["sheet2"]

# get the sheet title
sheet1.title
# 'sheet2'

# get the active sheet (here the sheet titled 'sheet1')
sheetactive = wb.active
# get a cell from the sheet
sheet1["A1"]
# <Cell 'Sheet1'.A1>

# get the cell value
ws["A1"].value
# 'Segment'

# access a cell by row and column and assign a value
d = ws.cell(row=4, column=2, value=10)
d.value
# 10
# loop through each row and column
for x in range(1, 5):
    for y in range(1, 5):
        print(x, y, ws.cell(row=x, column=y).value)

# get the highest row number
ws.max_row
# 701

# get the highest column number
ws.max_column
# 19
There are two methods for iterating through rows and columns: iter_rows() returns the rows, and iter_cols() returns the columns. Keyword arguments such as min_row=4, max_row=5, min_col=2, and max_col=5 can be passed to set the boundaries for the iteration.
Example:
# iterating over rows
for row in ws.iter_rows(min_row=2, max_col=3, max_row=3):
    for cell in row:
        print(cell)

# <Cell 'Sheet1'.A2>
# <Cell 'Sheet1'.B2>
# <Cell 'Sheet1'.C2>
# <Cell 'Sheet1'.A3>
# <Cell 'Sheet1'.B3>
# <Cell 'Sheet1'.C3>
# iterating over columns
for col in ws.iter_cols(min_row=2, max_col=3, max_row=3):
    for cell in col:
        print(cell)

# <Cell 'Sheet1'.A2>
# <Cell 'Sheet1'.A3>
# <Cell 'Sheet1'.B2>
# <Cell 'Sheet1'.B3>
# <Cell 'Sheet1'.C2>
# <Cell 'Sheet1'.C3>
To get all the rows of the worksheet, we use worksheet.rows, and to get all the columns, we use worksheet.columns. Similarly, to iterate only over the values, we use worksheet.values.
Example:
for row in ws.values:
    for value in row:
        print(value)
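For comparison, here is a minimal sketch of worksheet.rows and worksheet.columns, which yield tuples of cell objects rather than plain values (the output depends on your sheet's contents):
# iterate over all rows as tuples of cells
for row in ws.rows:
    print(row[0].value)  # value of the first cell in each row

# iterate over all columns as tuples of cells
for col in ws.columns:
    print(col[0].value)  # value of the top cell in each column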
Writing to a workbook can be done in many ways, such as adding formulas, adding charts and images, updating cell values, and inserting rows and columns. We will discuss each of these with an example.
# create a new workbook
wb = openpyxl.Workbook()

# save the workbook
wb.save("new.xlsx")

# create a new sheet
ws1 = wb.create_sheet(title="sheet 2")

# create a new sheet at index 0
ws2 = wb.create_sheet(index=0, title="sheet 0")

# check the sheet names
wb.sheetnames
# ['sheet 0', 'Sheet', 'sheet 2']

# delete a sheet
del wb['sheet 0']

# check the sheet names again
wb.sheetnames
# ['Sheet', 'sheet 2']

# get the default sheet and check a cell value
ws = wb['Sheet']
ws['B2'].value
# None

# add a value to the cell
ws['B2'] = 367

# check the value
ws['B2'].value
# 367
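Inserting and deleting rows and columns, mentioned above, is handled by the worksheet methods insert_rows(), insert_cols(), delete_rows(), and delete_cols(); appending a row is done with append(). A minimal sketch continuing with the same worksheet (the sample values are only illustrative):
# append a row of values at the bottom of the sheet
ws.append(["Alice", 30, "Engineer"])

# insert two empty rows before row 2
ws.insert_rows(2, amount=2)

# insert one empty column before column C
ws.insert_cols(3)

# delete the rows and the column we just inserted
ws.delete_rows(2, amount=2)
ws.delete_cols(3)

wb.save("new.xlsx")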
We often need to include formulas in an Excel sheet. We can easily add formulas using the openpyxl module, just as we add values to a cell.
For example:
import openpyxl

wb = openpyxl.load_workbook("new1.xlsx")
ws = wb['Sheet']

# add a SUM formula to cell A9
ws['A9'] = '=SUM(A2:A8)'

wb.save("new2.xlsx")
The above program will add the formula =SUM(A2:A8) to cell A9. The result will be as below.
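Note that openpyxl does not evaluate formulas itself; Excel does that when the file is opened. If you need the value Excel last calculated for a formula cell, you can reload the file with data_only=True, as in this minimal sketch:
import openpyxl

# data_only=True returns cached formula results
# (None if the file has never been opened and saved in Excel)
wb2 = openpyxl.load_workbook("new2.xlsx", data_only=True)
print(wb2['Sheet']['A9'].value)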
Two or more cells can be merged into a rectangular area using the method merge_cells(), and similarly, they can be unmerged using the method unmerge_cells().
For example:
# merge cells B2 to C9
ws.merge_cells('B2:C9')
ws['B2'] = "Merged cells"
Adding the above code to the previous example will merge cells as below.
#unmerge cells B2 to C9
ws.unmerge_cells('B2:C9')
The above code will unmerge cells from B2 to C9.
To insert an image, we import the Image class from the module openpyxl.drawing.image. We then load our image and add it to a cell, as shown in the example below.
Example:
import openpyxl
from openpyxl.drawing.image import Image

wb = openpyxl.load_workbook("new1.xlsx")
ws = wb['Sheet']

# load the image (it should be in the same folder)
img = Image('logo.png')
ws['A1'] = "Adding image"

# adjust the image size
img.height = 130
img.width = 200

# add the image at cell A3
ws.add_image(img, 'A3')

wb.save("new2.xlsx")
Result:
Charts are essential for visualizing data. We can create charts from Excel data using the openpyxl chart module. Different chart types can be created, such as line charts, bar charts, and 3D bar charts. To build a chart, we create a Reference that holds the data to be plotted, which is simply a selection of cells (rows and columns). In the example below, I am using sample data to create a 3D bar chart:
Example:
import openpyxl
from openpyxl.chart import BarChart3D, Reference

wb = openpyxl.load_workbook("example.xlsx")
ws = wb.active

# select the data for the chart
values = Reference(ws, min_col=3, min_row=2, max_col=3, max_row=40)

# create the chart and add the data
chart = BarChart3D()
chart.add_data(values)

# place the chart at cell E3
ws.add_chart(chart, "E3")
wb.save("MyChart.xlsx")
Result:
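As an optional refinement (using the same chart object from the example above, before calling wb.save), openpyxl also lets you label the chart:
# set a chart title and axis labels before saving
chart.title = "Sample 3D Bar Chart"
chart.x_axis.title = "Category"
chart.y_axis.title = "Value"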
Welcome to another video! In this video, we will cover how to use Python to automate Excel. I'll be going over everything from creating workbooks to accessing individual cells and stylizing cells. There is a ton you can do with Excel, but I'll just be covering the core things in OpenPyXL.
⭐️ Timestamps ⭐️
00:00 | Introduction
02:14 | Installing openpyxl
03:19 | Testing Installation
04:25 | Loading an Existing Workbook
06:46 | Accessing Worksheets
07:37 | Accessing Cell Values
08:58 | Saving Workbooks
09:52 | Creating, Listing and Changing Sheets
11:50 | Creating a New Workbook
12:39 | Adding/Appending Rows
14:26 | Accessing Multiple Cells
20:46 | Merging Cells
22:27 | Inserting and Deleting Rows
23:35 | Inserting and Deleting Columns
24:48 | Copying and Moving Cells
26:06 | Practical Example, Formulas & Cell Styling
📄 Resources 📄
OpenPyXL Docs: https://openpyxl.readthedocs.io/en/stable/
Code Written in This Tutorial: https://github.com/techwithtim/ExcelPythonTutorial
Subscribe: https://www.youtube.com/c/TechWithTim/featured
We recently wrote an article debunking common myths about data lake architectures, data lake definitions, and data lake analytics, called "What Is a Data Lake? Get a Leg Up Avoiding the Biggest Myths." In that article, we framed the current conversation about data lakes and how they fit within enterprise data strategies. This topic has historically been confusing and opaque for those wanting to get value from a data lake due to conflicting advice from consultants and vendors.
One area that can be particularly confusing is the perception that lakes are only for “big data.” If you spend any time reading materials on lakes, you would think there is only one type and that it looks like the Caspian Sea (it’s a lake despite the “sea” in its name). People describe data lakes as massive, all-encompassing entities designed to hold all knowledge. The good news is that lakes are not just for “big data,” and you have more opportunities than ever to make them part of your data stack.
Just as they do in nature, lakes come in all different shapes and sizes. Each has a natural state, often reflecting ecosystems of data, just like those in nature reflect ecosystems of fish, birds, or other organisms.
Unfortunately, the “big data” angle gives the impression that lakes are only for “Caspian”-scale data endeavors, which makes data lakes intimidating. Describing them in such massive terms puts the concept out of reach for those who can benefit from a lake on a smaller scale. Here are a few data lake examples:
We recently worked with a customer to create a “domain” type lake. This lake holds Adobe event data in AWS to support an enterprise Oracle Cloud environment. Why AWS to Oracle? It was an efficient and cost-effective data consumption pattern for the customer’s Oracle BI environment, especially considering the agility and economics of using an AWS lake with Athena as the on-demand query service for lake content.
By design, all types of lakes should embrace an abstraction that minimizes risk and affords you greater flexibility. They should also be structured for easy consumption independent of their size. This ensures that data scientists, business users, and analysts all work in an environment structured for easy data consumption.
Being a successful early adopter means taking a business value approach rather than a technology one. Here are a few tips as you think about how to get started:
#big data #data lake #data lakes #data lake architecture #data lake solutions #data analysis
As data mesh advocates come to suggest that the data mesh should replace the monolithic, centralized data lake, I wanted to check in with Dipti Borkar, co-founder and Chief Product Officer at Ahana. Dipti has been a tremendous resource for me over the years as she has held leadership positions at Couchbase, Kinetica, and Alluxio.
According to Dipti, while data lakes and data mesh both have use cases they work well for, data mesh can’t replace the data lake unless all data sources are created equal — and for many, that’s not the case.
Not all data sources are equal. There are different dimensions of data:
Each data source has its purpose. Some are built for fast access for small amounts of data, some are meant for real transactions, some are meant for data that applications need, and some are meant for getting insights on large amounts of data.
Things changed when AWS commoditized the storage layer with the AWS S3 object-store 15 years ago. Given the ubiquity and affordability of S3 and other cloud storage, companies are moving most of this data to cloud object stores and building data lakes, where it can be analyzed in many different ways.
Because of the low cost, enterprises can store all of their data (enterprise, third-party, IoT, and streaming) in an S3 data lake. However, the data cannot be processed there directly: you need engines on top, like Hive, Presto, and Spark, to process it. Hadoop tried to do this with limited success; Presto and Spark have solved the SQL-on-S3 query problem.
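As an illustration, here is a minimal sketch of querying lake data with Spark, assuming a Spark session already configured with S3 credentials; the bucket path and column names are hypothetical:
from pyspark.sql import SparkSession

# assumes S3 access is already configured; the bucket path is hypothetical
spark = SparkSession.builder.appName("lake-query").getOrCreate()

# read raw Parquet files straight from the object store
events = spark.read.parquet("s3a://my-data-lake/events/")

# run SQL directly over the lake data
events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()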
#big data #big data analytics #data lake #data lake and data mesh #data mesh
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.
#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition