1596479105
Why did I stop using Keras' `ImageDataGenerator`? Simply because it is slow; in fact, **5–8 times** slower than TensorFlow's `tf.data` when loading images.
Image classification and object detection are among the problems where Artificial Intelligence has become extremely good (though there is always room for improvement). TensorFlow, Keras, and PyTorch, among other open-source Machine Learning frameworks, are used extensively by the research community to train their own models or to use the freely available pre-trained models.
Photo by the Author. From YouTube Video
The first step in any machine learning problem is to have a good, clean dataset and then **load this dataset** to train your model. In this article, I will discuss **two different ways to load** an image dataset, using Keras (`ImageDataGenerator`) or TensorFlow (`tf.data`), and will show the performance difference.
*(The `cache` variable is defined later in the code.)*
| Loading method | Loading speed |
| --- | --- |
| `Keras.ImageDataGenerator` | 479 images/sec |
| `tf.data` (`cache=True`) | 2,511 images/sec |
| `tf.data` (`cache=False`) | 913 images/sec |
| `tf.data` (`cache='some_path.tfcache'`) | 359 images/sec (first run) -> 1,911 images/sec (subsequent runs) |
Here, I have shown a comparison of **how many images per second** are loaded by `Keras.ImageDataGenerator` and TensorFlow's `tf.data` (using 3 different settings of the built-in `cache` variable, as shown in the table above). The results were measured on a workstation running Ubuntu 20.04 with 16 GB RAM and a 2.80 GHz Core i7. The dataset was downloaded from [Kaggle- dogs_and_cats](https://www.kaggle.com/chetankv/dogs-cats-images) and contains 10,000 images in total (8,000 train, 2,000 validation).
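As a minimal sketch of the two loading approaches (the directory path, image size, and batch size below are my own placeholders, not the author's exact settings, and labels are omitted to keep the focus on loading speed):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (224, 224)              # assumed image size
BATCH = 32                         # assumed batch size
DATA_DIR = "dataset/training_set"  # hypothetical folder with one subfolder per class

# 1) Keras ImageDataGenerator: the slower option in the benchmark above
keras_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    DATA_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode="binary")

# 2) tf.data pipeline with the `cache` variable from the table above
def decode_image(path):
    img = tf.io.read_file(path)                    # read raw bytes from disk
    img = tf.image.decode_jpeg(img, channels=3)    # decode to a uint8 tensor
    return tf.image.resize(img, IMG_SIZE) / 255.0  # resize and rescale

def make_dataset(cache=True):
    ds = tf.data.Dataset.list_files(DATA_DIR + "/*/*.jpg")
    ds = ds.map(decode_image, num_parallel_calls=tf.data.AUTOTUNE)
    if cache is True:
        ds = ds.cache()        # keep decoded images in RAM
    elif cache:
        ds = ds.cache(cache)   # a string path caches them on disk instead
    return ds.batch(BATCH).prefetch(tf.data.AUTOTUNE)

train_ds = make_dataset(cache=True)
```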
#machine-learning #towards-data-science #artificial-intelligence #tensorflow #keras
1620466520
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to deliver on-target outcomes can all be traced back to an organization's capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of a data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.
#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition
1620629020
The opportunities big data offers also come with very real challenges that many organizations face today. Often, the challenge is finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats coming from a growing number of sources. Organizations then need the analytical capabilities and flexibility to turn this data into insights that meet their specific business objectives.
This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.
As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).
#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management
1597797000
This article is the follow-up to Part 1. Here, I will compare actual training times of `tf.data` and `Keras.ImageDataGenerator` using the MobileNet model.
Photo by the Author. From YouTube Video
In Part 1, I showed that loading images using `tf.data` is approximately 5 times faster than with `Keras.ImageDataGenerator`. The dataset considered was Kaggle- dogs_and_cats (217 MB), with **10,000 images** distributed between 2 classes.
In this Part 2, I have considered a bigger dataset that is commonly used for image classification problems: COCO2017 (18 GB), with **117,266 images** distributed among **80 different classes**. Various versions of the COCO dataset are freely available to try and test at this link. The reason for choosing the bigger 18 GB dataset is to get more meaningful comparison results. For a practical image classification problem, datasets can be even bigger, ranging from 100 GB (gigabytes) to a few TB (terabytes). In our case, 18 GB of data is enough to understand the comparison, as a terabyte-scale dataset would significantly increase the training times and computational resources required.
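As a rough sketch of how such a timing comparison can be set up (the classification head, random weights, and tiny synthetic stand-in data below are my assumptions, not the author's code), `model.fit` accepts both a `Keras.ImageDataGenerator` and a `tf.data.Dataset`, so only the input pipeline needs to change between timed runs:

```python
import tensorflow as tf

NUM_CLASSES = 80  # COCO2017 class count mentioned above

# MobileNet backbone with a simple classification head (assumed setup);
# weights=None (random init) keeps the sketch self-contained offline.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, pooling="avg", weights=None)
model = tf.keras.Sequential(
    [base, tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A tiny synthetic dataset stands in for the real image pipeline here;
# swap in the ImageDataGenerator or tf.data pipeline to time each one.
train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((8, 224, 224, 3)),
     tf.zeros(8, dtype=tf.int32))).batch(4)
model.fit(train_ds, epochs=1)
```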
#keras #machine-learning #deep-learning #tensorflow #artificial-intelligence
1624546800
As data mesh advocates come to suggest that the data mesh should replace the monolithic, centralized data lake, I wanted to check in with Dipti Borkar, co-founder and Chief Product Officer at Ahana. Dipti has been a tremendous resource for me over the years as she has held leadership positions at Couchbase, Kinetica, and Alluxio.
According to Dipti, while data lakes and data mesh both have use cases they work well for, data mesh can’t replace the data lake unless all data sources are created equal — and for many, that’s not the case.
Not all data sources are equal; data differs along several dimensions.
Each data source has its purpose. Some are built for fast access for small amounts of data, some are meant for real transactions, some are meant for data that applications need, and some are meant for getting insights on large amounts of data.
Things changed when AWS commoditized the storage layer with the S3 object store 15 years ago. Given the ubiquity and affordability of S3 and other cloud storage, companies are moving most of this data to cloud object stores and building data lakes, where it can be analyzed in many different ways.
Because of the low cost, enterprises can store all of their data (enterprise, third-party, IoT, and streaming) in an S3 data lake. However, the data cannot be processed there; you need engines on top, like Hive, Presto, and Spark, to process it. Hadoop tried to do this with limited success, but Presto and Spark have solved the SQL-on-S3 query problem.
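As an illustrative sketch of what SQL-on-S3 looks like in practice (the bucket name and table layout are hypothetical, and querying real S3 also requires the Hadoop S3 connector to be configured), Spark, one of the engines named above, can read lake files directly and expose them to SQL:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; Presto and Hive expose similar SQL interfaces.
spark = SparkSession.builder.appName("s3-lake-query").getOrCreate()

# Hypothetical Parquet files sitting in an S3 data lake
events = spark.read.parquet("s3a://example-bucket/events/")
events.createOrReplaceTempView("events")

# Ad-hoc insight directly over object storage, with no warehouse load step
spark.sql(
    "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type"
).show()
```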
#big data #big data analytics #data lake #data lake and data mesh #data mesh