Building efficient data pipelines using TensorFlow

Building efficient data pipelines using TensorFlow

<strong>Having efficient data pipelines is of paramount importance for any machine learning model. In this blog, we will learn how to use TensorFlow’s Dataset module tf.data to build efficient data pipelines.</strong>

Having efficient data pipelines is of paramount importance for any machine learning model. In this blog, we will learn how to use TensorFlow’s Dataset module tf.data to build efficient data pipelines.

Motivation

Most of the introductory articles on TensorFlow would introduce you with the feed_dict method of feeding the data to the model. feed_dict processes the input data in a single thread and while the data is being loaded and processed on CPU, the GPU remains idle and when the GPU is training a batch of data, CPU remains in the idle state. The developers of TensorFlow have advised not to use this method during training or repeated validation of the same datasets.

tf_data improves the performance by prefetching the next batch of data asynchronously so that GPU need not wait for the data. You can also parallelize the process of preprocessing and loading the dataset.

In this blog post we will cover Datasets and Iterators. We will learn how to create Datasets from source data, apply the transformation to Dataset and then consume the data using Iterators.

How to create Datasets?

Tensorflow provides various methods to create Datasets from numpy arrays, text files, CSV files, tensors, etc. Let’s look at few methods below

  • from_tensor_slices: It accepts single or multiple numpy arrays or tensors. Dataset created using this method will emit only one data at a time.
# source data - numpy array
data = np.arange(10)
# create a dataset from numpy array
dataset = tf.data.Dataset.from_tensor_slices(data)

The object dataset is a tensorflow Dataset object.

  • from_tensors: It also accepts single or multiple numpy arrays or tensors. Dataset created using this method will emit all the data at once.
data = tf.arange(10)
dataset = tf.data.Dataset.from_tensors(data)

3. from_generator: Creates a Dataset whose elements are generated by a function.

def generator():
  for i in range(10):
    yield 2*i

dataset = tf.data.Dataset.from_generator(generator, (tf.int32))

Operations on Datasets

  • Batches: Combines consecutive elements of the Dataset into a single batch. Useful when you want to train smaller batches of data to avoid out of memory errors.
data = np.arange(10,40)

create batches of 10

dataset = tf.data.Dataset.from_tensor_slices(data).batch(10)

creates the iterator to consume the data

iterator = dataset.make_one_shot_iterator()
next_ele = iterator.get_next()
with tf.Session() as sess:
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

You can skip the code where we create an iterator and print the elements of the Dataset. We will learn about Iterators in detail in the later part of this blog. The output is :

[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
  • Zip: Creates a Dataset by zipping together datasets. Useful in scenarios where you have features and labels and you need to provide the pair of feature and label for training the model.
datax = np.arange(10,20)
datay = np.arange(11,21)
datasetx = tf.data.Dataset.from_tensor_slices(datax)
datasety = tf.data.Dataset.from_tensor_slices(datay)
dcombined = tf.data.Dataset.zip((datasetx, datasety)).batch(2)
iterator = dcombined.make_one_shot_iterator()
next_ele = iterator.get_next()
with tf.Session() as sess:
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

The output is

(array([10, 11]), array([11, 12]))
(array([12, 13]), array([13, 14]))
(array([14, 15]), array([15, 16]))
(array([16, 17]), array([17, 18]))
(array([18, 19]), array([19, 20]))
  • Repeat: Used to repeat the Dataset.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))
dataset = dataset.repeat(count=2)
iterator = dataset.make_one_shot_iterator()
next_ele = iterator.get_next()
with tf.Session() as sess:
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

The output is

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
  • Map: Used to transform the elements of the Dataset. Useful in cases where you want to transform your raw data before feeding into the model.
def map_fnc(x):
return x*2;
data = np.arange(10)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.map(map_fnc)
iterator = dataset.make_one_shot_iterator()
next_ele = iterator.get_next()
with tf.Session() as sess:
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

The output is

0 2 4 6 8 10 12 14 16 18

Creating Iterators

We have learned various ways to create Datasets and apply various transformations to it, but how do we consume the data? Tensorflow provides Iterators to do that.

The iterator is not aware of the number of elements present in the Dataset. It has a get_next function that is used to create an operation in the tensorflow graph and when run over a session, it will return the values from the iterator. Once the Dataset is exhausted, it throws an tf.errors.OutOfRangeError exception.

Let’s look at various Iterators that TensorFlow provides.

  • One-shot iterator: This is the most basic form of iterator. It requires no explicit initialization and iterates over the data only one time and once it gets exhausted, it cannot be re-initialized.
data = np.arange(10,15)
#create the dataset
dataset = tf.data.Dataset.from_tensor_slices(data)
#create the iterator
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
val = sess.run(next_element)
print(val)
  • Initializable iterator: This iterator requires you to explicitly initialize the iterator by running iterator.initialize. You can define a tf.placeholder and pass data to it dynamically each time you call the initialize operation.
# define two placeholders to accept min and max value
min_val = tf.placeholder(tf.int32, shape=[])
max_val = tf.placeholder(tf.int32, shape=[])
data = tf.range(min_val, max_val)
dataset = tf.data.Dataset.from_tensor_slices(data)
iterator = dataset.make_initializable_iterator()
next_ele = iterator.get_next()
with tf.Session() as sess:

# initialize an iterator with range of values from 10 to 15
sess.run(iterator.initializer, feed_dict={min_val:10, max_val:15})
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

# initialize an iterator with range of values from 1 to 10
sess.run(iterator.initializer, feed_dict={min_val:1, max_val:10})
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

  • Reinitializable iterator: This iterator can be initialized from different Dataset objects that have the same structure. Each dataset can pass through its own transformation pipeline.
def map_fnc(ele):
return ele*2
min_val = tf.placeholder(tf.int32, shape=[])
max_val = tf.placeholder(tf.int32, shape=[])
data = tf.range(min_val, max_val)
#Define separate datasets for training and validation
train_dataset = tf.data.Dataset.from_tensor_slices(data)
val_dataset = tf.data.Dataset.from_tensor_slices(data).map(map_fnc)
#create an iterator
iterator=tf.data.Iterator.from_structure(train_dataset.output_types ,train_dataset.output_shapes)
train_initializer = iterator.make_initializer(train_dataset)
val_initializer = iterator.make_initializer(val_dataset)
next_ele = iterator.get_next()
with tf.Session() as sess:

initialize an iterator with range of values from 10 to 15

sess.run(train_initializer, feed_dict={min_val:10, max_val:15})
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

initialize an iterator with range of values from 1 to 10

sess.run(val_initializer, feed_dict={min_val:1, max_val:10})
try:
while True:
val = sess.run(next_ele)
print(val)
except tf.errors.OutOfRangeError:
pass

  • Feedable iterator: Can be used to switch between Iterators for different Datasets. Useful when you have different Datasets and you want to have more control over which iterator to use over the Dataset.
def map_fnc(ele):
return ele*2
min_val = tf.placeholder(tf.int32, shape=[])
max_val = tf.placeholder(tf.int32, shape=[])
data = tf.range(min_val, max_val)
train_dataset = tf.data.Dataset.from_tensor_slices(data)
val_dataset = tf.data.Dataset.from_tensor_slices(data).map(map_fnc)
train_val_iterator = tf.data.Iterator.from_structure(train_dataset.output_types , train_dataset.output_shapes)
train_initializer = train_val_iterator.make_initializer(train_dataset)
val_initializer = train_val_iterator.make_initializer(val_dataset)
test_dataset = tf.data.Dataset.from_tensor_slices(tf.range(10,15))
test_iterator = test_dataset.make_one_shot_iterator()
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
next_ele = iterator.get_next()
with tf.Session() as sess:

train_val_handle = sess.run(train_val_iterator.string_handle())
test_handle = sess.run(test_iterator.string_handle())

training

sess.run(train_initializer, feed_dict={min_val:10, max_val:15})
try:
while True:
val = sess.run(next_ele, feed_dict={handle:train_val_handle})
print(val)
except tf.errors.OutOfRangeError:
pass

#validation
sess.run(val_initializer, feed_dict={min_val:1, max_val:10})
try:
while True:
val = sess.run(next_ele, feed_dict={handle:train_val_handle})
print(val)
except tf.errors.OutOfRangeError:
pass

#testing
try:
while True:
val = sess.run(next_ele, feed_dict={handle:test_handle})
print(val)
except tf.errors.OutOfRangeError:
pass

We have learned about various iterators. Let’s apply the knowledge to a practical dataset. We will train the famous MNIST dataset using the LeNet-5 Model. This tutorial will not dive into the details of implementing the LeNet-5 Model as it is beyond the scope of this article.

LeNet-5 Model

Let’s import the MNIST data from the tensorflow library. The MNIST database contains 60,000 training images and 10,000 testing images. Each image is of size 28281. We need to resize it to 32321 for the LeNet-5 Model.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", reshape=False, one_hot = True)
X_train, y_train = mnist.train.images, mnist.train.labels
X_val, y_val = mnist.validation.images, mnist.validation.labels
X_test, y_test = mnist.test.images, mnist.test.labels
X_train = np.pad(X_train, ((0,0), (2,2), (2,2), (0,0)), 'constant')
X_val = np.pad(X_val, ((0,0), (2,2), (2,2), (0,0)), 'constant')
X_test = np.pad(X_test, ((0,0), (2,2), (2,2), (0,0)), 'constant')

Let’s define the forward propagation of the model.

def forward_pass(X):
W1 = tf.get_variable("W1", [5,5,1,6], initializer = tf.contrib.layers.xavier_initializer(seed=0))
# for conv layer2
W2 = tf.get_variable("W2", [5,5,6,16], initializer = tf.contrib.layers.xavier_initializer(seed=0))
Z1 = tf.nn.conv2d(X, W1, strides = [1,1,1,1], padding='VALID')
A1 = tf.nn.relu(Z1)
P1 = tf.nn.max_pool(A1, ksize = [1,2,2,1], strides = [1,2,2,1], padding='VALID')
Z2 = tf.nn.conv2d(P1, W2, strides = [1,1,1,1], padding='VALID')
A2= tf.nn.relu(Z2)
P2= tf.nn.max_pool(A2, ksize = [1,2,2,1], strides=[1,2,2,1], padding='VALID')
P2 = tf.contrib.layers.flatten(P2)

Z3 = tf.contrib.layers.fully_connected(P2, 120)
Z4 = tf.contrib.layers.fully_connected(Z3, 84)
Z5 = tf.contrib.layers.fully_connected(Z4,10, activation_fn= None)
return Z5

Let’s define the model operations

def model(X,Y):

logits = forward_pass(X)
cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.0009)
learner = optimizer.minimize(cost)
correct_predictions = tf.equal(tf.argmax(logits,1),   tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))

return (learner, accuracy)

We have now created the model. Before deciding on the Iterator to use for our model, let’s see what are the typical requirements of a machine learning model.

  1. Training the data over batches: Dataset can be very huge. To prevent out of memory errors, we would need to train our dataset in small batches.
  2. Train the model over n passes of the dataset: Typically you want to run your training model over multiple passes of the dataset.
  3. Validate the model at each epoch: You would need to validate your model at each epoch to check your model’s performance.
  4. Finally, test your model on unseen data: After the model is trained, you would like to test your model on unseen data.

Let’s see the pros and cons of each iterator.

  • One-shot iterator: The Dataset can’t be reinitialized once exhausted. To train for more epochs, you would need to repeat the Dataset before feeding to the iterator. This will require huge memory if the size of the data is large. It also doesn’t provide any option to validate the model.
epochs = 10
batch_size = 64
iterations = len(y_train) * epochs
tf.reset_default_graph()
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))

need to repeat the dataset for epoch number of times, as all the data needs to be fed to the dataset at once

dataset = dataset.repeat(epochs).batch(batch_size)
iterator = dataset.make_one_shot_iterator()
X_batch , Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())

total_accuracy = 0
try:
while True:
temp_accuracy, _ = sess.run([accuracy, learner])
total_accuracy += temp_accuracy

except tf.errors.OutOfRangeError:
pass

print('Avg training accuracy is {}'.format((total_accuracy * batch_size) / iterations ))

  • Initializable iterator: You can dynamically change the Dataset between training and validation Datasets. However, in this case both the Datasets needs to go through the same transformation pipeline.
epochs = 10
batch_size = 64
tf.reset_default_graph()
X_data = tf.placeholder(tf.float32, [None, 32,32,1])
Y_data = tf.placeholder(tf.float32, [None, 10])
dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data))
dataset = dataset.batch(batch_size)
iterator = dataset.make_initializable_iterator()
X_batch , Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(epochs):

# train the model
sess.run(iterator.initializer, feed_dict={X_data:X_train, Y_data:y_train})
total_train_accuracy = 0
no_train_examples = len(y_train)
try:
  while True:
    temp_train_accuracy, _ = sess.run([accuracy, learner])
    total_train_accuracy += temp_train_accuracy*batch_size
except tf.errors.OutOfRangeError:
  pass

# validate the model
sess.run(iterator.initializer, feed_dict={X_data:X_val, Y_data:y_val})
total_val_accuracy = 0
no_val_examples = len(y_val)
try:
  while True:
    temp_val_accuracy = sess.run(accuracy)
    total_val_accuracy += temp_val_accuracy*batch_size
except tf.errors.OutOfRangeError:
  pass

print('Epoch {}'.format(str(epoch+1)))
print("---------------------------")
print('Training accuracy is {}'.format(total_train_accuracy/no_train_examples))
print('Validation accuracy is {}'.format(total_val_accuracy/no_val_examples))

  • Re-initializable iterator: This iterator overcomes the problem of initializable iterator by using two separate Datasets. Each dataset can go through its own preprocessing pipeline. The iterator can be created using the tf.Iterator.from_structure method.
def map_fnc(X, Y):
return X, Y
epochs = 10
batch_size = 64
tf.reset_default_graph()
X_data = tf.placeholder(tf.float32, [None, 32,32,1])
Y_data = tf.placeholder(tf.float32, [None, 10])
train_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size).map(map_fnc)
val_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size)
iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
X_batch , Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)
train_initializer = iterator.make_initializer(train_dataset)
val_initializer = iterator.make_initializer(val_dataset)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(epochs):

# train the model
sess.run(train_initializer, feed_dict={X_data:X_train, Y_data:y_train})
total_train_accuracy = 0
no_train_examples = len(y_train)
try:
  while True:
    temp_train_accuracy, _ = sess.run([accuracy, learner])
    total_train_accuracy += temp_train_accuracy*batch_size
except tf.errors.OutOfRangeError:
  pass

# validate the model
sess.run(val_initializer, feed_dict={X_data:X_val, Y_data:y_val})
total_val_accuracy = 0
no_val_examples = len(y_val)
try:
  while True:
    temp_val_accuracy = sess.run(accuracy)
    total_val_accuracy += temp_val_accuracy*batch_size
except tf.errors.OutOfRangeError:
  pass

print('Epoch {}'.format(str(epoch+1)))
print("---------------------------")
print('Training accuracy is {}'.format(total_train_accuracy/no_train_examples))
print('Validation accuracy is {}'.format(total_val_accuracy/no_val_examples))

  • Feedable iterator: This iterator provides the option of switching between various iterators. You can create a re-initializable iterator for training and validation purposes. For inference/testing where you require one pass of the dataset, you can use the one shot iterator.
epochs = 10
batch_size = 64
tf.reset_default_graph()
X_data = tf.placeholder(tf.float32, [None, 32,32,1])
Y_data = tf.placeholder(tf.float32, [None, 10])
train_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices((X_data, Y_data)).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test.astype(np.float32)).batch(batch_size)
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
X_batch , Y_batch = iterator.get_next()
(learner, accuracy) = model(X_batch, Y_batch)
train_val_iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_iterator = train_val_iterator.make_initializer(train_dataset)
val_iterator = train_val_iterator.make_initializer(val_dataset)
test_iterator = test_dataset.make_one_shot_iterator()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
train_val_string_handle = sess.run(train_val_iterator.string_handle())
test_string_handle = sess.run(test_iterator.string_handle())

for epoch in range(epochs):

# train the model
sess.run(train_iterator, feed_dict={X_data:X_train, Y_data:y_train})
total_train_accuracy = 0
no_train_examples = len(y_train)
try:
  while True:
    temp_train_accuracy, _ = sess.run([accuracy, learner], feed_dict={handle:train_val_string_handle})
    total_train_accuracy += temp_train_accuracy*batch_size
except tf.errors.OutOfRangeError:
  pass

# validate the model
sess.run(val_iterator, feed_dict={X_data:X_val, Y_data:y_val})
total_val_accuracy = 0
no_val_examples = len(y_val)
try:
  while True:
    temp_val_accuracy, _ = sess.run([accuracy, learner], feed_dict={handle:train_val_string_handle})
    total_val_accuracy += temp_val_accuracy*batch_size
except tf.errors.OutOfRangeError:
  pass

print('Epoch {}'.format(str(epoch+1)))
print("---------------------------")
print('Training accuracy is {}'.format(total_train_accuracy/no_train_examples))
print('Validation accuracy is {}'.format(total_val_accuracy/no_val_examples))

print("Testing the model --------")

total_test_accuracy = 0
no_test_examples = len(y_test)
try:
while True:
temp_test_accuracy, _ = sess.run([accuracy, learner], feed_dict={handle:test_string_handle})
total_test_accuracy += temp_test_accuracy*batch_size
except tf.errors.OutOfRangeError:
pass

print('Testing accuracy is {}'.format(total_test_accuracy/no_test_examples))

Thanks for reading the blog. The code examples used in this blog can be found in this jupyter notebook.

Do leave your comments below if you have any questions or if you have any suggestions for improving this blog.

Originally published by Animesh Agarwal at https://towardsdatascience.com/

Learn More

☞ Complete Guide to TensorFlow for Deep Learning with Python

☞ Modern Deep Learning in Python

☞ TensorFlow 101: Introduction to Deep Learning

☞ Tensorflow Bootcamp For Data Science in Python

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Data Science A-Z™: Real-Life Data Science Exercises Included

Learn Data Science | How to Learn Data Science for Free

Learn Data Science | How to Learn Data Science for Free

Learn Data Science | How to Learn Data Science for Free. In this post, I have described a learning path and free online courses and tutorials that will enable you to learn data science for free.

The average cost of obtaining a masters degree at traditional bricks and mortar institutions will set you back anywhere between $30,000 and $120,000. Even online data science degree programs don’t come cheap costing a minimum of $9,000. So what do you do if you want to learn data science but can’t afford to pay this?

I trained into a career as a data scientist without taking any formal education in the subject. In this article, I am going to share with you my own personal curriculum for learning data science if you can’t or don’t want to pay thousands of dollars for more formal study.

The curriculum will consist of 3 main parts, technical skills, theory and practical experience. I will include links to free resources for every element of the learning path and will also be including some links to additional ‘low cost’ options. So if you want to spend a little money to accelerate your learning you can add these resources to the curriculum. I will include the estimated costs for each of these.

Technical skills

The first part of the curriculum will focus on technical skills. I recommend learning these first so that you can take a practical first approach rather than say learning the mathematical theory first. Python is by far the most widely used programming language used for data science. In the Kaggle Machine Learning and Data Science survey carried out in 2018 83% of respondents said that they used Python on a daily basis. I would, therefore, recommend focusing on this language but also spending a little time on other languages such as R.

Python Fundamentals

Before you can start to use Python for data science you need a basic grasp of the fundamentals behind the language. So you will want to take a Python introductory course. There are lots of free ones out there but I like the Codeacademy ones best as they include hands-on in-browser coding throughout.

I would suggest taking the introductory course to learn Python. This covers basic syntax, functions, control flow, loops, modules and classes.

Data analysis with python

Next, you will want to get a good understanding of using Python for data analysis. There are a number of good resources for this.

To start with I suggest taking at least the free parts of the data analyst learning path on dataquest.io. Dataquest offers complete learning paths for data analyst, data scientist and data engineer. Quite a lot of the content, particularly on the data analyst path is available for free. If you do have some money to put towards learning then I strongly suggest putting it towards paying for a few months of the premium subscription. I took this course and it provided a fantastic grounding in the fundamentals of data science. It took me 6 months to complete the data scientist path. The price varies from $24.50 to $49 per month depending on whether you pay annually or not. It is better value to purchase the annual subscription if you can afford it.

The Dataquest platform

Python for machine learning

If you have chosen to pay for the full data science course on Dataquest then you will have a good grasp of the fundamentals of machine learning with Python. If not then there are plenty of other free resources. I would focus to start with on scikit-learn which is by far the most commonly used Python library for machine learning.

When I was learning I was lucky enough to attend a two-day workshop run by Andreas Mueller one of the core developers of scikit-learn. He has however published all the material from this course, and others, on this Github repo. These consist of slides, course notes and notebooks that you can work through. I would definitely recommend working through this material.

Then I would suggest taking some of the tutorials in the scikit-learn documentation. After that, I would suggest building some practical machine learning applications and learning the theory behind how the models work — which I will cover a bit later on.

SQL

SQL is a vital skill to learn if you want to become a data scientist as one of the fundamental processes in data modelling is extracting data in the first place. This will more often than not involve running SQL queries against a database. Again if you haven’t opted to take the full Dataquest course then here are a few free resources to learn this skill.

Codeacamdemy has a free introduction to SQL course. Again this is very practical with in-browser coding all the way through. If you also want to learn about cloud-based database querying then Google Cloud BigQuery is very accessible. There is a free tier so you can try queries for free, an extensive range of public datasets to try and very good documentation.

Codeacademy SQL course

R

To be a well-rounded data scientist it is a good idea to diversify a little from just Python. I would, therefore, suggest also taking an introductory course in R. Codeacademy have an introductory course on their free plan. It is probably worth noting here that similar to Dataquest Codeacademy also offers a complete data science learning plan as part of their pro account (this costs from $31.99 to $15.99 per month depending on how many months you pay for up front). I personally found the Dataquest course to be much more comprehensive but this may work out a little cheaper if you are looking to follow a learning path on a single platform.

Software engineering

It is a good idea to get a grasp of software engineering skills and best practices. This will help your code to be more readable and extensible both for yourself and others. Additionally, when you start to put models into production you will need to be able to write good quality well-tested code and work with tools like version control.

There are two great free resources for this. Python like you mean it covers things like the PEP8 style guide, documentation and also covers object-oriented programming really well.

The scikit-learn contribution guidelines, although written to facilitate contributions to the library, actually cover the best practices really well. This covers topics such as Github, unit testing and debugging and is all written in the context of a data science application.

Deep learning

For a comprehensive introduction to deep learning, I don’t think that you can get any better than the totally free and totally ad-free fast.ai. This course includes an introduction to machine learning, practical deep learning, computational linear algebra and a code-first introduction to natural language processing. All their courses have a practical first approach and I highly recommend them.

Fast.ai platform

Theory

Whilst you are learning the technical elements of the curriculum you will encounter some of the theory behind the code you are implementing. I recommend that you learn the theoretical elements alongside the practical. The way that I do this is that I learn the code to be able to implement a technique, let’s take KMeans as an example, once I have something working I will then look deeper into concepts such as inertia. Again the scikit-learn documentation contains all the mathematical concepts behind the algorithms.

In this section, I will introduce the key foundational elements of theory that you should learn alongside the more practical elements.

The khan academy covers almost all the concepts I have listed below for free. You can tailor the subjects you would like to study when you sign up and you then have a nice tailored curriculum for this part of the learning path. Checking all of the boxes below will give you an overview of most elements I have listed below.

Maths

Calculus

Calculus is defined by Wikipedia as “the mathematical study of continuous change.” In other words calculus can find patterns between functions, for example, in the case of derivatives, it can help you to understand how a function changes over time.

Many machine learning algorithms utilise calculus to optimise the performance of models. If you have studied even a little machine learning you will probably have heard of Gradient descent. This functions by iteratively adjusting the parameter values of a model to find the optimum values to minimise the cost function. Gradient descent is a good example of how calculus is used in machine learning.

What you need to know:

Derivatives

  • Geometric definition
  • Calculating the derivative of a function
  • Nonlinear functions

Chain rule

  • Composite functions
  • Composite function derivatives
  • Multiple functions

Gradients

  • Partial derivatives
  • Directional derivatives
  • Integrals

Linear Algebra

Many popular machine learning methods, including XGBOOST, use matrices to store inputs and process data. Matrices alongside vector spaces and linear equations form the mathematical branch known as Linear Algebra. In order to understand how many machine learning methods work it is essential to get a good understanding of this field.

What you need to learn:

Vectors and spaces

  • Vectors
  • Linear combinations
  • Linear dependence and independence
  • Vector dot and cross products

Matrix transformations

  • Functions and linear transformations
  • Matrix multiplication
  • Inverse functions
  • Transpose of a matrix

Statistics

Here is a list of the key concepts you need to know:

Descriptive/Summary statistics

  • How to summarise a sample of data
  • Different types of distributions
  • Skewness, kurtosis, central tendency (e.g. mean, median, mode)
  • Measures of dependence, and relationships between variables such as correlation and covariance

Experiment design

  • Hypothesis testing
  • Sampling
  • Significance tests
  • Randomness
  • Probability
  • Confidence intervals and two-sample inference

Machine learning

  • Inference about slope
  • Linear and non-linear regression
  • Classification

Practical experience

The third section of the curriculum is all about practice. In order to truly master the concepts above you will need to use the skills in some projects that ideally closely resemble a real-world application. By doing this you will encounter problems to work through such as missing and erroneous data and develop a deep level of expertise in the subject. In this last section, I will list some good places you can get this practical experience from for free.

“With deliberate practice, however, the goal is not just to reach your potential but to build it, to make things possible that were not possible before. This requires challenging homeostasis — getting out of your comfort zone — and forcing your brain or your body to adapt.”, Anders Ericsson, Peak: Secrets from the New Science of Expertise

Kaggle, et al

Machine learning competitions are a good place to get practice with building machine learning models. They give access to a wide range of data sets, each with a specific problem to solve and have a leaderboard. The leaderboard is a good way to benchmark how good your knowledge at developing a good model actually is and where you may need to improve further.

In addition to Kaggle, there are other platforms for machine learning competitions including Analytics Vidhya and DrivenData.

Driven data competitions page

UCI Machine Learning Repository

The UCI machine learning repository is a large source of publically available data sets. You can use these data sets to put together your own data projects this could include data analysis and machine learning models, you could even try building a deployed model with a web front end. It is a good idea to store your projects somewhere publically such as Github as this can create a portfolio showcasing your skills to use for future job applications.


UCI repository

Contributions to open source

One other option to consider is contributing to open source projects. There are many Python libraries that rely on the community to maintain them and there are often hackathons held at meetups and conferences where even beginners can join in. Attending one of these events would certainly give you some practical experience and an environment where you can learn from others whilst giving something back at the same time. Numfocus is a good example of a project like this.

In this post, I have described a learning path and free online courses and tutorials that will enable you to learn data science for free. Showcasing what you are able to do in the form of a portfolio is a great tool for future job applications in lieu of formal qualifications and certificates. I really believe that education should be accessible to everyone and, certainly, for data science at least, the internet provides that opportunity. In addition to the resources listed here, I have previously published a recommended reading list for learning data science available here. These are also all freely available online and are a great way to complement the more practical resources covered above.

Thanks for reading!

Data Science vs Data Analytics vs Big Data

Data Science vs Data Analytics vs Big Data

When we talk about data processing, Data Science vs Big Data vs Data Analytics are the terms that one might think of and there has always been a confusion between them. In this article on Data science vs Big Data vs Data Analytics, I will understand the similarities and differences between them

When we talk about data processing, Data Science vs Big Data vs Data Analytics are the terms that one might think of and there has always been a confusion between them. In this article on Data science vs Big Data vs Data Analytics, I will understand the similarities and differences between them

We live in a data-driven world. In fact, the amount of digital data that exists is growing at a rapid rate, doubling every two years, and changing the way we live. Now that Hadoop and other frameworks have resolved the problem of storage, the main focus on data has shifted to processing this huge amount of data. When we talk about data processing, Data Science vs Big Data vs Data Analytics are the terms that one might think of and there has always been a confusion between them.

In this article on Data Science vs Data Analytics vs Big Data, I will be covering the following topics in order to make you understand the similarities and differences between them.
Introduction to Data Science, Big Data & Data AnalyticsWhat does Data Scientist, Big Data Professional & Data Analyst do?Skill-set required to become Data Scientist, Big Data Professional & Data AnalystWhat is a Salary Prospect?Real time Use-case## Introduction to Data Science, Big Data, & Data Analytics

Let’s begin by understanding the terms Data Science vs Big Data vs Data Analytics.

What Is Data Science?

Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data.

[Source: gfycat.com]

It also involves solving a problem in various ways to arrive at the solution and on the other hand, it involves to design and construct new processes for data modeling and production using various prototypes, algorithms, predictive models, and custom analysis.

What is Big Data?

Big Data refers to the large amounts of data which is pouring in from various data sources and has different formats. It is something that can be used to analyze the insights which can lead to better decisions and strategic business moves.

[Source: gfycat.com]

What is Data Analytics?

Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information. It is all about discovering useful information from the data to support decision-making. This process involves inspecting, cleansing, transforming & modeling data.

[Source: ibm.com]

What Does Data Scientist, Big Data Professional & Data Analyst Do?

What does a Data Scientist do?

Data Scientists perform an exploratory analysis to discover insights from the data. They also use various advanced machine learning algorithms to identify the occurrence of a particular event in the future. This involves identifying hidden patterns, unknown correlations, market trends and other useful business information.

Roles of Data Scientist

What do Big Data Professionals do?

The responsibilities of big data professional lies around dealing with huge amount of heterogeneous data, which is gathered from various sources coming in at a high velocity.

Roles of Big Data Professiona

Big data professionals describe the structure and behavior of a big data solution and how it can be delivered using big data technologies such as Hadoop, Spark, Kafka etc. based on requirements.

What does a Data Analyst do?

Data analysts translate numbers into plain English. Every business collects data, like sales figures, market research, logistics, or transportation costs. A data analyst’s job is to take that data and use it to help companies to make better business decisions.

Roles of Data Analyst

Skill-Set Required To Become Data Scientist, Big Data Professional, & Data Analyst

What Is The Salary Prospect?

The below figure shows the average salary structure of **Data Scientist, Big Data Specialist, **and Data Analyst.

A Scenario Illustrating The Use Of Data Science vs Big Data vs Data Analytics.

Now, let’s try to understand how can we garner benefits by combining all three of them together.

Let’s take an example of Netflix and see how they join forces in achieving the goal.

First, let’s understand the role of* Big Data Professional* in Netflix example.

Netflix generates a huge amount of unstructured data in forms of text, audio, video files and many more. If we try to process this dark (unstructured) data using the traditional approach, it becomes a complicated task.

Approach in Netflix

Traditional Data Processing

Hence a Big Data Professional designs and creates an environment using Big Data tools to ease the processing of Netflix Data.

Big Data approach to process Netflix data

Now, let’s see how Data Scientist Optimizes the Netflix Streaming experience.

Role of Data Scientist in Optimizing the Netflix streaming experience

1. Understanding the impact of QoE on user behavior

User behavior refers to the way how a user interacts with the Netflix service, and data scientists use the data to both understand and predict behavior. For example, how would a change to the Netflix product affect the number of hours that members watch? To improve the streaming experience, Data Scientists look at QoE metrics that are likely to have an impact on user behavior. One metric of interest is the rebuffer rate, which is a measure of how often playback is temporarily interrupted. Another metric is bitrate, that refers to the quality of the picture that is served/seen — a very low bitrate corresponds to a fuzzy picture.

2. Improving the streaming experience

How do Data Scientists use data to provide the best user experience once a member hits “play” on Netflix?

One approach is to look at the algorithms that run in real-time or near real-time once playback has started, which determine what bitrate should be served, what server to download that content from, etc.

For example, a member with a high-bandwidth connection on a home network could have very different expectations and experience compared to a member with low bandwidth on a mobile device on a cellular network.

By determining all these factors one can improve the streaming experience.

3. Optimize content caching

A set of big data problems also exists on the content delivery side.

The key idea here is to locate the content closer (in terms of network hops) to Netflix members to provide a great experience. By viewing the behavior of the members being served and the experience, one can optimize the decisions around content caching.

4. Improving content quality

Another approach to improving user experience involves looking at the quality of content, i.e. the video, audio, subtitles, closed captions, etc. that are part of the movie or show. Netflix receives content from the studios in the form of digital assets that are then encoded and quality checked before they go live on the content servers.

In addition to the internal quality checks, Data scientists also receive feedback from our members when they discover issues while viewing.

By combining member feedback with intrinsic factors related to viewing behavior, they build the models to predict whether a particular piece of content has a quality issue. Machine learning models along with natural language processing (NLP) and text mining techniques can be used to build powerful models to both improve the quality of content that goes live and also use the information provided by the Netflix users to close the loop on quality and replace content that does not meet the expectations of the users.

So this is how Data Scientist optimizes the Netflix streaming experience.

Now let’s understand how Data Analytics is used to drive the Netflix success.

Role of Data Analyst in Netflix

The above figure shows the different types of users who watch the video/play on Netflix. Each of them has their own choices and preferences.

So what does a Data Analyst do?

Data Analyst creates a user stream based on the preferences of users. For example, if user 1 and user 2 have the same preference or a choice of video, then data analyst creates a user stream for those choices. And also –
Orders the Netflix collection for each member profile in a personalized way.We know that the same genre row for each member has an entirely different selection of videos.Picks out the top personalized recommendations from the entire catalog, focusing on the titles that are top on ranking.By capturing all events and user activities on Netflix, data analyst pops out the trending video.Sorts the recently watched titles and estimates whether the member will continue to watch or rewatch or stop watching etc.
I hope you have *understood *the *differences *& *similarities *between Data Science vs Big Data vs Data Analytics.

Data Science Training | Data Science for Beginners | Intellipaat - YouTube

Data Science Training  | Data Science for Beginners | Intellipaat - YouTube

In this data science for beginners video you will learn complete data science starting from what is data science to various data science concepts, data science projects and data science interview questions & answers. This is an end to end data science full course.

Why Data Science is important?

Data Science is taking over each and every industry domain. Machine Learning and especially Deep Learning are the most important aspects of Data Science that are being deployed everywhere from search engines to online movie recommendations. Taking the Intellipaat Data Science training & Data Science Course can help professionals to build a solid career in a rising technology domain and get the best jobs in top organizations.

Why should you opt for a Data Science career?

If you want to fast-track your career then you should strongly consider Data Science. The reason for this is that it is one of the fastest growing technology. There is a huge demand for Data Scientist. The salaries for Data Scientist is fantastic.There is a huge growth opportunity in this domain as well. Hence this Intellipaat Data Science project tutorial is your stepping stone to a successful career!