Thousands of CSV files, Keras and TensorFlow


I have about 15 000 CSV files with 6 000 rows in each, and I need to train a neural network on all of this data: about 90 000 000 instances in total. That’s what real machine learning looks like!

I hope to save you some time by showing how to train NNs using generators, tf.data.Dataset, and other pretty interesting stuff.


They go for you. Image author: Denis Shilov (that’s me).

Intro

There is no way to concatenate all this data into one file or one array: the result would be huge.

So the only way to handle this much data is to process it in batches:

  1. We get a list of files
  2. We split it into training and validation sets
  3. We feed the data to Keras batch by batch

Easy part

Relax, this part is really easy. Important note: all files must be in a single directory for this code to work.

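from os import walk

INPUT_DATA_DIR = "split_together_passed/"
TRAIN_DATA_COEFFICIENT = 0.75  # share of files used for training

files = []

# Take the file names from the top-level directory only (no recursion).
for (dirpath, dirnames, filenames) in walk(INPUT_DATA_DIR):
    files.extend(filenames)
    break

# The first 75% of the files go to training, the rest to validation.
train_files_finish = int(len(files) * TRAIN_DATA_COEFFICIENT)
train_files = files[:train_files_finish]
validation_files = files[train_files_finish:]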
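One thing to keep in mind: walk returns file names in arbitrary order, so the split above can differ between machines or runs. If that matters, a minimal sketch of a reproducible split could look like this. The helper name split_files, the seed value and the example file names are made up for illustration; they are not part of the original code.

```python
import random


def split_files(files, train_fraction=0.75, seed=42):
    """Return (train_files, validation_files) after a reproducible shuffle."""
    # Sort first so the starting order does not depend on the file system.
    ordered = sorted(files)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(ordered)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


example_files = [f"part_{i:03d}.csv" for i in range(8)]  # hypothetical names
train_part, validation_part = split_files(example_files)
print(len(train_part), len(validation_part))  # 6 2
```

Shuffling before the split also avoids putting, say, all the alphabetically last files into validation, which may or may not be what you want for your data.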

Tricky part #1

There are several approaches to feeding data in batches.

One of them is to use a generator: a function that returns an iterator, which we can then step through one value at a time.
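If generators are new to you, here is a tiny illustration of the idea, unrelated to CSV files. The function countdown is a made-up example.

```python
def countdown(start):
    """Yield start, start - 1, ..., 1, one value per request."""
    current = start
    while current > 0:
        # Execution pauses here until the caller asks for the next value.
        yield current
        current -= 1


numbers = countdown(3)  # nothing runs yet; this only creates the iterator
print(next(numbers))    # 3
print(list(numbers))    # [2, 1], the values that were left
```

The important property is that values are produced on demand, so the whole sequence never has to sit in memory at once.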

Keras accepts such a generator directly in .fit.

So let’s write that generator!

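import pandas as pd

def generate_batches(files, batch_size):
    counter = 0

    # Loop forever: Keras decides when an epoch ends via steps_per_epoch.
    while True:
        fname = files[counter]

        # Move on to the next file, wrapping around after the last one.
        counter = (counter + 1) % len(files)
        frame = pd.read_csv(INPUT_DATA_DIR + fname)

        ## here is your preprocessing

        inputs = ...   ## placeholder: build your input array from frame here
        outputs = ...  ## placeholder: build your output array from frame here

        # Cut the file into consecutive chunks of at most batch_size rows.
        for local_index in range(0, inputs.shape[0], batch_size):
            inputs_local = inputs[local_index:(local_index + batch_size)]
            outputs_local = outputs[local_index:(local_index + batch_size)]

            yield inputs_local, outputs_local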

Pretty self-explanatory: you pass in the list of files and the batch_size, and the generator yields the corresponding input and output batches.
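To see the batching logic run end to end, here is a self-contained sketch of the same idea. It makes several assumptions that are not in the original: it reads the files with the standard-library csv module instead of pandas (so that it runs without extra dependencies), it treats the last column as the output and all other columns as inputs, and it uses two made-up three-row files. The function name generate_batches_from_dir is hypothetical as well.

```python
import csv
import os
import tempfile


def generate_batches_from_dir(data_dir, files, batch_size):
    """Cycle over the files forever, yielding (inputs, outputs) batches."""
    counter = 0
    while True:
        path = os.path.join(data_dir, files[counter])
        counter = (counter + 1) % len(files)

        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader)  # skip the header row
            rows = [[float(value) for value in row] for row in reader]

        # Assumed layout: every column except the last one is a feature.
        inputs = [row[:-1] for row in rows]
        outputs = [row[-1] for row in rows]

        # Consecutive chunks of at most batch_size rows from this file.
        for start in range(0, len(inputs), batch_size):
            stop = start + batch_size
            yield inputs[start:stop], outputs[start:stop]


# Tiny demonstration with two made-up files of three rows each.
demo_dir = tempfile.mkdtemp()
for name, offset in (("a.csv", 0), ("b.csv", 10)):
    with open(os.path.join(demo_dir, name), "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["x1", "x2", "y"])
        for i in range(3):
            writer.writerow([offset + i, offset + i + 0.5, i % 2])

batches = generate_batches_from_dir(demo_dir, ["a.csv", "b.csv"], batch_size=2)
first_inputs, first_outputs = next(batches)
print(first_inputs)   # the first two rows of a.csv
print(first_outputs)
```

Note that the last batch of each file is smaller than batch_size when the row count is not a multiple of it. For real training you would convert the lists to NumPy arrays before yielding them.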

Then you can create your generators like this:

batch_size = 18 ** 3
train_generator = generate_batches(files=train_files, batch_size=batch_size)
test_generator = generate_batches(files=validation_files, batch_size=batch_size)

batch_size is how many rows are returned at once. Typically you can set it to the number of rows in a single CSV, but if that number is too large, pick something smaller (5 000, for example).

from tensorflow.keras.callbacks import EarlyStopping

callback_list = [EarlyStopping(monitor='val_loss', patience=25)]

hist = model.fit(
    steps_per_epoch=len(train_files),
    use_multiprocessing=True,
    workers=6,
    x=train_generator,
    verbose=1,
    max_queue_size=32,
    epochs=100,
    callbacks=callback_list,
    validation_data=test_generator,
    validation_steps=len(validation_files)
)

And that is how you fit the model.

callback_list holds the training callbacks. Here, EarlyStopping monitors val_loss and stops training once it has not improved for 25 epochs in a row, because at that point there is no reason to continue.
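The patience rule can be illustrated in a few lines of plain Python. This is a simplified sketch of the idea, not the Keras implementation; the function name early_stopping_epoch and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # A new best value resets the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None  # the loss kept improving often enough


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.64]  # made-up values
print(early_stopping_epoch(losses, patience=3))  # 6
```

With patience=4 the same sequence would not trigger a stop, because the loss improves again at epoch 7.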

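steps_per_epoch is the number of batches that make up one epoch. The generator never ends, so if you don’t provide this value Keras won’t know the length of your data and will print “Unknown” in the log. Note that generate_batches yields several batches per file whenever a file has more rows than batch_size, so len(train_files) steps cover every training file only if each file fits into a single batch.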
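As a quick check of the arithmetic, the number of steps needed for one full pass over the files can be computed like this. The sketch assumes that every file has exactly 6 000 rows (the article only says "about"), and the helper name steps_for_one_pass is made up.

```python
import math


def steps_for_one_pass(n_files, rows_per_file, batch_size):
    """Number of batches the generator yields before it wraps around."""
    batches_per_file = math.ceil(rows_per_file / batch_size)
    return n_files * batches_per_file


train_file_count = int(15_000 * 0.75)  # 11 250 training files
print(steps_for_one_pass(train_file_count, 6_000, 18 ** 3))  # 22500
print(steps_for_one_pass(train_file_count, 6_000, 6_000))    # 11250
```

With batch_size = 18 ** 3 = 5 832, each 6 000-row file is split into one large batch and one small one, so a full pass takes twice as many steps as there are files.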

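use_multiprocessing controls how the data is loaded in parallel: True uses separate processes, False uses threads.

workers is the number of such processes (or threads). This number should be less than the number of cores of your CPU. Keras warns that a plain generator combined with use_multiprocessing=True and several workers may duplicate data, so watch for that warning in the log.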

x is your generator. Since it yields both input and output, we don’t set y.

verbose controls how detailed the log is.

Since several workers preprocess data in parallel, they put the finished batches into a queue from which Keras takes them to train the NN. max_queue_size limits how many batches can wait in that queue. You should set it, because there is no point in preprocessing more data than Keras can consume: your dataset is huge, and an unbounded queue would fill up the memory.
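The effect of a bounded queue can be shown with the standard library alone. This is only an analogy for what max_queue_size does, not Keras's internal code; the limit of 2 and the batch labels are made up.

```python
import queue
import threading

MAX_QUEUE_SIZE = 2  # hypothetical limit, playing the role of max_queue_size
batch_queue = queue.Queue(maxsize=MAX_QUEUE_SIZE)


def producer(n_batches):
    """Stand-in for a worker that preprocesses batches."""
    for index in range(n_batches):
        # put() blocks while the queue is full, so the producer
        # can never run more than MAX_QUEUE_SIZE batches ahead.
        batch_queue.put(f"batch-{index}")
    batch_queue.put(None)  # sentinel: no more data


def consumer():
    """Stand-in for the training loop; returns what it saw."""
    consumed = []
    largest_backlog = 0
    while True:
        largest_backlog = max(largest_backlog, batch_queue.qsize())
        item = batch_queue.get()
        if item is None:
            break
        consumed.append(item)
    return consumed, largest_backlog


worker = threading.Thread(target=producer, args=(10,))
worker.start()
consumed, largest_backlog = consumer()
worker.join()
print(len(consumed), largest_backlog <= MAX_QUEUE_SIZE)  # 10 True
```

The producer is slowed down to the pace of the consumer, which is exactly why a queue limit keeps memory usage under control.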

epochs is the number of passes Keras will make through your dataset.

validation_data is the data that will be used to validate the model after each epoch.

validation_steps has the same meaning as steps_per_epoch, but for the validation generator.
