Thousands of CSV files, Keras, and TensorFlow. That's what real machine learning looks like! I hope to save you some time by showing how to train NNs using generators, tf.data.Dataset, and other pretty interesting stuff.
I have about 15,000 CSV files with 6,000 rows in each, and I need to train a neural network on all of this data: about 90,000,000 instances in total.
They go for you. Image by the author: Denis Shilov (that's me).
There is no way to concatenate all this data into one file or one array: the result would be far too large to fit in memory. So the only way to handle this huge dataset is to process it in batches:
Chilling, that’s really easy. Important notice: all files should be in one directory for this code to work.
```python
from os import walk

INPUT_DATA_DIR = "split_together_passed/"
TRAIN_DATA_COEFFICIENT = 0.75

files = []
# walk yields (dirpath, dirnames, filenames); we only need the top level
for (dirpath, dirnames, filenames) in walk(INPUT_DATA_DIR):
    files.extend(filenames)
    break

train_files_finish = int(len(files) * TRAIN_DATA_COEFFICIENT)
train_files = files[0:train_files_finish]
validation_files = files[train_files_finish:len(files)]
```
There are several approaches to processing data in batches. One of them is to use a generator: a function that returns an iterator, so we can step through its values one at a time.
Keras allows us to pass this generator to .fit out of the box.
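If generators are new to you, here is a minimal standalone example (unrelated to the CSV code below): calling the function returns an iterator, and each next() resumes execution right after the yield.

```python
def count_up_to(n):
    i = 0
    while i < n:
        yield i  # execution pauses here until the next value is requested
        i += 1

gen = count_up_to(3)
print(next(gen))  # 0
print(next(gen))  # 1
print(list(gen))  # [2] -- the remaining values
```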
So let’s write that iterator!
```python
import pandas as pd

def generate_batches(files, batch_size):
    counter = 0
    while True:
        fname = files[counter]
        counter = (counter + 1) % len(files)
        frame = pd.read_csv(INPUT_DATA_DIR + fname)
        # here is your preprocessing
        input = ...   # so you init input from `frame` here somehow
        output = ...  # so you init output here
        for local_index in range(0, input.shape[0], batch_size):
            input_local = input[local_index:(local_index + batch_size)]
            output_local = output[local_index:(local_index + batch_size)]
            yield input_local, output_local
```
Pretty self-explanatory: you pass in the files array and the batch_size, and the corresponding input and output batches are yielded.
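To convince yourself the generator works, you can try it on a couple of tiny throwaway CSVs. This is a sketch with the preprocessing placeholders filled in: the two-column x/y layout and the generate_batches_demo name are made up for the demo, not part of the original code.

```python
import os
import tempfile
import pandas as pd

# a simplified copy of generate_batches where input/output are
# simply taken from the frame's columns (an assumption for this demo)
def generate_batches_demo(data_dir, files, batch_size):
    counter = 0
    while True:
        fname = files[counter]
        counter = (counter + 1) % len(files)
        frame = pd.read_csv(os.path.join(data_dir, fname))
        inputs = frame[["x"]].values
        outputs = frame[["y"]].values
        for i in range(0, inputs.shape[0], batch_size):
            yield inputs[i:i + batch_size], outputs[i:i + batch_size]

# write two tiny CSVs of 10 rows each into a temp directory
data_dir = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    pd.DataFrame({"x": range(10), "y": range(10)}).to_csv(
        os.path.join(data_dir, name), index=False)

gen = generate_batches_demo(data_dir, ["a.csv", "b.csv"], batch_size=4)
xb, yb = next(gen)
print(xb.shape, yb.shape)  # (4, 1) (4, 1)
```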
Then you can initialize your generators the following way:
```python
batch_size = 18 ** 3

train_generator = generate_batches(files=train_files, batch_size=batch_size)
test_generator = generate_batches(files=validation_files, batch_size=batch_size)
```
batch_size is how many rows will be returned at once. Typically you can set it to the number of rows in a single CSV file, but if that number is too enormous, pick something smaller (say, 5,000).
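For the record, the 18 ** 3 used above works out to 5,832 rows, i.e. just under the 6,000 rows in each file:

```python
batch_size = 18 ** 3
print(batch_size)           # 5832
print(batch_size <= 6_000)  # True: one batch fits inside a single file
```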
```python
from tensorflow.keras.callbacks import EarlyStopping

callback_list = [EarlyStopping(monitor='val_loss', patience=25)]

hist = model.fit(
    x=train_generator,
    steps_per_epoch=len(train_files),
    epochs=100,
    verbose=1,
    callbacks=callback_list,
    validation_data=test_generator,
    validation_steps=len(validation_files),
    max_queue_size=32,
    workers=6,
    use_multiprocessing=True,
)
```
And you fit a model.
callback_list here holds an EarlyStopping callback: it watches val_loss, and if it stops improving for patience epochs in a row, there is no reason to continue training, so Keras stops.
steps_per_epoch tells Keras when it is time to start a new epoch. If you don't provide it, Keras won't know the length of your data and will print "Unknown" in the log.
use_multiprocessing indicates whether you want to preprocess data in several processes, and workers is the number of such processes. This number should be less than the number of cores of your CPU.
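One way to follow that rule programmatically (my own convention, not from the original code) is to derive workers from os.cpu_count():

```python
import os

# leave at least one core free for the main training process;
# cpu_count() can return None, so fall back to a safe default
workers = max(1, (os.cpu_count() or 2) - 1)
print(workers)
```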
x is your generator. As it yields both input and output, we don't set y.
verbose is how detailed the log output is.
As several processes preprocess data for training, they put the results into a queue for Keras to take and train the NN on. max_queue_size limits how many preprocessed batches may wait in that queue. You should set it, as there is no need to preprocess more data than Keras can consume at once: your dataset is huge, and memory would be overloaded otherwise.
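To see why the queue limit matters, here is a rough back-of-the-envelope estimate; the column count of 100 is an assumption purely for illustration.

```python
batch_size = 18 ** 3   # 5,832 rows, as above
n_columns = 100        # assumed feature count (illustrative)
bytes_per_value = 8    # float64
max_queue_size = 32

batch_bytes = batch_size * n_columns * bytes_per_value
queue_mb = batch_bytes * max_queue_size / 1024 ** 2
print(round(queue_mb))  # roughly 142 MB of queued batches
```

With a wider feature matrix or a larger queue, the waiting batches alone can eat gigabytes, which is exactly what the limit prevents.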
epochs is the number of passes Keras will make through your dataset.
validation_data is the data which will be used to validate the accuracy.
validation_steps has the same meaning as steps_per_epoch, but for the validation data.
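The intro also mentioned tf.data.Dataset: as an alternative to passing the raw generator, you can wrap the same logic with Dataset.from_generator and let tf.data handle prefetching. This is a hedged sketch, not the article's original code: the stand-in generator, shapes, and dtypes are placeholders you would adapt to your data.

```python
import numpy as np
import tensorflow as tf

# stand-in for the article's generate_batches (yields dummy data here)
def generate_batches(files, batch_size):
    while True:
        yield (np.zeros((batch_size, 3), np.float32),
               np.zeros((batch_size, 1), np.float32))

dataset = tf.data.Dataset.from_generator(
    lambda: generate_batches(files=["a.csv"], batch_size=4),
    output_signature=(
        tf.TensorSpec(shape=(None, 3), dtype=tf.float32),  # inputs
        tf.TensorSpec(shape=(None, 1), dtype=tf.float32),  # outputs
    ),
).prefetch(tf.data.AUTOTUNE)

xb, yb = next(iter(dataset))
print(xb.shape)  # (4, 3)
```

You could then call model.fit(dataset, steps_per_epoch=..., ...) just as with the generator, while prefetch overlaps preprocessing with training.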