Keras: Training on Large Datasets

[I’ve started writing these short articles as a series of ‘note-to-self’ reminders. I’m sharing them in the hope that someone else finds them useful too.]

If you have ever tried to train a network on a large dataset (e.g., images) that does not fit in memory, you will have discovered that you need some sort of data generator that creates data in batches and feeds them to your network for training.

TensorFlow has its own generator, but the API is unnecessarily complex and it’s easy to get it wrong. I’m a big fan of Keras because it removes the overhead of learning a cumbersome API (that’ll probably change in the next few months, anyway) and makes it possible to focus on your design.

Another advantage of using the Sequence class in Keras as a batch generator is that Keras handles all the multithreading and parallelization to ensure that, as much as possible, your training (backprop) does not have to wait for batch generation. It does this behind the scenes by prefetching batches on multiple CPU cores.

Use Sequence Class from Keras

The Keras documentation has the perfect example (I just refactored some variable names to make it clearer):

from skimage.io import imread
from skimage.transform import resize
from keras.utils import Sequence
import numpy as np

class My_Generator(Sequence):

    def __init__(self, image_filenames, labels, batch_size):
        self.image_filenames, self.labels = image_filenames, labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; Keras expects an int here.
        return int(np.ceil(len(self.image_filenames) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Slice out the idx-th batch of filenames and labels.
        batch_x = self.image_filenames[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]

        # Load and resize each image, then return (images, labels).
        return np.array([
            resize(imread(file_name), (200, 200))
               for file_name in batch_x]), np.array(batch_y)

Let’s go through this example and see what it does:

The class definition: our generator class inherits from the Keras Sequence class.

__init__: we can feed parameters to our generator here. In this example, we pass the image filenames as self.image_filenames, their corresponding labels as self.labels, and the batch size as self.batch_size.

__len__: this method computes the number of batches that this generator is supposed to produce. So, we divide the total number of samples by the batch_size, round up, and return that value as an int (Keras expects an integer here, so the result of np.ceil must be cast).
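To sanity-check the arithmetic in __len__, here is a quick computation with hypothetical numbers (1050 images and a batch size of 32 are my own example, not from the Keras docs):

```python
import numpy as np

# Hypothetical numbers for illustration: 1050 images, batches of 32.
num_samples = 1050
batch_size = 32

# Same computation as __len__: 32 full batches plus one final batch
# holding the remaining 1050 - 32*32 = 26 images.
num_batches = int(np.ceil(num_samples / float(batch_size)))
print(num_batches)  # 33
```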

__getitem__: here, given the batch index idx, we put together one batch of data and its ground truth (GT). In this example, we read a batch of self.batch_size images and return it as the pair (image_batch, GT).
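The slicing in __getitem__ is easiest to see with a few hypothetical filenames (the names and batch size below are made up for illustration):

```python
# Ten hypothetical filenames, batches of four.
image_filenames = ["img_%d.png" % i for i in range(10)]
batch_size = 4

# Same slice as in __getitem__: idx = 1 selects items 4..7.
idx = 1
batch_x = image_filenames[idx * batch_size:(idx + 1) * batch_size]
print(batch_x)  # ['img_4.png', 'img_5.png', 'img_6.png', 'img_7.png']
```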

OK, so we have created our data generator. The next step is to create instances of this class and feed them to fit_generator:

my_training_batch_generator = My_Generator(training_filenames, GT_training, batch_size)
my_validation_batch_generator = My_Generator(validation_filenames, GT_validation, batch_size)

model.fit_generator(generator=my_training_batch_generator,
                    steps_per_epoch=(num_training_samples // batch_size),
                    epochs=num_epochs,
                    verbose=1,
                    validation_data=my_validation_batch_generator,
                    validation_steps=(num_validation_samples // batch_size),
                    use_multiprocessing=True,
                    workers=16,
                    max_queue_size=32)

First, we instantiate two instances of My_Generator (one for training and one for validation) and initialize them with the image filenames and the ground truth for the training and validation sets.

Then, we pass the two instances to fit_generator. That’s it; we are done!
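The snippet above assumes you already have training_filenames, GT_training, and their validation counterparts. A minimal sketch of building them with a plain 80/20 split (the ratio, the random seed, and the dummy data are my assumptions, not from the original example):

```python
import random

# Hypothetical filenames and binary labels standing in for a real dataset.
all_filenames = ["img_%d.png" % i for i in range(1000)]
all_labels = [i % 2 for i in range(1000)]

# Shuffle filename/label pairs together so they stay aligned.
pairs = list(zip(all_filenames, all_labels))
random.seed(42)
random.shuffle(pairs)

# 80% for training, the remaining 20% for validation.
split = int(0.8 * len(pairs))
training_filenames, GT_training = map(list, zip(*pairs[:split]))
validation_filenames, GT_validation = map(list, zip(*pairs[split:]))

print(len(training_filenames), len(validation_filenames))  # 800 200
```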

If you have multiple CPU cores, set use_multiprocessing to True so that the generators can run in parallel on the CPU. Set workers to the number of CPU cores that you want to allocate to batch generation.
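If you would rather have workers track the machine’s actual core count instead of hard-coding a number like 16, one option (my suggestion, not from the Keras docs) is Python’s multiprocessing module:

```python
import multiprocessing

# Use however many cores this machine reports; the value is then
# passed to fit_generator as workers=workers.
workers = multiprocessing.cpu_count()
print(workers)
```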
