Big Data
Ramin Rezaiifar, Mar 18, 2018
[I’ve started writing these short articles as a series of ‘note-to-self’ reminders. I’m sharing them in the hope that someone else finds them useful too.]
If you have ever tried to train a network on tons of data (e.g., images) that does not fit in memory, you will have discovered that you need some sort of data generator that creates the data in batches and feeds it to your network for training.
TensorFlow has its own generator, but the API is unnecessarily complex and it’s easy to get it wrong. I’m a big fan of Keras because it removes the overhead of learning a cumbersome API (that’ll probably change in the next few months, anyway) and makes it possible to focus on your design.
Another advantage of using the Sequence class in Keras as a batch generator is that Keras handles all the multi-threading and parallelization to ensure that, as much as possible, your training (backprop) does not have to wait for batch generation. It does this behind the scenes by fetching batches ahead of time using multiple CPU cores.
The Keras documentation has the perfect example (I just renamed some variables to make it clearer):
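A minimal sketch of such a generator, written in the spirit of the Keras documentation example rather than copied from it. It assumes TensorFlow's bundled Keras for the `Sequence` import (the fallback to a plain base class is there only so the sketch runs without TensorFlow), and `load_batch` is a hypothetical callable, not part of Keras, that turns a list of filenames into one image array, for instance by reading, resizing and stacking the files with NumPy.

```python
import math

try:
    from tensorflow.keras.utils import Sequence
except ImportError:
    # Assumption: fall back to a plain base class so the sketch can be
    # exercised without TensorFlow; real training needs the Keras class.
    Sequence = object


class My_Generator(Sequence):
    """Serves (image_batch, label_batch) pairs by batch index."""

    def __init__(self, image_filenames, labels, batch_size, load_batch):
        super().__init__()
        # load_batch is hypothetical: list of filenames -> array of images.
        self.image_filenames = image_filenames
        self.labels = labels
        self.batch_size = batch_size
        self.load_batch = load_batch

    def __len__(self):
        # Number of batches per epoch; round up to keep the last partial batch.
        return math.ceil(len(self.image_filenames) / self.batch_size)

    def __getitem__(self, idx):
        # Slice out the filenames and labels that belong to batch `idx`.
        start = idx * self.batch_size
        stop = start + self.batch_size
        batch_files = self.image_filenames[start:stop]
        batch_labels = self.labels[start:stop]
        return self.load_batch(batch_files), batch_labels
```

For actual training the labels would normally be a NumPy array as well, so that each label slice is an array too.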
Let’s go through this example and see what it does:
Line 5: Our generator class inherits from the Sequence class.
Line 7: We can feed parameters to our generator here. In this example, we pass the image filenames, their corresponding labels, and the batch size, which are stored as self.image_filenames, self.labels, and self.batch_size.
Line 11: This function computes the number of batches that this generator is supposed to produce. So, we divide the total number of samples by batch_size and return that value, rounded up so that a final partial batch is still counted.
Line 14: Here, given the batch number idx, you need to put together a list that consists of the data batch and the ground truth (GT). In this example, we read a batch of images of size self.batch_size and return an array of the form [image_batch, GT].
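The two index computations described above can be checked in isolation. This is a small sketch with made-up numbers (10 samples, batches of 4); `batch_count` and `batch_slice` are illustrative helper names, not Keras functions.

```python
import math


def batch_count(num_samples, batch_size):
    # Round up so that a trailing partial batch is not dropped.
    return math.ceil(num_samples / batch_size)


def batch_slice(idx, batch_size):
    # Half-open range of sample positions that make up batch `idx`.
    return slice(idx * batch_size, (idx + 1) * batch_size)


samples = list(range(10))  # stand-in for 10 image filenames
num_batches = batch_count(len(samples), 4)
batches = [samples[batch_slice(i, 4)] for i in range(num_batches)]
print(num_batches)  # 3
print(batches)      # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Plain integer division would report 2 batches here, and the last two samples would never be used.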
Ok, so we have created our data generator. The next step is to create instances of this class and feed them to fit_generator:
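A sketch of what this step can look like, assuming a compiled Keras model from a release that still provides `fit_generator` (later TensorFlow releases deprecate it in favor of `fit`, which accepts a `Sequence` directly). The function name `train` and its parameters are illustrative; the two generators are expected to be `My_Generator` instances built from the training and validation filenames and labels.

```python
def train(model, training_batches, validation_batches, num_epochs, num_workers):
    """Run training with both batch generators (Sequence instances)."""
    return model.fit_generator(
        generator=training_batches,
        validation_data=validation_batches,
        epochs=num_epochs,
        verbose=1,
        use_multiprocessing=True,  # build batches in separate processes
        workers=num_workers,       # processes used for batch generation
    )
```

Because both generators are `Sequence` objects, the number of steps per epoch can be left out; Keras takes it from the length of each generator.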
Lines 1, 2: Create two instances of My_Generator (one for training and one for validation), initializing each with the image filenames and ground truth of its own set.
Line 4: Feed the two instances of the My_Generator class created above to fit_generator. That’s it; we are done!
If you have multiple CPU cores, set use_multiprocessing to True so that the generators can run in parallel on the CPU. Set workers to the number of CPU cores that you want to allocate to batch generation.
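One way to pick these two values, sketched with the standard library only. Leaving one core free for the training loop is a rule of thumb assumed here, not a Keras recommendation, and `generator_settings` is an illustrative helper name.

```python
import os


def generator_settings(reserved_cores=1):
    # os.cpu_count() may return None, so fall back to a single core.
    cores = os.cpu_count() or 1
    workers = max(1, cores - reserved_cores)
    # Multiprocessing only pays off when more than one worker is available.
    return {"use_multiprocessing": workers > 1, "workers": workers}


settings = generator_settings()
print(settings)
```

The resulting dictionary can then be unpacked into the training call as keyword arguments.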