# Big Data

## Keras: Training on Large Datasets <a href="#d39b" id="d39b"></a>

[Ramin Rezaiifar](https://medium.com/@RawMean) · Mar 18, 2018

*\[I’ve started writing these short articles as a series of ‘note-to-self’ reminders. I’m sharing them in the hope that someone else finds them useful too.]*

If you have ever tried to train a network on tons of data (e.g., images) that do not fit in memory, you have discovered that you need some sort of data generator that creates data in batches and feeds it to your network for training.

TensorFlow has its own generator, but the API is unnecessarily complex and easy to get wrong. I’m a big fan of Keras because it removes the overhead of learning a cumbersome API (that’ll probably change in the next few months, anyway) and makes it possible to focus on your design.

Another advantage of using the `Sequence` class in Keras as a batch generator is that Keras handles all the multithreading and parallelization to ensure that, as much as possible, your training (backprop) does not have to wait for batch generation. It does this behind the scenes by fetching batches ahead of time on multiple CPU cores.

#### Use Sequence Class from Keras <a href="#id-387c" id="id-387c"></a>

The Keras [documentation](https://keras.io/utils/#sequence) has the perfect example (I just refactored some variable names to make it clearer):

```python
from skimage.io import imread
from skimage.transform import resize
import numpy as np
from keras.utils import Sequence

class My_Generator(Sequence):

    def __init__(self, image_filenames, labels, batch_size):
        self.image_filenames, self.labels = image_filenames, labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; round up so the last partial batch counts.
        return int(np.ceil(len(self.image_filenames) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Slice out the filenames and labels for batch number idx.
        batch_x = self.image_filenames[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]

        return np.array([
            resize(imread(file_name), (200, 200))
            for file_name in batch_x]), np.array(batch_y)
```

Let’s go through this example and see what it does:

**The class definition**: Our generator class inherits from the Keras `Sequence` class.

**`__init__`**: We can feed parameters to our generator here. In this example, we pass image filenames as `self.image_filenames`, their corresponding labels as `self.labels`, and the batch size as `self.batch_size`.

**`__len__`**: This method computes the number of batches that this generator is supposed to produce. So, we divide the total number of samples by the `batch_size`, round up, and return that value.
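To make the rounding concrete, here is the same calculation with made-up numbers (1,050 samples and a batch size of 32 are my own illustrative choices, not values from the example):

```python
import numpy as np

# Hypothetical dataset: 1,050 samples, batches of 32.
num_samples, batch_size = 1050, 32

# Same computation as __len__: round up so the final, partial batch is counted.
num_batches = int(np.ceil(num_samples / float(batch_size)))
print(num_batches)  # 33 — the last batch holds only 1050 - 32*32 = 26 samples
```

Using plain integer division would give 32 and silently drop the last 26 samples every epoch.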

**`__getitem__`**: Here, given the batch index `idx`, you need to put together the data batch and the ground truth (GT). In this example, we read a batch of `self.batch_size` images and return a tuple of the form `(image_batch, GT)`.
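The slicing in `__getitem__` also handles the final, smaller batch for free: Python slices clip at the end of a list rather than raising an error. A minimal sketch with made-up filenames (no Keras needed) shows the behavior:

```python
# Hypothetical filenames and labels, purely to illustrate the batch slicing.
filenames = ["img_%d.png" % i for i in range(10)]
labels = list(range(10))
batch_size = 4

idx = 2  # the third (and last) batch
batch_x = filenames[idx * batch_size:(idx + 1) * batch_size]
batch_y = labels[idx * batch_size:(idx + 1) * batch_size]

print(batch_x)  # ['img_8.png', 'img_9.png'] — the slice clips at the end of the list
print(batch_y)  # [8, 9]
```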

OK, so we have created our data generator. The next step is to create instances of this class and feed them to `fit_generator`:

```python
my_training_batch_generator = My_Generator(training_filenames, GT_training, batch_size)
my_validation_batch_generator = My_Generator(validation_filenames, GT_validation, batch_size)

model.fit_generator(generator=my_training_batch_generator,
                    steps_per_epoch=(num_training_samples // batch_size),
                    epochs=num_epochs,
                    verbose=1,
                    validation_data=my_validation_batch_generator,
                    validation_steps=(num_validation_samples // batch_size),
                    use_multiprocessing=True,
                    workers=16,
                    max_queue_size=32)
```

**Creating the generators**: Instantiate two instances of `My_Generator` (one for training and one for validation) and initialize them with the image filenames and ground truth for the training and validation sets.

**Calling `fit_generator`**: Feed the two instances of the `My_Generator` class created above to `fit_generator`. That’s it; we are done!

If you have multiple CPU cores, set `use_multiprocessing` to `True` so that the generators can run in parallel on the CPU. Set `workers` to the number of CPU cores that you want to allocate to batch generation.
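One simple heuristic for choosing `workers` (my own rule of thumb, not something the Keras docs prescribe) is to start from the machine’s core count and tune from there:

```python
import multiprocessing

# Start with one generator worker per CPU core; reduce if batch generation
# competes with other processes for CPU time.
workers = multiprocessing.cpu_count()
print(workers)
```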
