syaffers.xyz

Building efficient custom Datasets in PyTorch

#python #deep-learning #tutorials

May 15, 2019

PyTorch has been coming up in my circles as of late, and I had to try it out despite being comfortable with Keras and TensorFlow for a while. Surprisingly, I found it quite refreshing and likable, especially as PyTorch features a Pythonic API, a more opinionated programming pattern and a good set of built-in utility functions. One feature I particularly enjoy is the ability to easily craft a custom Dataset object which can then be used with the built-in DataLoader to feed data when training a model.

In this article, I will be exploring the PyTorch Dataset object from the ground up, with the objective of building a dataset that handles text files and showing how one could go about optimizing the pipeline for a certain task. We start by going over the basics of the Dataset utility with a toy example and work our way up to the real task. Specifically, we want to create a pipeline that feeds character first names from The Elder Scrolls (TES) series, along with the race and gender of each name, as one-hot tensors. You can find this dataset on my website.

Basics of the Dataset class

PyTorch gives you the freedom to pretty much do anything with the Dataset class so long as you override two of its functions: the __len__ function, which reports the size of the dataset, and the __getitem__ function, which returns a sample from the dataset given an index.

The size of the dataset can sometimes be a grey area, but it is equal to the number of samples that you have in the entire dataset. So if you have 10,000 words (or data points, images, sentences, etc.) in your dataset, the __len__ function should return 10,000.

A minimal working example

Let’s first mock a simple dataset by creating a Dataset of all numbers from 1 to 1000. We’ll aptly name this the NumbersDataset.

from torch.utils.data import Dataset

class NumbersDataset(Dataset):
    def __init__(self):
        self.samples = list(range(1, 1001))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


if __name__ == '__main__':
    dataset = NumbersDataset()
    print(len(dataset))
    print(dataset[100])
    print(dataset[122:361])
$ python datasets.py
1000
101
[123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
 ...

Pretty simple, right? First, when we initialize the NumbersDataset, we immediately create a list called samples which stores all the numbers between 1 and 1000. The name samples is arbitrary, so feel free to use whatever name you feel comfortable with. The overridden functions are self-explanatory (I hope!) and operate on the list which was initialized in the constructor. If you run the file, you will see 1000, 101 and a list of numbers printed out, which are the length of the dataset, the value at index 100 and the slice of the dataset from index 122 up to (but not including) index 361, respectively.

Extending the dataset

Let’s extend this dataset so that it can store all whole numbers between an interval low and high.

from torch.utils.data import Dataset

class NumbersDataset(Dataset):
    def __init__(self, low, high):
        self.samples = list(range(low, high))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


if __name__ == '__main__':
    dataset = NumbersDataset(2821, 8295)
    print(len(dataset))
    print(dataset[100])
    print(dataset[122:361])
$ python datasets.py
5474
2921
[2943, 2944, 2945, 2946, 2947, 2948, 2949, 2950, 2951, 2952, 2953,
 2954, 2955, 2956, 2957, 2958, 2959, 2960, 2961, 2962, 2963, 2964,
 ...

The code above should print 5474, 2921 and the list of numbers from 2943 to 3181. By editing the constructor, we can now set arbitrary low and high values of the dataset to our heart’s content. This simple change shows what kind of mileage we can get from the PyTorch Dataset class. We can generate multiple different datasets and play around with the values without having to think about coding a new class or creating many hard-to-follow matrices as we would in NumPy, for example.
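
As a quick illustration of that point (the ranges here are arbitrary), we can keep several of these datasets around at once without writing any new code:

small_dataset = NumbersDataset(1, 101)         # 100 samples
large_dataset = NumbersDataset(1, 1000001)     # 1,000,000 samples
print(len(small_dataset), len(large_dataset))  # 100 1000000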

Reading data from files

Let’s take this idea of extending the functionality of the Dataset class much further. PyTorch interfaces with the Python standard library quite gracefully, meaning that you do not have to feel afraid of integrating features that you already know and love. Here, we will be using the os module and plain file I/O to read the TES character names from a directory of text files and build up our list of samples.

For reference, the TES character names dataset has the following directory structure:

.
|-- Altmer/
|   |-- Female
|   `-- Male
|-- Argonian/
|   |-- Female
|   `-- Male
... (truncated for brevity)
`-- Redguard/
    |-- Female
    `-- Male

Each of the files contains TES character names separated by newlines so we must read each file, line by line, to capture all of the names of the characters for each race and gender.

import os
from torch.utils.data import Dataset

class TESNamesDataset(Dataset):
    def __init__(self, data_root):
        self.samples = []

        for race in os.listdir(data_root):
            race_folder = os.path.join(data_root, race)

            for gender in os.listdir(race_folder):
                gender_filepath = os.path.join(race_folder, gender)

                with open(gender_filepath, 'r') as gender_file:
                    for name in gender_file.read().splitlines():
                        self.samples.append((race, gender, name))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


if __name__ == '__main__':
    dataset = TESNamesDataset('/home/syafiq/Data/tes-names/')
    print(len(dataset))
    print(dataset[420])

Let’s go through the code: we first create an empty samples list and populate it by going through each race folder and gender file, reading each file line by line for the names. The race, gender and name are then stored as a tuple and appended to the samples list. Running the file should print 19491 and ('Bosmer', 'Female', 'Gluineth') (though the exact sample may differ from one computer to another). Let’s have a look at what it would look like if we sliced the dataset into a batch:

# change the main function to the following:
if __name__ == '__main__':
    dataset = TESNamesDataset('/home/syafiq/Data/tes-names/')
    print(dataset[10:60])
$ python datasets.py
[('Bosmer', 'Female', 'Agafos'), ('Bosmer', 'Female', 'Agalrin'),
 ('Bosmer', 'Female', 'Agilruin'), ('Bosmer', 'Female', 'Aglaril'),
 ...

As you might expect, it works exactly as a typical list would. To sum up this section, we have just introduced standard Python I/O into the PyTorch dataset and we did not need any other special wrappers or helpers, just pure Python. In fact, we can also include other libraries like NumPy or Pandas and, with a bit of clever manipulation, have them play well with PyTorch. Let’s stop there for now and look at how to efficiently iterate through the dataset in the case of a training loop.
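
As a rough sketch of that idea (the CSV file and its race, gender and name columns here are hypothetical), a Pandas-backed dataset looks much the same as the file-reading one above:

import pandas as pd
from torch.utils.data import Dataset

class CSVNamesDataset(Dataset):
    def __init__(self, csv_path):
        # load the whole CSV once; each row becomes one sample
        self.frame = pd.read_csv(csv_path)

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        # return plain Python values; tensor conversion can happen later
        return row['race'], row['gender'], row['name']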

Flowing data with the DataLoader

While the Dataset class is a nice way of containing data systematically, it seems that in a training loop we would need to index or slice the dataset’s samples list. This is no better than what we would do for a typical list or NumPy matrix. Rather than going down that route, PyTorch supplies another utility called the DataLoader which acts as a data feeder for a Dataset object. The parallel I see here is the data generator flow function in Keras, if you are familiar with that. The DataLoader takes a Dataset object (and, therefore, any subclass extending it) and several other optional parameters (listed in the PyTorch DataLoader docs). Among the parameters, we have the option of shuffling the data, determining the batch size and the number of workers to load data in parallel. Here is a simple example of flowing through the TESNamesDataset in an enumerate loop.

# change the main function to the following:
if __name__ == '__main__':
    from torch.utils.data import DataLoader
    dataset = TESNamesDataset('/home/syafiq/Data/tes-names/')
    dataloader = DataLoader(dataset, batch_size=50, shuffle=True, num_workers=2)

    for i, batch in enumerate(dataloader):
        print(i, batch)

As you watch the torrent of batches get printed out, you might notice that each batch is a list of three tuples: a bunch of races in the first tuple, the genders in the next and the names in the last.

$ python datasets.py
...
389 [('Argonian', 'Dunmer', 'Breton', ...),
     ('Female', 'Female', 'Female', ...),
     ('Seed-Neeus', 'Vaynonah', 'Amarie', ...)]

Hang on, that is not what it looked like when we sliced our dataset earlier! What’s going on here? Well, as it turns out, the DataLoader loads the data in a systematic way such that we stack data vertically instead of horizontally. This is particularly useful for flowing batches of tensors, as tensors stack vertically (i.e. in the first dimension) to form batches. The DataLoader also handles the shuffling of data for you, so there’s no need to shuffle matrices or keep track of indices when feeding data.
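
Since each batch arrives as a (races, genders, names) triple, a training loop can unpack it directly; here is a minimal sketch reusing the dataloader from the snippet above:

# each batch is a (races, genders, names) tuple, so we can unpack it directly
for races, genders, names in dataloader:
    # each of these is a tuple of batch_size strings
    print(races[0], genders[0], names[0])  # first sample of the batch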

Flowing tensors and other types

To explore further how different types of data are flowed by the DataLoader, we will update the numbers dataset we mocked earlier to yield a pair of tensors: a tensor of the 4 successor values of each number in the dataset, and the same successor tensor with some random noise added to it. To throw the DataLoader a curveball, we will also return the number itself, not as a tensor but as a Python string. In total, the __getitem__ function returns three heterogeneous data items in a tuple.

from torch.utils.data import Dataset
import torch

class NumbersDataset(Dataset):
    def __init__(self, low, high):
        self.samples = list(range(low, high))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        n = self.samples[idx]
        successors = torch.arange(4).float() + n + 1
        noisy = torch.randn(4) + successors
        return str(n), successors, noisy


if __name__ == '__main__':
    from torch.utils.data import DataLoader

    dataset = NumbersDataset(100, 120)
    dataloader = DataLoader(dataset, batch_size=10, shuffle=True)
    print(next(iter(dataloader)))

Note that we have not changed the dataset constructor but rather the __getitem__ function. Good practice for PyTorch datasets is to keep in mind how the dataset will scale with more and more samples; therefore, we do not want to store too many tensors in memory at runtime in the Dataset object. Instead, we form the tensors as we iterate through the samples list, trading off a bit of speed for memory. I will explain how this is useful in the following sections.

$ python datasets.py
[
    ('109', '100', '114', '107', ...),
    tensor([[110., 111., 112., 113.],
            [101., 102., 103., 104.],
            ...
          ),
    tensor([[109.9972, 110.9166, 111.8880, 112.2871],
            [100.3679, 101.2552, 102.7933, 103.5752],
            ...
          )
]

Looking at the output above, although our new __getitem__ function returns a monstrous tuple of a string and tensors, the DataLoader is able to recognize the data and stack them accordingly. The stringified numbers are collected into a tuple whose size matches the loader’s configured batch size. For the two tensors, the DataLoader vertically stacks them into tensors of size 10x4. This is because we configured the batch size to be 10 and each of the two tensors returned from the __getitem__ function is of size 4.

In general, the loader will try stacking batches of 1-dimensional tensors into 2-dimensional tensors, batches of 2-dimensional tensors into 3-dimensional tensors, and so on. At this point, I implore you to realize the life-changing impact this has on traditional data handling in other machine learning libraries and how clean the solution looks. It is quite incredible! If you are not sharing my sentiments, well, at least you now know one other method that you can have in your toolbox.
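
Under the hood, this default collation behaves roughly like torch.stack along a new first dimension; a quick sketch of the shape arithmetic from the example above:

import torch

# ten 1-dimensional tensors of size 4 stack into a single 10x4 tensor,
# which is exactly the shape the DataLoader produced above
batch = torch.stack([torch.arange(4).float() + n for n in range(100, 110)])
print(batch.shape)  # torch.Size([10, 4])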

Completing the TES dataset code

Let’s get back to the TES names dataset. It seems like the initialization function is a little dirty (at least by my standards) and there should really be a way to make the code look better. Remember that I said the PyTorch API is Pythonic? Well, there is no stopping you from declaring other utility functions in your dataset or even making internal functions for initialization. To clean up the TES names dataset, we will update the TESNamesDataset code to achieve the following: move the file-processing logic into an internal initialization function, encode the nominal values (race, gender and the characters of each name) with codecs, and have __getitem__ return one-hot tensors instead of raw strings.

To enable the utility functions to work well, we will get some help from the scikit-learn library to encode nominal values (i.e. our race, gender, and name data). Specifically, we will need the LabelEncoder class. As we are making a large update to the code, I will explain the changes in the next several subsections.
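
If you have not used it before, here is a quick sketch of how LabelEncoder behaves (the race values are just examples):

from sklearn.preprocessing import LabelEncoder

codec = LabelEncoder()
codec.fit(['Dunmer', 'Imperial', 'Khajiit'])
print(codec.classes_)                          # ['Dunmer' 'Imperial' 'Khajiit']
print(codec.transform(['Khajiit', 'Dunmer']))  # [2 0]
print(codec.inverse_transform([1]))            # ['Imperial']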

import os
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
import torch

class TESNamesDataset(Dataset):
    def __init__(self, data_root, charset):
        self.data_root = data_root
        self.charset = charset
        self.samples = []
        self.race_codec = LabelEncoder()
        self.gender_codec = LabelEncoder()
        self.char_codec = LabelEncoder()
        self._init_dataset()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        race, gender, name = self.samples[idx]
        return self.one_hot_sample(race, gender, name)

    def _init_dataset(self):
        races = set()
        genders = set()

        for race in os.listdir(self.data_root):
            race_folder = os.path.join(self.data_root, race)
            races.add(race)

            for gender in os.listdir(race_folder):
                gender_filepath = os.path.join(race_folder, gender)
                genders.add(gender)

                with open(gender_filepath, 'r') as gender_file:
                    for name in gender_file.read().splitlines():
                        self.samples.append((race, gender, name))

        self.race_codec.fit(list(races))
        self.gender_codec.fit(list(genders))
        self.char_codec.fit(list(self.charset))

    def to_one_hot(self, codec, values):
        value_idxs = codec.transform(values)
        return torch.eye(len(codec.classes_))[value_idxs]

    def one_hot_sample(self, race, gender, name):
        t_race = self.to_one_hot(self.race_codec, [race])
        t_gender = self.to_one_hot(self.gender_codec, [gender])
        t_name = self.to_one_hot(self.char_codec, list(name))
        return t_race, t_gender, t_name


if __name__ == '__main__':
    import string

    data_root = '/home/syafiq/Data/tes-names/'
    charset = string.ascii_letters + "-' "
    dataset = TESNamesDataset(data_root, charset)
    print(len(dataset))
    print(dataset[420])

The constructor-init split

There is quite a bit of change here, so let’s go through it bit by bit. Starting with the constructor, you may have noticed that it is clear of any file-processing logic; we have moved this logic into the _init_dataset function and cleaned up the constructor. Additionally, we have added some (as yet unfitted) codecs to convert nominal values from the original strings into integers and back. The samples list starts out empty and is populated in the _init_dataset function. The constructor also takes a new argument called charset. As the name suggests, it is just a string of characters which enables the char_codec to convert characters into integers.

The file-processing functionality has been augmented with a couple of sets to capture the unique nominal values like race and gender as we iterate through the folders. This can be useful if you don’t have well-structured datasets; for example, if the Argonians had another set of names which are gender agnostic, we would have a file called “Unknown” and this would be put into the set of genders regardless of the existence of “Unknown” genders for other races. After all the names have been stored, we initialize the codecs by fitting them to the sets of unique races, genders and characters in our character set.

Utility functions

There are two utility functions that were added: to_one_hot and one_hot_sample. to_one_hot uses the internal codecs of the dataset to first convert a list of values into a list of integers before applying a seemingly out-of-place torch.eye function. This is actually a neat hack to quickly convert a list of integers into one-hot vectors. The torch.eye function creates an identity matrix of an arbitrary size which has a value of 1 on its diagonal. If you index the matrix rows, you get a row vector with the value of 1 at that index, which is the definition of a one-hot vector!

+----------+                     +---+
| Imperial |                     | 2 |     torch.eye(4)    [0 0 1 0]
| Khajiit  |      +-------+      | 3 |      (1 0 0 0)      [0 0 0 1]
| Dunmer   | ---> | Race  | ---> | 1 | ---> (0 1 0 0) ---> [0 1 0 0]
| Imperial |      | Codec |      | 2 |      (0 0 1 0)      [0 0 1 0]
| Khajiit  |      +-------+      | 3 |      (0 0 0 1)      [0 0 0 1]
| Imperial |     transform()     | 2 |      (indexing)     [0 0 1 0]
+----------+                     +---+
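
Here is the trick in isolation: indexing the identity matrix with a list of integers pulls out the corresponding one-hot rows.

import torch

eye = torch.eye(4)        # 4x4 identity matrix
print(eye[[2, 3, 1, 2]])  # one-hot row vectors for indices 2, 3, 1 and 2
# tensor([[0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.]])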

Because we need to convert three values into tensors, we call the to_one_hot function with each codec and its corresponding data. This is composed in the one_hot_sample function, which converts a single sample into a tuple of tensors. The race and gender get converted into 2-dimensional tensors, which are really just expanded row vectors. The name also gets converted into a 2-dimensional tensor, comprising one one-hot row vector for each character of the name.
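
To make the shapes concrete (assuming the ten TES races and two genders present in this dataset, and reusing the dataset object from the main block above), a single sample comes out looking like this:

t_race, t_gender, t_name = dataset[420]
print(t_race.shape)    # torch.Size([1, 10]) -- one row, ten races
print(t_gender.shape)  # torch.Size([1, 2])  -- one row, two genders
print(t_name.shape)    # (length of name, size of the character set)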

The __getitem__ call

Finally, the __getitem__ function has been updated to only call the one_hot_sample function given the race, gender, and name of a sample. Notice that we do not need to prepare the tensors beforehand in the samples list but rather the tensors are formed only when the __getitem__ function is called, which is when the DataLoader flows the data. This makes the dataset very scalable when you have hundreds of thousands of samples to flow during training.

You can imagine how this dataset could be used in a vision training scenario. The dataset would hold a list of filenames and the path to the directory of images, leaving the __getitem__ function to read the image files and convert them into tensors just in time for training. This can be made to run much faster by providing an appropriate number of workers to the DataLoader to process multiple image files in parallel. The PyTorch data loading tutorial covers image datasets and loaders in more detail and complements datasets with the torchvision package (that is often installed alongside PyTorch) for computer vision purposes, making image manipulation pipelines (like whitening, normalization, random shifting, etc.) very easy to construct.
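
A hand-rolled sketch of that idea might look like the following (the flat directory of image files is an assumption for illustration; torchvision’s own dataset classes already cover the common layouts):

import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImageFilesDataset(Dataset):
    def __init__(self, image_root):
        # assumes image_root is a flat folder of image files
        self.image_root = image_root
        self.filenames = sorted(os.listdir(image_root))
        # converts a PIL image into a CxHxW float tensor in [0, 1]
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        # the file is only opened when the DataLoader asks for this sample
        path = os.path.join(self.image_root, self.filenames[idx])
        with Image.open(path) as image:
            return self.to_tensor(image.convert('RGB'))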

Back to this article. The dataset checks out and it looks like we are ready to use this for training…

… but we are not

If we try to flow the data using the DataLoader with a batch size greater than 1, we will be greeted with an error:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 7 and 9 in dimension 1 at /opt/conda/conda-bld/pytorch-cpu_1544218667092/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333

The astute among you might have seen this coming but the reality is that text data rarely come in fixed lengths from one sample to another. As a result, the DataLoader tries to batch multiple name tensors of different lengths, which is impossible in the tensor format, as it would also be in NumPy arrays. To illustrate this problem, consider the case when we have names like ‘John’ and ‘Steven’ to stack together into a single one-hot matrix. ‘John’ translates into a 2-dimensional tensor of size 4xC and ‘Steven’ translates into a 2-dimensional tensor of size 6xC where C is the length of the character set. The DataLoader tries to batch the names into a 3-dimensional tensor 2x?xC (think of stacking tensors of sizes 1x4xC and 1x6xC). Due to the mismatch in the second dimension, the DataLoader raises an error as it could not proceed.
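
You can reproduce the failure directly with two mismatched name tensors, which is essentially what the DataLoader attempts behind the scenes (the character set size here is just an assumption):

import torch

C = 55                       # assumed size of the character set
john = torch.zeros(4, C)     # 'John' -> 4 one-hot rows
steven = torch.zeros(6, C)   # 'Steven' -> 6 one-hot rows
torch.stack([john, steven])  # RuntimeError: sizes 4 and 6 do not match in dimension 1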

Possible solutions

To remedy this, here are two approaches, each with its pros and cons: have the DataLoader pad each batch on the fly through a custom collate_fn, or pad every name to a fixed length in the dataset itself.

I’m going with the second option, for the purposes of this article, to show that you need very few changes to the overall data pipeline to achieve this. Note that this also works for any sequential data of differing lengths (although there are various methods to pad data, see the options in NumPy and in PyTorch). In my use case, I have opted to pad the names with zeros so I updated the constructor and _init_dataset functions:

...
    def __init__(self, data_root, charset, length):
        self.data_root = data_root
        self.charset = charset + '\0'
        self.length = length
    ...
        with open(gender_filepath, 'r') as gender_file:
            for name in gender_file.read().splitlines():
                if len(name) < self.length:
                    name += '\0' * (self.length - len(name))
                else:
                    name = name[:self.length-1] + '\0'
                self.samples.append((race, gender, name))
    ...

First, I introduce a new parameter to the constructor, length, which fixes the number of characters of all incoming names to this value. I’ve also added \0 to the character set as the dummy character for padding out short names. Next, the dataset initialization logic was updated: names shorter than the fixed length are simply padded with \0s until the length requirement is met, while names longer than the fixed length are truncated down to size and their last character swapped for a \0. The swap is optional and depends on the task at hand.
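
As a tiny worked example of that logic (the fixed length of 8 and the names are made up for illustration):

length = 8

name = 'John'
name += '\0' * (length - len(name))        # 'John\x00\x00\x00\x00' -- padded to 8

long_name = 'Christopher'
long_name = long_name[:length - 1] + '\0'  # 'Christo\x00' -- truncated to 8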

And if you try to flow this dataset now, you should get what you initially expected: well-formed tensors flowed in the desired batch size. The output below shows a batch of size 2; note that there are three tensors:

$ python run_dataloader.py
[
    tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]],
            [[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]]]),
    tensor([[[0., 1.]],

            [[1., 0.]]]),
    tensor([[[0., 0., 0., ..., 0., 0., 0.],
             [0., 0., 0., ..., 0., 0., 0.],
             [0., 0., 0., ..., 0., 0., 0.],
             ...,
             [1., 0., 0., ..., 0., 0., 0.],
             [1., 0., 0., ..., 0., 0., 0.],
             [1., 0., 0., ..., 0., 0., 0.]],

            [[0., 0., 0., ..., 0., 0., 0.],
             [0., 0., 0., ..., 0., 0., 0.],
             [0., 0., 0., ..., 0., 0., 0.],
             ...,
             [1., 0., 0., ..., 0., 0., 0.],
             [1., 0., 0., ..., 0., 0., 0.],
             [1., 0., 0., ..., 0., 0., 0.]]])
]

Data splitting utility

All these functionalities come built into PyTorch, which is awesome. The question that might arise now is how one might go about making validation or even test sets, and how to do this without cluttering up the code base while keeping it as DRY as possible. One approach for test sets is to supply a different data_root for the training data and the testing data and to keep two dataset variables at runtime (and, additionally, two data loaders), especially if you want to test immediately after training.
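
In code, that approach is just two instantiations pointing at different roots (the two data roots below are placeholders):

import string
from torch.utils.data import DataLoader
from datasets import TESNamesDataset

charset = string.ascii_letters + "-' "
length = 30

train_set = TESNamesDataset('/path/to/tes-names-train/', charset, length)
test_set = TESNamesDataset('/path/to/tes-names-test/', charset, length)

train_loader = DataLoader(train_set, batch_size=50, shuffle=True)
test_loader = DataLoader(test_set, batch_size=50, shuffle=False)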

If, instead, you want to create validation sets from the training set, this can be handled easily using the random_split function from the PyTorch data utilities. The random_split function takes in a dataset and the desired sizes of the subsets in a list, and automatically splits the data in a random order to generate smaller Dataset objects which are immediately usable with the DataLoader. Here’s an example.

import string
from torch.utils.data import DataLoader, random_split
from datasets import TESNamesDataset

data_root = '/home/syafiq/Data/tes-names/'
charset = string.ascii_letters + "-' "
length = 30

dataset = TESNamesDataset(data_root, charset, length)
trainset, valset = random_split(dataset, [15593, 3898])

train_loader = DataLoader(trainset, batch_size=10, shuffle=True, num_workers=2)
val_loader = DataLoader(valset, batch_size=10, shuffle=True, num_workers=2)

for i, batch in enumerate(train_loader):
    print(i, batch)

for i, batch in enumerate(val_loader):
    print(i, batch)

In fact, you can split at arbitrary intervals, which makes this very powerful for k-fold cross-validation sets. The only gripe I have with this method is that you cannot define percentage splits, which is rather annoying. At least the sizes of the sub-datasets are clearly defined from the get-go. Also, note that you need a separate DataLoader for each dataset, which is definitely cleaner than managing two randomly sorted datasets and indexing within a loop.
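
One workaround for percentage splits, sketched below, is to compute the subset sizes from the dataset length before calling random_split (this reuses the dataset and the random_split import from the example above; the 20% validation fraction is just an example):

# derive the split sizes from a desired validation fraction
val_fraction = 0.2
val_size = int(val_fraction * len(dataset))
train_size = len(dataset) - val_size
trainset, valset = random_split(dataset, [train_size, val_size])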

Closing remarks

I hope this article has given you a glimpse into the power of the Dataset and DataLoader utilities in PyTorch. Combined with the clean, Pythonic API, they make coding that much more pleasant while still supplying an efficient way of handling data. I think the PyTorch developers have ease of use well ingrained in their philosophy of development and, after using PyTorch at my workplace, I have not looked back at Keras and TensorFlow much. I have to say I do miss the progress bar and the fit/predict API that come with Keras models, but this is a minor setback, as the latest PyTorch now interfaces with TensorBoard, bringing back a familiar working environment. Nevertheless, at the moment, PyTorch is my go-to for future deep learning projects.

I encourage building your own datasets this way, as it remedied many of the messy programming habits I used to have when managing data. The Dataset utility is a life-saver in complicated situations. I recall having to manage data belonging to a single sample but sourced from three different MATLAB matrix files, which needed to be sliced, normalized and transposed correctly. I could not fathom the effort of managing that without the Dataset and DataLoader combo, especially since the data was massive and there was no easy way to combine it all into a NumPy matrix without crashing the computer.

Finally, check out the PyTorch data utilities documentation page, which has other classes and functions to explore; it’s a small but valuable utility library. You can find the code for the TES names dataset on my GitHub, where I have created an LSTM name predictor in PyTorch in tandem with the dataset. Let me know if this article was helpful or unclear, and whether you would like more of this type of content in the future.