Building efficient custom Datasets in PyTorch
May 15, 2019
PyTorch has been making the rounds in my circles as of late, and I had to try it out despite being comfortable with Keras and TensorFlow for a while. Surprisingly, I found it quite refreshing and likable, especially as PyTorch features a Pythonic API, a more opinionated programming pattern and a good set of built-in utility functions. One feature I particularly enjoy is the ability to easily craft a custom Dataset object which can then be used with the built-in DataLoader to feed data when training a model.
In this article, I will be exploring the PyTorch Dataset object from the ground up, with the objective of making a dataset that handles text files and showing how one could go about optimizing the pipeline for a certain task. We start by going over the basics of the Dataset utility with a toy example and work our way up to the real task. Specifically, we want to create a pipeline that feeds character first names from The Elder Scrolls (TES) series, along with each name's race and gender, as one-hot tensors. You can find this dataset on my website.
Basics of the Dataset class
PyTorch gives you the freedom to pretty much do anything with the Dataset class so long as you override two of the subclass functions:
- the __len__ function which returns the size of the dataset, and
- the __getitem__ function which returns a sample from the dataset given an index.
The size of the dataset can sometimes be a grey area, but it is equal to the number of samples that you have in the entire dataset. So if you have 10,000 words (or data points, images, sentences, etc.) in your dataset, the __len__ function should return 10,000.
A minimal working example
Let’s first mock a simple dataset by creating a Dataset of all numbers from 1 to 1000. We’ll aptly name this the NumbersDataset.
from torch.utils.data import Dataset

class NumbersDataset(Dataset):
    def __init__(self):
        self.samples = list(range(1, 1001))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

if __name__ == '__main__':
    dataset = NumbersDataset()
    print(len(dataset))
    print(dataset[100])
    print(dataset[122:361])
$ python datasets.py
1000
101
[123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
...
Pretty simple, right? First, when we initialize the NumbersDataset, we immediately create a list called samples which will store all the numbers between 1 and 1000. The name samples is arbitrary, so feel free to use whatever name you feel comfortable with. The overridden functions are self-explanatory (I hope!) and operate on the list which was initialized in the constructor. If you run the file, you will see the values 1000, 101 and a list of numbers from 123 to 361 printed out, which are the length of the dataset, the value of the data at index 100 and the slice of the dataset from index 122 up to (but not including) index 361, respectively.
Extending the dataset
Let’s extend this dataset so that it can store all whole numbers in an interval from low (inclusive) to high (exclusive).
from torch.utils.data import Dataset

class NumbersDataset(Dataset):
    def __init__(self, low, high):
        self.samples = list(range(low, high))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

if __name__ == '__main__':
    dataset = NumbersDataset(2821, 8295)
    print(len(dataset))
    print(dataset[100])
    print(dataset[122:361])
$ python datasets.py
5474
2921
[2943, 2944, 2945, 2946, 2947, 2948, 2949, 2950, 2951, 2952, 2953,
2954, 2955, 2956, 2957, 2958, 2959, 2960, 2961, 2962, 2963, 2964,
...
The code above should print 5474, 2921 and the list of numbers from 2943 to 3181. By editing the constructor, we can now set arbitrary low and high values of the dataset to our heart’s content. This simple change shows what kind of mileage we can get from the PyTorch Dataset class. We can generate multiple different datasets and play around with the values without having to think about coding a new class or creating many hard-to-follow matrices as we would in NumPy, for example.
Reading data from files
Let’s take this idea of extending the functionality of the Dataset class much further. PyTorch interfaces with the Python standard libraries quite gracefully meaning that you do not have to feel afraid of integrating features that you already know and love. Here, we will be
- creating a brand new Dataset using basic Python I/O and some static files,
- collecting the TES character names (dataset available on my website), which are separated into race folders and gender files, to populate the samples list,
- keeping track of each name's race and gender by storing a tuple in the samples list rather than just the names themselves.
For reference, the TES character names dataset has the following directory structure:
.
|-- Altmer/
| |-- Female
| `-- Male
|-- Argonian/
| |-- Female
| `-- Male
... (truncated for brevity)
`-- Redguard/
|-- Female
`-- Male
Each of the files contains TES character names separated by newlines so we must read each file, line by line, to capture all of the names of the characters for each race and gender.
import os
from torch.utils.data import Dataset

class TESNamesDataset(Dataset):
    def __init__(self, data_root):
        self.samples = []

        for race in os.listdir(data_root):
            race_folder = os.path.join(data_root, race)

            for gender in os.listdir(race_folder):
                gender_filepath = os.path.join(race_folder, gender)

                with open(gender_filepath, 'r') as gender_file:
                    for name in gender_file.read().splitlines():
                        self.samples.append((race, gender, name))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

if __name__ == '__main__':
    dataset = TESNamesDataset('/home/syafiq/Data/tes-names/')
    print(len(dataset))
    print(dataset[420])
Let’s go through the code: we first create an empty samples list and populate it by going through each race folder and gender file and reading each file for the names. The race, gender, and names are then stored in a tuple and appended into the samples list. Running the file should print 19491 and ('Bosmer', 'Female', 'Gluineth') (but may differ from one computer to another). Let’s have a look at what it would look like if we sliced the dataset into a batch:
# change the main function to the following:
if __name__ == '__main__':
    dataset = TESNamesDataset('/home/syafiq/Data/tes-names/')
    print(dataset[10:60])
$ python datasets.py
[('Bosmer', 'Female', 'Agafos'), ('Bosmer', 'Female', 'Agalrin'),
('Bosmer', 'Female', 'Agilruin'), ('Bosmer', 'Female', 'Aglaril'),
...
As you might expect, it works exactly as a typical list would. To sum up this section, we have just introduced standard Python I/O into the PyTorch dataset and we did not need any other special wrappers or helpers, just pure Python. In fact, we can also include other libraries like NumPy or Pandas and, with a bit of clever manipulation, have them play well with PyTorch. Let’s stop there for now and look at how to efficiently iterate through the dataset in the case of a training loop.
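For instance, here is a minimal sketch of what a pandas-backed dataset could look like; the names.csv file and its column names are hypothetical, used only to illustrate the idea:

import pandas as pd
from torch.utils.data import Dataset

class CSVNamesDataset(Dataset):
    def __init__(self, csv_path):
        # Read the whole CSV into memory once; pandas handles the parsing.
        self.frame = pd.read_csv(csv_path)

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        # Return one row as a (race, gender, name) tuple, just like before.
        row = self.frame.iloc[idx]
        return row['race'], row['gender'], row['name']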
Flowing data with the DataLoader
While the Dataset class is a nice way of containing data systematically, it seems that in a training loop, we will need to index or slice the dataset’s samples list. This is no better than what we would do for a typical list or NumPy matrix. Rather than going down that route, PyTorch supplies another utility function called the DataLoader which acts as a data feeder for a Dataset object. The parallel I see here is the data generator flow function in Keras, if you are familiar with that. The DataLoader takes a Dataset object (and, therefore, any subclass extending it) and several other optional parameters (listed on the PyTorch DataLoader docs). Among the parameters, we have the option of shuffling the data, determining the batch size and the number of workers to load data in parallel. Here is a simple example of flowing through the TESNamesDataset in an enumerate loop.
# change the main function to the following:
if __name__ == '__main__':
    from torch.utils.data import DataLoader

    dataset = TESNamesDataset('/home/syafiq/Data/tes-names/')
    dataloader = DataLoader(dataset, batch_size=50, shuffle=True, num_workers=2)

    for i, batch in enumerate(dataloader):
        print(i, batch)
As you watch the torrent of batches get printed out, you might notice that each batch is a list of three tuples: a bunch of races in the first tuple, the genders in the next and the names in the last.
$ python datasets.py
...
389 [('Argonian', 'Dunmer', 'Breton', ...),
('Female', 'Female', 'Female', ...),
('Seed-Neeus', 'Vaynonah', 'Amarie', ...)]
Hang on, that is not how it looked when we sliced our dataset earlier! What’s going on here? Well, as it turns out, the DataLoader loads the data in a systematic way such that we stack data vertically instead of horizontally. This is particularly useful for flowing batches of tensors, as tensors stack vertically (i.e. in the first dimension) to form batches. The DataLoader also handles the shuffling of data for you, so there’s no need to shuffle matrices or keep track of indices when feeding data.
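If the transposition seems mysterious, the effect is roughly the same as zipping the sampled tuples together; this is only an illustration of the idea, not the DataLoader's actual implementation:

samples = [('Bosmer', 'Female', 'Agafos'),
           ('Dunmer', 'Female', 'Vaynonah'),
           ('Breton', 'Female', 'Amarie')]

# Zipping the per-sample tuples groups all races together, all genders
# together and all names together -- the "vertical" layout seen above.
races, genders, names = zip(*samples)
print(races)    # ('Bosmer', 'Dunmer', 'Breton')
print(genders)  # ('Female', 'Female', 'Female')
print(names)    # ('Agafos', 'Vaynonah', 'Amarie')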
Flowing tensors and other types
To explore further how different types of data are flowed by the DataLoader, we will update the numbers dataset we mocked earlier to yield a pair of tensors: a tensor of the 4 successor values of each number in the dataset, and the same successor tensor but with some random noise added to it. To throw the DataLoader a curveball, we will also want to return the number itself, not as a tensor but as a Python string. In total, the __getitem__ function returns three heterogeneous data items in a tuple.
from torch.utils.data import Dataset
import torch

class NumbersDataset(Dataset):
    def __init__(self, low, high):
        self.samples = list(range(low, high))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        n = self.samples[idx]
        successors = torch.arange(4).float() + n + 1
        noisy = torch.randn(4) + successors
        return str(n), successors, noisy

if __name__ == '__main__':
    from torch.utils.data import DataLoader

    dataset = NumbersDataset(100, 120)
    dataloader = DataLoader(dataset, batch_size=10, shuffle=True)
    print(next(iter(dataloader)))
Note that we have not changed the dataset constructor but rather the __getitem__ function. Good practice for PyTorch datasets is to keep in mind how the dataset will scale with more and more samples; therefore, we do not want to store too many tensors in memory in the Dataset object at runtime. Instead, we will form the tensors as we iterate through the samples list, trading off a bit of speed for memory. I will explain how this is useful in the following sections.
$ python datasets.py
[
('109', '100', '114', '107', ...),
tensor([[110., 111., 112., 113.],
[101., 102., 103., 104.],
...
),
tensor([[109.9972, 110.9166, 111.8880, 112.2871],
[100.3679, 101.2552, 102.7933, 103.5752],
...
)
]
Looking at the output above, although our new __getitem__ function returns a monstrous tuple of a string and tensors, the DataLoader is able to recognize the data and stack them accordingly. The stringified numbers are collected into a tuple whose length equals the loader’s configured batch size. For the two tensors, the DataLoader vertically stacks them into a tensor of size 10x4. This is because we configured the batch size to be 10 and the two tensors returned from the __getitem__ function are of size 4.
In general, the loader will try stacking batches of 1-dimensional tensors into 2-dimensional tensors, batches of 2-dimensional tensors into 3-dimensional tensors, and so on. At this point, I implore you to realize the life-changing impact this has on traditional data handling in other machine learning libraries and how clean the solution looks. It is quite incredible! If you are not sharing my sentiments, well, at least you now know one other method that you can have in your toolbox.
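To convince yourself of this stacking behaviour outside of the DataLoader, torch.stack, which is essentially what the default batching does under the hood, shows the same pattern; a quick sanity check:

import torch

# Two 1-dimensional tensors of size 4 stack into a 2-dimensional tensor of size 2x4.
one_dim = torch.stack([torch.zeros(4), torch.ones(4)])
print(one_dim.shape)  # torch.Size([2, 4])

# Two 2-dimensional tensors of size 3x4 stack into a 3-dimensional tensor of
# size 2x3x4, which is what happens when the DataLoader batches matrix-shaped samples.
two_dim = torch.stack([torch.zeros(3, 4), torch.ones(3, 4)])
print(two_dim.shape)  # torch.Size([2, 3, 4])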
Completing the TES dataset code
Let’s get back to the TES names dataset. It seems like the initialization function is a little dirty (at least by my standards) and there should really be a way to make the code look better. Remember that I said the PyTorch API is Pythonic? Well, there is no stopping you from declaring other utility functions in your dataset or even making internal functions for initialization. To clean up the TES names dataset code, we will update the TESNamesDataset code to achieve the following:
- update the constructor to include a character set,
- create an internal function to initialize the dataset,
- create a utility function that converts nominal variables into one-hot tensors,
- create a utility function that converts a sample into a set of three one-hot tensors representing the race, gender, and name.
To enable the utility functions to work well, we will get some help from the scikit-learn library to encode nominal values (i.e. our race, gender, and name data). Specifically, we will need the LabelEncoder class. As we are making a large update to the code, I will explain the changes in the next several subsections.
import os
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset
import torch

class TESNamesDataset(Dataset):
    def __init__(self, data_root, charset):
        self.data_root = data_root
        self.charset = charset
        self.samples = []
        self.race_codec = LabelEncoder()
        self.gender_codec = LabelEncoder()
        self.char_codec = LabelEncoder()
        self._init_dataset()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        race, gender, name = self.samples[idx]
        return self.one_hot_sample(race, gender, name)

    def _init_dataset(self):
        races = set()
        genders = set()

        for race in os.listdir(self.data_root):
            race_folder = os.path.join(self.data_root, race)
            races.add(race)

            for gender in os.listdir(race_folder):
                gender_filepath = os.path.join(race_folder, gender)
                genders.add(gender)

                with open(gender_filepath, 'r') as gender_file:
                    for name in gender_file.read().splitlines():
                        self.samples.append((race, gender, name))

        self.race_codec.fit(list(races))
        self.gender_codec.fit(list(genders))
        self.char_codec.fit(list(self.charset))

    def to_one_hot(self, codec, values):
        value_idxs = codec.transform(values)
        return torch.eye(len(codec.classes_))[value_idxs]

    def one_hot_sample(self, race, gender, name):
        t_race = self.to_one_hot(self.race_codec, [race])
        t_gender = self.to_one_hot(self.gender_codec, [gender])
        t_name = self.to_one_hot(self.char_codec, list(name))
        return t_race, t_gender, t_name

if __name__ == '__main__':
    import string

    data_root = '/home/syafiq/Data/tes-names/'
    charset = string.ascii_letters + "-' "
    dataset = TESNamesDataset(data_root, charset)
    print(len(dataset))
    print(dataset[420])
The constructor-init split
There is quite a bit of change here, so let’s go through it bit by bit. Starting with the constructor, you may have noticed that it is clear of any file processing logic. We have moved this logic into the _init_dataset function and cleaned up the constructor. Additionally, we have added some empty codecs to convert nominal values from the original string into an integer and back. The samples list is also just an empty list, which will be populated in the _init_dataset function. The constructor also takes in a new argument called charset. As the name suggests, it is just a string of characters which will enable the char_codec to convert characters into integers.
The file processing functionality has been augmented with a couple of sets to capture the unique nominal values, like race and gender, as we iterate through the folders. This can be useful if you don’t have well-structured datasets; for example, if the Argonians had another set of names which are gender agnostic, we would have a file called “Unknown” and it would be added to the set of genders regardless of whether an “Unknown” gender exists for other races. After all the names have been stored, we initialize the codecs by fitting them to the unique sets of races and genders and to the characters in our character set.
Utility functions
There are two utility functions that were added: to_one_hot and one_hot_sample. to_one_hot uses the internal codecs of the dataset to first convert a list of values into a list of integers before applying a seemingly out-of-place torch.eye function. This is actually a neat hack to quickly convert a list of integers into one-hot vectors. The torch.eye function creates an identity matrix of an arbitrary size which has a value of 1 on its diagonal. If you index the matrix rows, you get a row vector with the value of 1 at that index, which is the definition of a one-hot vector!
+----------+  transform()        +---+
| Imperial |                     | 2 |      torch.eye(4)   [0 0 1 0]
| Khajiit  |      +-------+      | 3 |      (1 0 0 0)      [0 0 0 1]
| Dunmer   | ---> | Race  | ---> | 1 | ---> (0 1 0 0) ---> [0 1 0 0]
| Imperial |      | Codec |      | 2 |      (0 0 1 0)  |   [0 0 1 0]
| Khajiit  |      +-------+      | 3 |      (0 0 0 1)  |   [0 0 0 1]
| Imperial |                     | 2 |                 |   [0 0 1 0]
+----------+                     +---+            (indexing)
Because we need to convert three values into tensors, we call the to_one_hot function with each of the codecs and the corresponding data. This is composed in the one_hot_sample function, which converts a single sample into a tuple of tensors. The race and gender get converted into 2-dimensional tensors, which are really expanded row vectors. The name gets converted into a 2-dimensional tensor too, but one comprising a one-hot row vector for each character of the name.
The __getitem__ call
Finally, the __getitem__ function has been updated to only call the one_hot_sample function given the race, gender, and name of a sample. Notice that we do not need to prepare the tensors beforehand in the samples list but rather the tensors are formed only when the __getitem__ function is called, which is when the DataLoader flows the data. This makes the dataset very scalable when you have hundreds of thousands of samples to flow during training.
You can imagine how this dataset could be used in a computer vision training scenario. The dataset would hold a list of filenames and the path to the directory of images, leaving the __getitem__ function to read only the image files and convert them into tensors just in time for training. This can be made to run much faster by providing an appropriate number of workers to the DataLoader to process multiple image files in parallel. The PyTorch data loading tutorial covers image datasets and loaders in more detail and complements datasets with the torchvision package (which is often installed alongside PyTorch) for computer vision purposes, making image manipulation pipelines (like whitening, normalization, random shifting, etc.) very easy to construct.
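As a rough sketch of that idea (the directory layout, PIL usage and transform here are my own assumptions, not code from the official tutorial), such a dataset might look like this:

import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class FolderImageDataset(Dataset):
    def __init__(self, image_dir):
        self.image_dir = image_dir
        # Only the list of filenames is kept in memory, not the pixels.
        self.filenames = sorted(os.listdir(image_dir))
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        path = os.path.join(self.image_dir, self.filenames[idx])
        with Image.open(path) as img:
            # The file is read and decoded here, just in time,
            # possibly inside one of the DataLoader's worker processes.
            return self.to_tensor(img.convert('RGB'))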
Back to this article. The dataset checks out and it looks like we are ready to use this for training…
… but we are not
If we try to flow the data using the DataLoader with a batch size greater than 1, we will be greeted with an error:
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 7 and 9 in dimension 1 at /opt/conda/conda-bld/pytorch-cpu_1544218667092/work/aten/src/TH/generic/THTensorMoreMath.cpp:1333
The astute among you might have seen this coming but the reality is that text data rarely come in fixed lengths from one sample to another. As a result, the DataLoader tries to batch multiple name tensors of different lengths, which is impossible in the tensor format, as it would also be in NumPy arrays. To illustrate this problem, consider the case when we have names like ‘John’ and ‘Steven’ to stack together into a single one-hot matrix. ‘John’ translates into a 2-dimensional tensor of size 4xC and ‘Steven’ translates into a 2-dimensional tensor of size 6xC where C is the length of the character set. The DataLoader tries to batch the names into a 3-dimensional tensor 2x?xC (think of stacking tensors of sizes 1x4xC and 1x6xC). Due to the mismatch in the second dimension, the DataLoader raises an error as it could not proceed.
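You can reproduce the failure directly with torch.stack, which is essentially what the default batching does; the character-set size below is made up purely for illustration:

import torch

C = 10                      # assumed character-set length, for illustration only
john = torch.zeros(4, C)    # 'John'   -> 4 one-hot rows
steven = torch.zeros(6, C)  # 'Steven' -> 6 one-hot rows

# Stacking requires identical shapes, so this raises a RuntimeError
# about mismatched sizes, just like the DataLoader error above.
try:
    torch.stack([john, steven])
except RuntimeError as err:
    print(err)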
Possible solutions
To remedy this, here are two approaches and each has its pros and cons.
- Setting the batch size to 1 so you will never encounter the error. With a batch size of one, the singleton tensor does not get stacked with any other tensors of (possibly) different lengths. However, this method suffers when performing training, as neural networks converge very slowly with single-sample gradient descent. On the other hand, it is good for fast test-time data loading or sandboxing when batches are not important.
- Fixing a uniform name length via padding with null characters or truncating. Truncating long names or padding short names with dummy characters allows all names to be well-formed and have the same output tensor size, making batching possible. The downside is that, depending on the task at hand, dummy characters may be detrimental as they are not representative of the original data.
I’m going with the second option for the purposes of this article, to show that you need very few changes to the overall data pipeline to achieve this. Note that this also works for any sequential data of differing lengths (although there are various methods to pad data; see the options in NumPy and in PyTorch). In my use case, I have opted to pad the names with null characters, so I updated the constructor and _init_dataset functions:
...
    def __init__(self, data_root, charset, length):
        self.data_root = data_root
        self.charset = charset + '\0'
        self.length = length
        ...

                with open(gender_filepath, 'r') as gender_file:
                    for name in gender_file.read().splitlines():
                        if len(name) < self.length:
                            name += '\0' * (self.length - len(name))
                        else:
                            name = name[:self.length-1] + '\0'
                        self.samples.append((race, gender, name))
...
First, I introduce a new parameter to the constructor, length, which fixes the number of characters of all incoming names to this value. I’ve also added \0 into the character set as the dummy character for padding out the short names. Next, the dataset initialization logic was updated. Names which fall short of the fixed length are simply padded with \0s until the length requirement is met. Names which exceed the fixed length are truncated down to size, with the last character swapped for a \0. The swapping is optional and depends on the task at hand.
And if you try to flow this dataset now, you should get what you initially expected: a well-formed tensor flowed in the desired batch size. The output below shows a batch of size 2, but note that there are three tensors:
$ python run_dataloader.py
[
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]],
[[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]]]),
tensor([[[0., 1.]],
[[1., 0.]]]),
tensor([[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.]]])
]
- the stacked race tensor, a one-hot encoded version of one of the ten races,
- the stacked gender tensors, also one-hot encoded for each of the two genders present in the dataset, and
- the stacked name tensors, which should be the length of charset in the last dimension, the length of the name (after fixing for size) in the second dimension, and the batch size in the first dimension.
Data splitting utility
All these functionalities come built into PyTorch, which is awesome. The question that might arise now is how one might approach making validation or even testing sets, and how to do so without cluttering up the code base while keeping it as DRY as possible. One approach for testing sets is to supply a different data_root for the training data and the testing data and keep two dataset variables at runtime (and, additionally, two data loaders), especially if you want to test immediately after training.
If, instead, you want to create validation sets from the training set, this can be handled easily using the random_split function from the PyTorch data utilities. The random_split function takes in a dataset and the desired sizes of the subsets in a list, and automatically splits the data in a random order to generate smaller Dataset objects which are immediately usable with the DataLoader. Here’s an example.
import string
from torch.utils.data import DataLoader, random_split
from datasets import TESNamesDataset

data_root = '/home/syafiq/Data/tes-names/'
charset = string.ascii_letters + "-' "
length = 30

dataset = TESNamesDataset(data_root, charset, length)
trainset, valset = random_split(dataset, [15593, 3898])
train_loader = DataLoader(trainset, batch_size=10, shuffle=True, num_workers=2)
val_loader = DataLoader(valset, batch_size=10, shuffle=True, num_workers=2)

for i, batch in enumerate(train_loader):
    print(i, batch)

for i, batch in enumerate(val_loader):
    print(i, batch)
In fact, you can split at arbitrary intervals, which makes this very powerful for folded cross-validation sets. The only gripe I have with this method is that you cannot define percentage splits, which is rather annoying. At least the sizes of the sub-datasets are clearly defined from the get-go. Also, note that you need separate DataLoaders for each dataset, which is definitely cleaner than managing two randomly sorted datasets and indexing within a loop.
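If you prefer to think in percentages, a tiny helper can translate a fraction into the absolute sizes that random_split expects; this is just a convenience sketch, not part of the PyTorch API:

from torch.utils.data import random_split

def fractional_split(dataset, val_fraction=0.2):
    # Convert the fraction into absolute sizes; the remainder goes to training.
    val_size = int(len(dataset) * val_fraction)
    train_size = len(dataset) - val_size
    return random_split(dataset, [train_size, val_size])

# e.g. trainset, valset = fractional_split(dataset, val_fraction=0.2)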
Closing remarks
I hope this article has given you a glimpse into the power of the Dataset and DataLoader utilities in PyTorch. Combined with the clean, Pythonic API, it makes coding just that much more pleasant while still supplying an efficient way of handling data. I think the PyTorch developers have ease of use well ingrained into their philosophy of development and, after using PyTorch at my workplace, I have since not looked back to using Keras and TensorFlow much. I have to say I do miss the progress bar and fit/predict API that come with Keras models, but this is a minor setback, as the latest PyTorch now interfaces with TensorBoard, bringing back a familiar working environment. Nevertheless, at the moment, PyTorch is my go-to for future deep learning projects.
I encourage building your own datasets this way, as it remedied much of the messy programming habits I used to have when managing data. The Dataset utility is a life-saver in complicated situations. I recall having to manage data belonging to a single sample but sourced from three different MATLAB matrix files, which needed to be sliced, normalized and transposed correctly. I could not fathom the effort to manage that without the Dataset and DataLoader combo, especially since the data was massive and there was no easy way to combine it all into a NumPy matrix without crashing a computer.
Finally, check out the PyTorch data utilities documentation page, which has other classes and functions to explore; it’s a small but valuable utility library. You can find the code for the TES names dataset on my GitHub, where I have created an LSTM name predictor in PyTorch in tandem with the dataset. Let me know if this article was helpful or unclear, and whether you would like more of this type of content in the future.