Datasets
libcll provides 9 synthetic complementary-label datasets, including MNIST, KMNIST, FMNIST, CIFAR10, and CIFAR20, imported from PyTorch, and Yeast, Control, Dermatology, and Texture, imported from OpenML. libcll also provides 2 real-world datasets, CLCIFAR10 and CLCIFAR20.
| Dataset | Number of Classes | Input Size | Description |
|---|---|---|---|
| MNIST | 10 | 28 x 28 | Grayscale images of handwritten digits (0 to 9) |
| FMNIST | 10 | 28 x 28 | Grayscale images of fashion items |
| KMNIST | 10 | 28 x 28 | Grayscale images of cursive Japanese characters |
| Yeast | 10 | 8 | Features of different protein localization sites |
| Texture | 11 | 40 | Features of different textures |
| Dermatology | 6 | 130 | Clinical attributes of different diseases |
| Control | 6 | 60 | Synthetic control chart time series |
| Micro ImageNet10 | 10 | 3 x 64 x 64 | Images of 10 classes designed for computer vision research |
| Micro ImageNet20 | 20 | 3 x 64 x 64 | Images of 20 classes designed for computer vision research |
| CIFAR10 | 10 | 3 x 32 x 32 | Colored images of distinct objects |
| CIFAR20 | 20 | 3 x 32 x 32 | Colored images of distinct objects |
| CLMicro ImageNet10 | 10 | 3 x 64 x 64 | Images of 10 classes designed for computer vision research, paired with complementary labels annotated by humans |
| CLMicro ImageNet20 | 20 | 3 x 64 x 64 | Images of 20 classes designed for computer vision research, paired with complementary labels annotated by humans |
| CLCIFAR10 | 10 | 3 x 32 x 32 | Colored images of distinct objects, paired with complementary labels annotated by humans |
| CLCIFAR20 | 20 | 3 x 32 x 32 | Colored images of distinct objects, paired with complementary labels annotated by humans |
Custom complementary-label dataset
We provide a base class for creating complementary-label datasets that are not included in libcll.
To build a custom dataset, inherit from libcll.datasets.CLBaseDataset and override __getitem__() if needed.
```python
import numpy as np
import torch
import torchvision.transforms as transforms
from PIL import Image
from torchvision.datasets import SVHN

from libcll.datasets import CLBaseDataset

# Load SVHN with ordinary labels; the image array has shape (N, 3, 32, 32).
train_set = SVHN(root="./data/svhn", split="train", download=True)
X_train = train_set.data
Y_train = torch.from_numpy(train_set.labels)


class CLSVHN(CLBaseDataset):
    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        # Convert from channel-first (C, H, W) to (H, W, C) so PIL can read it.
        img = Image.fromarray(np.transpose(img, (1, 2, 0)))
        transform = transforms.ToTensor()
        img = transform(img)
        return img, target


# Wrap the raw data in the custom dataset, then generate complementary labels.
train_set = CLSVHN(
    X=X_train,
    Y=Y_train,
    num_classes=10,
)
train_set.gen_complementary_target()
```
Generate complementary labels using transition matrix
libcll provides 4 commonly used types of transition matrix for complementary-label generation through libcll.datasets.utils.get_transition_matrix(transition_matrix, num_classes).
Note that the weak, strong, and noisy transition matrices are designed specifically for datasets with 10 classes.
To generate complementary labels from a desired distribution, pass the transition matrix to CLBaseDataset.gen_complementary_target().
| Transition matrix | Description |
|---|---|
| Uniform | A uniform transition matrix where the diagonal elements are zero and all non-diagonal elements are equal to \(\frac{1}{K - 1}\), \(K\) representing the number of classes. |
| Weak | A biased transition matrix that simulates a milder deviation from the uniform distribution, where the diagonal elements are zero and all non-diagonal elements are set randomly. |
| Strong | A biased transition matrix that simulates a stronger deviation from the uniform distribution, where the diagonal elements are zero and all non-diagonal elements are set randomly. |
| Noisy | A noisy transition matrix where the diagonal elements are not necessarily zero. It equals \((1-\lambda)T_{\text{strong}}+\lambda\frac{1}{K}1_{K}\), \(\lambda\) representing the weight of the noise. |
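```python
from libcll.datasets import CLMNIST
from libcll.datasets.utils import get_transition_matrix

train_set = CLMNIST(root="./data/mnist", train=True)

# Build a "weak" transition matrix sized to the dataset's number of classes.
transition_matrix = get_transition_matrix(
    transition_matrix="weak",
    num_classes=train_set.num_classes,
)

# Sample complementary labels according to the transition matrix.
train_set.gen_complementary_target(transition_matrix)
```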
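To make the table above concrete, here is a minimal NumPy sketch of how a uniform transition matrix can be built, mixed with noise, and used to sample complementary labels. It is not libcll's implementation: the function names are hypothetical, \(1_{K}\) is read as the all-ones \(K \times K\) matrix, and the uniform matrix stands in for \(T_{\text{strong}}\) because the values of the strong matrix are not given here.

```python
import numpy as np


def uniform_transition_matrix(num_classes):
    """Zero diagonal, 1 / (K - 1) on every off-diagonal entry."""
    K = num_classes
    T = np.full((K, K), 1.0 / (K - 1))
    np.fill_diagonal(T, 0.0)
    return T


def mix_with_noise(T, noise_weight):
    """Compute (1 - lambda) * T + lambda * (1 / K) * ones(K, K)."""
    K = T.shape[0]
    return (1.0 - noise_weight) * T + noise_weight * np.ones((K, K)) / K


def sample_complementary_labels(true_labels, T, seed=0):
    """Draw one complementary label per instance from row T[y]."""
    rng = np.random.default_rng(seed)
    K = T.shape[0]
    return np.array([rng.choice(K, p=T[y]) for y in true_labels])


# Hypothetical 4-class example; the uniform matrix stands in for T_strong.
T_uniform = uniform_transition_matrix(4)
T_noisy = mix_with_noise(T_uniform, noise_weight=0.1)

true_labels = np.array([0, 1, 2, 3, 0, 1])
comp_labels = sample_complementary_labels(true_labels, T_uniform)
```

Each row of both matrices sums to one, so row \(y\) can be used directly as a sampling distribution over complementary labels for class \(y\). With a zero diagonal, a sampled complementary label never equals the true label. With the noisy matrix it occasionally can, which is what makes the labels noisy.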
Multiple complementary-label dataset
libcll supports two multiple complementary-label learning settings, selected by the parameter num_cl, which specifies the number of complementary labels for each instance.
When num_cl is set to zero, the number of complementary labels is randomly sampled for each instance before the complementary labels themselves are sampled.
Since each instance has multiple complementary labels, the batch must be decomposed before it is passed to the learner.
We provide two collate functions for the dataloader in libcll.datasets.utils: collate_fn_multi_label duplicates image inputs to align with the target lengths, while collate_fn_one_hot uses one-hot vectors to store multiple labels.
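The two num_cl settings can be illustrated with a small standard-library sketch. This is not libcll's sampling code, and it rests on assumptions the text above does not confirm: complementary labels are drawn uniformly without replacement from the classes other than the true label, and when num_cl is zero the count is drawn uniformly from 1 to K - 1. The function name is hypothetical.

```python
import random


def sample_multiple_complementary_labels(true_label, num_classes, num_cl, rng):
    """Return a list of distinct complementary labels for one instance."""
    # Every class except the true one is a candidate complementary label.
    candidates = [c for c in range(num_classes) if c != true_label]
    if num_cl == 0:
        # Assumption: draw the number of labels uniformly from 1..K-1.
        num_cl = rng.randint(1, len(candidates))
    # Assumption: uniform sampling without replacement.
    return rng.sample(candidates, num_cl)


rng = random.Random(0)
fixed = sample_multiple_complementary_labels(
    true_label=2, num_classes=10, num_cl=3, rng=rng
)
varied = sample_multiple_complementary_labels(
    true_label=2, num_classes=10, num_cl=0, rng=rng
)
```

Here fixed always holds exactly three labels, while varied holds between one and nine, and neither contains the true label 2.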
```python
from torch.utils.data import DataLoader, random_split

from libcll.datasets import CLMNIST
from libcll.datasets.utils import collate_fn_multi_label

train_set = CLMNIST(root="./data/mnist", train=True)
test_set = CLMNIST(root="./data/mnist", train=False)

# Give each training instance 3 complementary labels.
train_set.gen_complementary_target(num_cl=3)

# Read dataset attributes before random_split wraps the dataset in a Subset.
input_dim = train_set.input_dim
num_classes = train_set.num_classes
batch_size = 256

train_set, valid_set = random_split(train_set, [0.9, 0.1])

# Training and validation batches are decomposed by the collate function.
train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=collate_fn_multi_label, shuffle=True, num_workers=4)
valid_loader = DataLoader(valid_set, batch_size=batch_size, collate_fn=collate_fn_multi_label, shuffle=False, num_workers=4)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=4)
```
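The two batch-decomposition strategies described above can be sketched without any library code. The functions below are hypothetical illustrations, not the bodies of collate_fn_multi_label or collate_fn_one_hot. Plain Python lists and strings stand in for tensors, and each batch item is assumed to be an (input, list of complementary labels) pair.

```python
def duplicate_inputs(batch):
    """Repeat each input once per complementary label.

    (x, [a, b]) becomes the two pairs (x, a) and (x, b).
    """
    inputs, targets = [], []
    for x, comp_labels in batch:
        for label in comp_labels:
            inputs.append(x)
            targets.append(label)
    return inputs, targets


def encode_as_vectors(batch, num_classes):
    """Keep one row per input and mark each complementary label with a 1."""
    inputs, targets = [], []
    for x, comp_labels in batch:
        row = [0] * num_classes
        for label in comp_labels:
            row[label] = 1
        inputs.append(x)
        targets.append(row)
    return inputs, targets


# Hypothetical batch: two inputs with a different number of labels each.
batch = [("img_a", [1, 3]), ("img_b", [0])]

dup_inputs, dup_targets = duplicate_inputs(batch)
vec_inputs, vec_targets = encode_as_vectors(batch, num_classes=4)
```

Duplication produces one (input, label) pair per complementary label, so the batch grows with the number of labels and a learner written for single complementary labels can consume it unchanged. The vector encoding keeps the batch size fixed but requires a loss that reads a 0/1 vector per instance.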