Datasets

libcll includes 9 synthetic complementary-label datasets: MNIST, KMNIST, FMNIST, CIFAR10, and CIFAR20, imported from PyTorch, and Yeast, Control, Dermatology, and Texture, imported from OpenML. libcll also provides 2 real-world datasets, CLCIFAR10 and CLCIFAR20.

Dataset	Number of Classes	Input Size	Description
MNIST	10	28 x 28	Grayscale images of handwritten digits (0 to 9)
FMNIST	10	28 x 28	Grayscale images of fashion items
KMNIST	10	28 x 28	Grayscale images of cursive Japanese characters
Yeast	10	8	Features of different localization sites of protein
Texture	11	40	Features of different textures
Dermatology	6	130	Clinical attributes of different diseases
Control	6	60	Synthetic Control Chart Time Series
Micro ImageNet10	10	3 x 64 x 64	Contains images of 10 classes designed for computer vision research
Micro ImageNet20	20	3 x 64 x 64	Contains images of 20 classes designed for computer vision research
CIFAR10	10	3 x 32 x 32	Colored images of distinct objects
CIFAR20	20	3 x 32 x 32	Colored images of distinct objects
CLMicro ImageNet10	10	3 x 64 x 64	Contains images of 10 classes designed for computer vision research, paired with complementary labels annotated by humans
CLMicro ImageNet20	20	3 x 64 x 64	Contains images of 20 classes designed for computer vision research, paired with complementary labels annotated by humans
CLCIFAR10	10	3 x 32 x 32	Colored images of distinct objects paired with complementary labels annotated by humans
CLCIFAR20	20	3 x 32 x 32	Colored images of distinct objects paired with complementary labels annotated by humans

Custom complementary-label dataset

We provide a base class for creating complementary-label datasets that are not included in libcll. Users can define a custom dataset by inheriting from libcll.datasets.CLBaseDataset and overriding __getitem__() if needed.

import numpy as np
import torch
import torchvision.transforms as transforms
from PIL import Image
from torchvision.datasets import SVHN

from libcll.datasets import CLBaseDataset

# Load the ordinary-label SVHN training split; images are stored channel-first (N, 3, 32, 32).
train_set = SVHN(root="./data/svhn", split="train", download=True)
X_train = train_set.data
Y_train = torch.from_numpy(train_set.labels)


class CLSVHN(CLBaseDataset):
    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        # Convert from (C, H, W) to (H, W, C) so that PIL can read the array.
        img = Image.fromarray(np.transpose(img, (1, 2, 0)))
        transform = transforms.ToTensor()
        img = transform(img)
        return img, target


# Wrap the raw arrays in the custom dataset, then generate complementary labels.
train_set = CLSVHN(
    X=X_train,
    Y=Y_train,
    num_classes=10,
)
train_set.gen_complementary_target()

Generate complementary labels using transition matrix

libcll provides 4 commonly used types of transition matrix for complementary-label generation through libcll.datasets.utils.get_transition_matrix(transition_matrix, num_classes). Note that the weak, strong, and noisy transition matrices are designed specifically for datasets with 10 classes. Users can generate complementary labels that follow a desired distribution by passing a transition matrix to CLBaseDataset.gen_complementary_target().

Transition matrix	Description
`uniform`	a uniform transition matrix where the diagonal elements are zero and all non-diagonal elements are equal to \(\frac{1}{K - 1}\), where \(K\) is `num_classes`.
`weak`	a biased transition matrix that simulates a milder deviation from the uniform distribution, where the diagonal elements are zero and all non-diagonal elements are set randomly.
`strong`	a biased transition matrix that simulates a stronger deviation from the uniform distribution, where the diagonal elements are zero and all non-diagonal elements are set randomly.
`noisy`	a noisy transition matrix whose diagonal elements are not necessarily zero; it equals \((1-\lambda)T_{\text{strong}}+\lambda\frac{1}{K}1_{K}\), where \(\lambda\) is the weight of the noise.

from libcll.datasets import CLMNIST
from libcll.datasets.utils import get_transition_matrix

train_set = CLMNIST(root="./data/mnist", train=True)
transition_matrix = get_transition_matrix(
    transition_matrix="weak",
    num_classes=train_set.num_classes
)
train_set.gen_complementary_target(transition_matrix)
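
To make the definitions in the table concrete, the following standard-library sketch builds a uniform matrix, mixes it with uniform noise, and samples a complementary label from one row. It is not libcll's implementation: the helper names `uniform_matrix`, `noisy_matrix`, and `sample_complementary_label` are hypothetical, the uniform matrix stands in for \(T_{\text{strong}}\), and \(1_{K}\) is assumed to denote the \(K \times K\) all-ones matrix.

```python
import random


def uniform_matrix(num_classes):
    """Zero diagonal, 1 / (K - 1) everywhere else, so each row sums to 1."""
    k = num_classes
    return [[0.0 if i == j else 1.0 / (k - 1) for j in range(k)] for i in range(k)]


def noisy_matrix(base, noise_weight):
    """(1 - lambda) * base + lambda * (1 / K) * ones, following the `noisy` row.

    Assumption: `base` is any row-stochastic K x K matrix; libcll uses its
    `strong` matrix here, which this sketch does not reproduce.
    """
    k = len(base)
    return [
        [(1.0 - noise_weight) * base[i][j] + noise_weight / k for j in range(k)]
        for i in range(k)
    ]


def sample_complementary_label(matrix, true_label, rng):
    """Draw one complementary label from the row indexed by the true label."""
    row = matrix[true_label]
    return rng.choices(range(len(row)), weights=row, k=1)[0]


rng = random.Random(0)
T = uniform_matrix(4)
T_noisy = noisy_matrix(T, noise_weight=0.1)

# With a zero diagonal, the true label (2) can never be drawn as its own complement.
labels = [sample_complementary_label(T, true_label=2, rng=rng) for _ in range(1000)]
```

With the noisy matrix, the diagonal entries become \(\lambda / K\) instead of zero, so a sampled "complementary" label can occasionally coincide with the true label, which is exactly the label noise that setting is meant to simulate.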

Multiple complementary-label dataset

libcll supports two multiple complementary-label learning settings through the parameter num_cl, which specifies the number of complementary labels for each instance. When num_cl is set to zero, the number of complementary labels is first sampled at random for each instance, and the complementary labels themselves are sampled afterwards.
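
The two settings can be illustrated with a small standard-library sketch. The function name `sample_multiple_complementary_labels` is hypothetical, and the uniform draw of the label count when num_cl is zero is an assumption made only for illustration; libcll may sample the count from a different distribution.

```python
import random


def sample_multiple_complementary_labels(true_label, num_classes, num_cl, rng):
    """Return distinct complementary labels for a single instance.

    num_cl > 0:  draw exactly num_cl labels.
    num_cl == 0: draw the number of labels first, then the labels.
                 The uniform draw over 1..K-1 is an illustrative assumption.
    """
    # Every class except the true one is a valid complementary label.
    candidates = [c for c in range(num_classes) if c != true_label]
    if num_cl == 0:
        num_cl = rng.randint(1, len(candidates))
    # Sample without replacement so that no label is repeated.
    return rng.sample(candidates, num_cl)


rng = random.Random(0)

# Fixed setting: three complementary labels per instance.
fixed = sample_multiple_complementary_labels(true_label=3, num_classes=10, num_cl=3, rng=rng)

# Random setting: the number of labels varies from instance to instance.
varied = [sample_multiple_complementary_labels(3, 10, 0, rng) for _ in range(5)]
```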

Since each instance has multiple complementary labels, each batch must be decomposed before it is passed to the learner. We provide two collate functions in libcll.datasets.utils for the dataloader: collate_fn_multi_label duplicates image inputs to align with target lengths, while collate_fn_one_hot uses one-hot vectors to store multiple labels.

from torch.utils.data import random_split, DataLoader
from libcll.datasets import CLMNIST
from libcll.datasets.utils import collate_fn_multi_label

train_set = CLMNIST(root="./data/mnist", train=True)
test_set = CLMNIST(root="./data/mnist", train=False)
train_set.gen_complementary_target(num_cl=3)
input_dim = train_set.input_dim
num_classes = train_set.num_classes

batch_size = 256
train_set, valid_set = random_split(train_set, [0.9, 0.1])
train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=collate_fn_multi_label, shuffle=True, num_workers=4)
valid_loader = DataLoader(valid_set, batch_size=batch_size, collate_fn=collate_fn_multi_label, shuffle=False, num_workers=4)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=4)
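
The difference between the two collate strategies can be sketched with plain Python lists. The functions below (`collate_multi_label_sketch` and `collate_one_hot_sketch`) are hypothetical stand-ins written for illustration; the real functions in libcll.datasets.utils operate on tensors and may differ in signature and return type.

```python
def collate_multi_label_sketch(batch):
    """Repeat each input once per complementary label, giving flat pairs."""
    inputs, targets = [], []
    for x, labels in batch:
        for label in labels:
            inputs.append(x)
            targets.append(label)
    return inputs, targets


def collate_one_hot_sketch(batch, num_classes):
    """Keep one row per input and mark every complementary label in a 0/1 vector."""
    inputs, targets = [], []
    for x, labels in batch:
        vec = [0] * num_classes
        for label in labels:
            vec[label] = 1
        inputs.append(x)
        targets.append(vec)
    return inputs, targets


# Two instances: the first has two complementary labels, the second has one.
batch = [("img_a", [1, 3]), ("img_b", [0])]

flat_inputs, flat_targets = collate_multi_label_sketch(batch)
hot_inputs, hot_targets = collate_one_hot_sketch(batch, num_classes=4)
```

The duplicating strategy lets a learner written for single complementary labels consume the batch unchanged, at the cost of a larger effective batch; the one-hot strategy keeps the batch size fixed but requires a loss that accepts a label vector per instance.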