TCE-2023-08-1046.R1_DATASETS

Citation Author(s):
Asitha Kottahachchi kankanamge Don
RMIT University
Submitted by:
Asitha Kottahac...
Last updated:
Sat, 12/09/2023 - 00:37
DOI:
10.21227/5cr5-0204
Data Format:
License:

Abstract 

In medical applications, machine learning often grapples with limited training data. Classical self-supervised deep learning techniques have helped in this domain, but they have yet to achieve the accuracy required for medical use. Recently, quantum algorithms have shown promise in handling complex patterns with small datasets. To address the limited-data challenge, this study presents a novel solution that combines self-supervised learning with Variational Quantum Classifiers (VQC) and uses Principal Component Analysis (PCA) for dimensionality reduction. This approach ensures generalization even with a small training dataset while preserving data privacy, a vital consideration in medical applications. PCA reduces the feature dimension enough for the VQC to operate with just 2 qubits, which keeps it within current quantum hardware limitations while gaining an advantage over classical methods. In this study, four medical datasets (PneumoniaMNIST, BreastMNIST, PathMNIST, ChestMNIST) and two non-medical datasets (Hymenoptera Ants & Bees, Kaggle Cats and Dogs Dataset) were employed. During the self-supervised learning stage, we applied supervised contrastive learning to these datasets, producing a 2048-feature dataset for each. Each 2048-feature dataset then underwent data preprocessing and principal component analysis, yielding a corresponding two-feature dataset. The complete collection comprises six 2048-feature datasets and six two-feature datasets. The two-feature datasets were used with the variational quantum classifier.
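The reduction from 2048 features to two features can be illustrated with a minimal sketch. It is not the authors' pipeline: the preprocessing steps are not specified here, so per-column standardization is assumed, and the input is synthetic random data standing in for one 2048-feature dataset. The function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(features, n_components=2):
    """Project the rows of `features` onto their top principal components."""
    # Assumed preprocessing: standardize each column. The dataset description
    # does not say which preprocessing steps were actually applied.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    scaled = (features - mean) / std

    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return scaled @ vt[:n_components].T


# Synthetic stand-in for one 2048-feature dataset (120 samples); not real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2048))

X2 = pca_reduce(X)
print(X2.shape)
```

The two output columns play the role of `f1` and `f2` in a two-feature dataset; the values published with this record may differ, since they depend on the actual preprocessing used.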

Instructions: 

Each of the 2048-feature datasets includes data columns ranging from f1 to f2048, accompanied by a 'y' column denoting data labels, which are binary values of 0 or 1. Similarly, the 2-feature datasets consist of columns f1, f2, and a 'y' column representing data labels, with values of 0 or 1. All twelve datasets, comprising six with 2048 features each and six with 2 features each, consist of a total of 120 samples.
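A short sketch of how the column layout described above could be checked before use. The file format is not stated on this page, so the example assumes CSV with a header row and columns ordered `f1 … fN, y`; the function name `check_schema` and the inline sample rows are hypothetical and are not taken from the datasets.

```python
import csv
import io


def check_schema(rows, n_features):
    """Validate the header and labels; return the number of samples read."""
    # Expected header under the assumed layout: f1..fN followed by 'y'.
    expected = [f"f{i}" for i in range(1, n_features + 1)] + ["y"]
    reader = csv.DictReader(rows)
    if reader.fieldnames != expected:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")

    count = 0
    for row in reader:
        # Labels are binary; accept '0', '1', '0.0', or '1.0'.
        if float(row["y"]) not in (0.0, 1.0):
            raise ValueError(f"non-binary label: {row['y']}")
        count += 1
    return count


# Made-up rows in the two-feature layout, used only to exercise the check.
sample = io.StringIO("f1,f2,y\n0.12,-1.30,0\n0.85,0.44,1\n")
n = check_schema(sample, n_features=2)
print(n)
```

For a real file, the same function can be called with an open file handle (for example `open(path, newline="")`) and `n_features=2048` or `n_features=2`.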

Funding Agency: 
Australian Research Council
Grant Number: 
Discovery Project-DP210102761