
On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality

Sara El-Ateif (Software Project Management Research Team, ENSIAS, Mohammed V University, Rabat, Morocco)
Ali Idri (Software Project Management Research Team, ENSIAS, Mohammed V University, Rabat, Morocco) (Mohammed VI Polytechnic University, Ben Guerir, Morocco)
José Luis Fernández-Alemán (Informatica y Sistemas, Universidad de Murcia, Murcia, Spain)

Data Technologies and Applications

ISSN: 2514-9288

Article publication date: 10 January 2024


Abstract

Purpose

COVID-19 continues to spread and cause deaths. Physicians diagnose COVID-19 not only with real-time polymerase chain reaction but also with the computed tomography (CT) and chest x-ray (CXR) imaging modalities, depending on the stage of infection. However, with so many patients and so few physicians, it has become difficult to keep pace with the disease. Deep learning models have been developed to assist in this respect, and vision transformers are currently state-of-the-art methods, but most techniques focus on only one modality (CXR).

Design/methodology/approach

This work aims to leverage the benefits of both CT and CXR to improve COVID-19 diagnosis. This paper studies the differences between the convolutional MobileNetV2, ViT DeiT and Swin Transformer models when trained from scratch and when pretrained on the MedNIST medical dataset rather than on the ImageNet dataset of natural images. The comparison is made by reporting six performance metrics, the Scott–Knott Effect Size Difference, the Wilcoxon statistical test and the Borda Count method. We also use the Grad-CAM algorithm to study the models' interpretability. Finally, the models' robustness is tested by evaluating them on Gaussian-noised images.
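The Borda Count method mentioned above aggregates several per-metric rankings into one consensus ranking. A minimal sketch of the idea follows; this is an illustration only, not the authors' implementation, and the model names and rankings shown are hypothetical:

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate several best-to-worst rankings into one consensus order.

    A candidate in position i of an n-item ranking earns n - 1 - i points;
    candidates are then sorted by total points, highest first.
    """
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, candidate in enumerate(ranking):
            points[candidate] += n - 1 - i
    return sorted(points, key=points.get, reverse=True)

# Hypothetical per-metric rankings (e.g. from accuracy, F1 and AUC)
rankings = [
    ["Swin", "MobileNetV2", "DeiT"],
    ["MobileNetV2", "Swin", "DeiT"],
    ["Swin", "DeiT", "MobileNetV2"],
]
consensus = borda_count(rankings)  # ["Swin", "MobileNetV2", "DeiT"]
```

Here Swin earns 2 + 1 + 2 = 5 points, MobileNetV2 earns 3, and DeiT earns 1, so the consensus ranking places Swin first.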

Findings

Although the pretrained MobileNetV2 achieved the best raw performance, the best model overall in terms of performance, interpretability and robustness to noise was the Swin Transformer trained from scratch, using the CXR (accuracy = 93.21 per cent) and CT (accuracy = 94.14 per cent) modalities.

Originality/value

The compared models are pretrained on MedNIST and leverage both the CT and CXR modalities.

Acknowledgements

The authors would like to express their gratitude for the support provided by the Google Ph.D. Fellowship. This research is part of the OASSIS-UMU (PID2021-122554OB-C32) project (supported by the Spanish Ministry of Science and Innovation). This project is also funded by the European Regional Development Fund (ERDF).

Citation

El-Ateif, S., Idri, A. and Fernández-Alemán, J.L. (2024), "On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality", Data Technologies and Applications, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/DTA-01-2023-0005

Publisher

Emerald Publishing Limited

Copyright © 2023, Emerald Publishing Limited
