
A multimodal approach to prostate data generation

Prostate cancer is one of the most frequently diagnosed cancers worldwide; however, identifying clinically significant cases often requires invasive diagnostic procedures. With the development of data-driven solutions, several alternatives are being developed to provide a diagnosis that can reduce the number of biopsies. However, employing patient data raises many privacy concerns, as health-related data is very sensitive.

Two T2-weighted MRI slices. One is real and the other has been artificially generated. Can you guess which is which?

Moreover, much of the compiled data may be imbalanced by its very nature, as most diagnoses identify indolent tumors. Clinically significant prostate cancers, which are more likely to lead to adverse outcomes, are relatively rare, so more of this data is needed to properly train models to detect such cases when they occur.

An emerging solution to this situation is the generation of synthetic data. Researchers from Universitat Politècnica de Catalunya (UPC) are developing methods to generate synthetic prostate data following a multimodal approach. The first front generates partially or fully synthetic tabular data that does not belong to any specific patient but represents the overall patient cohort. The second front generates 2D and 3D MRI images using Generative Adversarial Networks (GANs), with a focus on controlling parameters such as the aggressiveness of the cancer. Both fronts reduce the risk of data leakage, make more data available without additional invasive procedures for new patients, and aim at data augmentation to improve the performance of models such as those for prostate cancer detection.

Partial synthetic tabular data generation for enhanced privacy in healthcare 

This front of UPC work in the FLUTE project focuses on the generation of tabular data relevant for prostate cancer diagnosis, introducing a novel methodology that leverages data visualization tools to enhance the generation of synthetic data while addressing privacy concerns.

Specifically, new partially synthetic data generation (PSDG) algorithms are being developed for tabular data using non-traditional approaches like visualization techniques. In federating data from diverse sources or institutions, the number of features under consideration often varies, typically leading to information being discarded to reach a feature consensus. PSDG, however, enables a reverse approach by expanding smaller datasets through knowledge derived from larger ones.

This proposed methodology effectively imputes missing data and generates synthetic samples, surpassing the number of incomplete data entries. Additionally, it provides a secure framework for data augmentation, enabling the use of data from multiple centres without requiring the transfer of sensitive information.
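The idea of expanding a smaller dataset with knowledge from a larger one can be illustrated with a deliberately simple stand-in. The sketch below is not the PSDG algorithm described above (which relies on visualization techniques); it assumes a plain k-nearest-neighbour imputation over the features that both datasets share, and the function name, array layout and parameter `k` are hypothetical.

```python
import numpy as np

def expand_with_neighbours(small, large, n_shared, k=3):
    """Fill in features missing from `small` using the k nearest rows of `large`.

    ASSUMPTION: a k-NN stand-in for illustration, not the PSDG method.
    small: (n, n_shared) array holding only the shared features.
    large: (m, n_shared + extra) array; its first n_shared columns are the
           shared features, the remaining columns exist only in `large`.
    Returns an (n, n_shared + extra) array.
    """
    shared_large = large[:, :n_shared]
    extra_large = large[:, n_shared:]

    # Standardise shared features with the statistics of the larger dataset
    # so that no single feature dominates the distance.
    mu = shared_large.mean(axis=0)
    sigma = shared_large.std(axis=0)
    sigma[sigma == 0] = 1.0
    z_small = (small - mu) / sigma
    z_large = (shared_large - mu) / sigma

    # Pairwise Euclidean distances, shape (n, m).
    dist = np.linalg.norm(z_small[:, None, :] - z_large[None, :, :], axis=2)

    # Indices of the k closest rows of `large` for each row of `small`.
    nearest = np.argsort(dist, axis=1)[:, :k]

    # Average the extra features of those neighbours, shape (n, extra).
    imputed = extra_large[nearest].mean(axis=1)
    return np.hstack([small, imputed])
```

In a federated setting the raw rows of the larger dataset would not be shared between centres, so a real implementation would have to exchange only derived knowledge; this sketch ignores that constraint for brevity.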

Moreover, synthetic data is currently validated mostly through indirect methods: the generated data is considered valid to the extent that accuracy or precision improves on the augmented database. Our proposal deals with both direct statistical validation methods and new validation measures that focus on privacy, coverage and distance to the original data.
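To make the notions of distance, privacy and coverage more concrete, here is a minimal sketch of three generic measures often used for synthetic tabular data. These are assumptions chosen for illustration, not the measures proposed by the authors: distance to the closest real record, the share of synthetic records that are near-copies of a real one, and the fraction of real records that have a synthetic record nearby. The thresholds `eps` and `radius` are hypothetical.

```python
import numpy as np

def nearest_distances(a, b):
    """For every row of `a`, the Euclidean distance to its nearest row in `b`."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return dist.min(axis=1)

def distance_to_closest_record(synthetic, real):
    """Distance from each synthetic record to the closest real record."""
    return nearest_distances(synthetic, real)

def near_copy_rate(synthetic, real, eps):
    """Share of synthetic records lying within `eps` of some real record.

    A high value suggests the generator is leaking real records.
    """
    return float(np.mean(distance_to_closest_record(synthetic, real) <= eps))

def coverage(synthetic, real, radius):
    """Fraction of real records with at least one synthetic record within `radius`."""
    return float(np.mean(nearest_distances(real, synthetic) <= radius))
```

Features are assumed to be numeric and already scaled to comparable ranges; with mixed categorical and numeric columns a different distance would be needed.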

Towards the controllability of MRI sampling to improve clinically significant prostate cancer detection and classification

Our main goal is to address the scarcity of prostate MRI and offer a greater volume of images for analysis and model training. However, this task must be carried out with an emphasis on ensuring high-quality imaging and maintaining control over the sampling process to effectively compensate for the likely imbalance in the dataset, which is essential for subsequent tasks such as tumour detection and classification.

Realistic MRIs are currently generated with Generative Adversarial Networks (GANs). A slice from a T2w MRI volume generated with 3D-StyleGAN2-ADA is shown in the Figure. The first image corresponds to a real sample, while the second is synthetic. Furthermore, we use GANs to synthesize prostate MRIs and their lesion segmentations simultaneously, while focusing on enhancing latent space control and interpretability. Principal Component Analysis (PCA) is applied to uncover key factors impacting image quality and lesion representation. This combination of methods identifies directions in the latent space associated with anatomical changes and lesion growth, allowing controlled sampling to generate images with varying levels of lesion presence.
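The latent-space part of this idea can be sketched independently of any particular generator. The example below assumes that a batch of latent vectors has already been sampled from the GAN (here represented by a plain NumPy array) and shows how PCA yields candidate directions along which a latent can be shifted. The function names and the step scaling are hypothetical, and feeding the edited latents back into the generator is left out.

```python
import numpy as np

def latent_pca(latents, n_components):
    """PCA of sampled latent vectors via SVD.

    latents: (n_samples, latent_dim) array, assumed to come from the generator.
    Returns the mean latent, the principal directions (n_components, latent_dim)
    and the standard deviation of the samples along each direction.
    """
    mean = latents.mean(axis=0)
    centred = latents - mean
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    std = singular_values[:n_components] / np.sqrt(len(latents) - 1)
    return mean, vt[:n_components], std

def move_along(latent, direction, std, steps):
    """Return one edited latent per step, shifted by step * std along `direction`."""
    offsets = np.outer(np.asarray(steps, dtype=float) * std, direction)
    return latent[None, :] + offsets
```

In use, one would decode the latents returned by `move_along` for a range of steps (for example from -2 to 2) and inspect which component, if any, changes lesion presence. Which component corresponds to which anatomical factor has to be established empirically.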

Moreover, to ensure that the MRI samples meet the previously outlined requirements, state-of-the-art metrics for assessing the quality of generated images, such as the Fréchet Inception Distance (FID) or the Kernel Inception Distance (KID), will be complemented with expert supervision: surveys aimed at radiologists will allow them to assess whether the generated images can be distinguished from real ones.
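For reference, FID compares the mean and covariance of feature vectors extracted from real and generated images, using the Fréchet distance between two Gaussians: ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)). The sketch below computes only this distance and assumes the feature vectors are already available as arrays; in the standard metric they come from a pretrained Inception network, a step omitted here. The function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two sets of feature vectors.

    feats_real, feats_gen: (n_samples, n_features) arrays. ASSUMPTION: features
    were extracted beforehand by some fixed network (not shown here).
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((C_r C_g)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of C_r C_g, which are real and non-negative for covariance matrices.
    eigenvalues = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Lower values mean the two feature distributions are closer. Reliable estimates need many more samples than feature dimensions, which matters for small medical datasets.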

In parallel, deep learning-based CAD algorithms for segmentation and classification are being developed. To provide further detail, the system uses an nnU-Net model for prostate segmentation and incorporates an integrated classification head for lesion classification.

This system is currently being trained using real prostate MRI data; future stages aim to train it using generated samples and compare the outcomes, thus providing an indirect quality measure.

Montse Pardas Feliu, Cecilio Angulo Bahon, Veronica Vilaplana Besler
