The Cancer Imaging Archive

CT4Harmonization-Multicentric | A Multi-Centric Anthropomorphic 3D CT Phantom-Based Benchmark Dataset for Harmonization

DOI: 10.7937/m0pb-bh69 | Data Citation Required | Image Collection

Location: Liver, Phantom
Species: Phantom
Subjects: 1
Data Types: CT, SEG
Cancer Types: Hemangioma, Pathologically Benign, Metastatic disease
Size: 252.38 GB
Supporting Data: Software/Source Code
Status: Public, Complete

Summary

Abstract


This collection introduces an open-source, anthropomorphic phantom-based dataset of CT scans for developing harmonization methods for deep learning-based models. The phantom mimics human anatomy, allowing repeated scans without radiation delivery to real patients and isolating scanner effects by removing inter- and intra-patient variations. The dataset includes 268 image series from 13 scanners, 4 manufacturers, and 8 institutions, repeated 18-30 times at a 10 mGy dose using a harmonized protocol. Including the additional dose levels acquired with the same 13 scanners and harmonized protocol, the collection totals 1,378 image series. The phantom consists of three compartments: thorax, liver, and test patterns. The 3D-printed liver includes three types of abnormal regions of interest (two cysts, a metastasis, and a hemangioma), with ground truth segmentation masks that can be used for classification and segmentation.

Introduction


Recent breakthroughs in data-driven algorithms and artificial intelligence (AI) applications in medical information processing have introduced tremendous potential for AI-assisted, image-based personalized medicine that addresses tasks such as segmentation, diagnosis, and prognosis. However, these opportunities come with two challenges: large data requirements and consistency in data distribution. Machine and deep learning algorithms have extreme data demands, and acquiring and annotating even a single observation is costly (e.g., one event corresponds to one patient in a survival study). These challenges encourage pooling data collected from multiple centers and scanners to reach a critical mass of training data. However, pooling data from multiple centers introduces significant variability in acquisition parameters and in the specifics of image reconstruction algorithms, leading to domain shifts and inconsistencies in the collected data. The domain shift introduced by this scanner variability reduces the value of merging data from multiple centers and degrades performance on predictive tasks such as segmentation, diagnosis, and prognosis, including in federated scenarios. Furthermore, domain shifts between training and test or inference data entail a high risk of incorrect and uncontrolled predictions for treatment planning and personalized medicine when inference is based on a scanner (and/or acquisition setting) that was not represented in the training data. Although this challenge applies to all medical imaging modalities, it is particularly important for computed tomography (CT) images due to the wide variability in manufacturers, acquisition parameters and dose, reconstruction algorithms, and customized parameter tunings across centers.

This dataset provides the material to reproduce several different research works conducted in conjunction with it. Researchers can use this dataset for developing their own harmonization methods at both the image and feature levels to tackle the data drift problem from one scanner to another and across different manufacturers. We also release baseline performance metrics for the similarity of scans in the image domain and feature space without harmonization. This will set a baseline to evaluate the effectiveness of various harmonization techniques in the image and feature domains.

Methods


The following subsections describe how the data were selected, acquired, and prepared for publication.

Data Acquisition 

Before the CT scans of the phantom were acquired, a survey was carried out to collect realistic acquisition and reconstruction parameter settings that are used in clinical thoracoabdominal CT scans for oncological staging, tumor search, and infectious focus detection in the portal venous contrast phase. The survey included 21 CT scanners from 9 centers across Switzerland. For the Siemens SOMATOM Definition Edge scanner, the resulting harmonized protocol translates to a tube voltage of 120 kV, a tube current-time product of 148 mAs, a pitch of 1.000, and a rotation time of 0.5 seconds. The collimation was set to 38.4 mm, with a slice thickness/increment of 2.0 mm and a pixel spacing of 1.367 mm. Due to vendor-specific limitations, these parameters were adapted to the closest possible values on each scanner. The scans were repeated for 13 scanners from 4 manufacturers (Siemens, Philips, General Electric (GE), and Toshiba) at five dose levels (1 mGy, 3 mGy, 6 mGy, 10 mGy, 14 mGy). Only the tube current-time product (in mAs) was adjusted to set the various dose levels; all other parameters were kept the same. For each CT scanner and each dose level, 10 repeated scans (identified in the image series as #1 to #10) with identical settings were performed, except for the Toshiba Aquilion Prime SP scanner at 10 mGy, where only 9 repeated scans were inadvertently acquired. Thus, a total of 649 CT scans were performed.
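Because only the tube current-time product was changed between dose levels, the mAs setting for each level can be estimated by linear scaling, since dose is proportional to mAs when all other parameters are fixed. The sketch below illustrates this. It assumes, purely for illustration, that the 148 mAs quoted for the Siemens SOMATOM Definition Edge corresponds to the 10 mGy level; the text does not state this pairing, and the actual per-scanner values are stored in the DICOM tags.

```python
# Minimal sketch: linear mAs scaling across the five dose levels.
# ASSUMPTION (not stated in the text): 148 mAs corresponds to 10 mGy
# on the Siemens SOMATOM Definition Edge. All names are hypothetical.
REFERENCE_MAS = 148.0
REFERENCE_DOSE_MGY = 10.0
DOSE_LEVELS_MGY = (1, 3, 6, 10, 14)


def mas_for_dose(target_dose_mgy, ref_mas=REFERENCE_MAS,
                 ref_dose_mgy=REFERENCE_DOSE_MGY):
    """Scale mAs linearly with the target dose, other parameters fixed."""
    if target_dose_mgy <= 0 or ref_dose_mgy <= 0:
        raise ValueError("dose values must be positive")
    return ref_mas * target_dose_mgy / ref_dose_mgy


# Estimated mAs per dose level under the assumption above.
mas_by_dose = {d: round(mas_for_dose(d), 1) for d in DOSE_LEVELS_MGY}

if __name__ == "__main__":
    for dose, mas in mas_by_dose.items():
        print(f"{dose:>2} mGy -> {mas} mAs (estimate)")
```

These are estimates only; the values actually used on each scanner should be read from the DICOM headers.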

Images were reconstructed using two or three different reconstruction algorithms per CT scan, resulting in two or three CT image series per CT scan. For all CT scans, a vendor-specific iterative reconstruction (IR) algorithm with a standard soft tissue kernel was used, resulting in 649 IR CT series. In addition, filtered backprojection (FBP) reconstruction with a standard soft tissue kernel was used for all CT scans, resulting in another 649 FBP CT series. For 2 of the 13 CT scanners, a deep learning (DL)-based reconstruction algorithm was available. For one of these scanners, it was used for three dose levels (1 mGy, 3 mGy, 6 mGy), resulting in 30 additional CT series. For the second scanner, DL reconstruction was used for all five dose levels, resulting in 50 additional CT series. In summary, the dataset presented in this work consists of 1,378 series reconstructed from 649 CT scans.
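The scan and series counts above can be checked with a few lines of arithmetic. The sketch below recomputes them from the stated numbers of scanners, dose levels, and repeats, and also reproduces the 268 series reported in the Abstract for the 10 mGy level. Variable names are illustrative only.

```python
# Consistency check of the scan and series counts stated in the text.
N_SCANNERS = 13
DOSE_LEVELS_MGY = (1, 3, 6, 10, 14)
REPEATS = 10
MISSING_SCANS = 1  # Toshiba Aquilion Prime SP at 10 mGy: 9 repeats, not 10

# Total CT scans: every scanner, dose level, and repeat, minus the missing one.
n_scans = N_SCANNERS * len(DOSE_LEVELS_MGY) * REPEATS - MISSING_SCANS

# Every scan has one IR and one FBP series.
n_ir = n_scans
n_fbp = n_scans

# DL reconstruction: one scanner at three dose levels, another at all five.
n_dl = 3 * REPEATS + 5 * REPEATS

n_series = n_ir + n_fbp + n_dl

# 10 mGy subset: IR and FBP for every 10 mGy scan, plus DL series from the
# one scanner that used DL reconstruction at all five dose levels.
n_scans_10mgy = N_SCANNERS * REPEATS - MISSING_SCANS
n_series_10mgy = 2 * n_scans_10mgy + REPEATS

if __name__ == "__main__":
    print("scans:", n_scans)
    print("series:", n_series)
    print("series at 10 mGy:", n_series_10mgy)
```

The per-scanner series count at 10 mGy ranges from 18 (9 scans with IR and FBP) to 30 (10 scans with IR, FBP, and DL), which matches the "18-30" repetitions mentioned in the Abstract.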

Data Analysis

The DICOM data files presented in conjunction with this repository did not undergo any preprocessing steps, in order to preserve all sources of variation—such as spatial shifts and voxel spacing differences introduced by various scanners. However, this repository is linked to a data descriptor paper where we thoroughly analyzed the data, as well as a Git repository that provides the code for resampling the scans to a uniform voxel spacing and performing registration.
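As a rough illustration of what resampling to a uniform voxel spacing implies for the image grid, the sketch below computes the output grid size for a given input grid and spacing. It is not the code from the linked Git repository; the function name and the example volume shape are hypothetical, and only the 2.0 mm slice increment and 1.367 mm pixel spacing come from the protocol described above.

```python
# Minimal sketch: grid size after resampling to a target voxel spacing.
# Not the repository's implementation; names and the example shape are
# hypothetical.
def resampled_shape(shape, spacing_mm, target_spacing_mm):
    """Return the voxel grid size that covers the same physical extent.

    shape, spacing_mm, and target_spacing_mm are sequences of equal length,
    ordered consistently (e.g. slices, rows, columns).
    """
    if not (len(shape) == len(spacing_mm) == len(target_spacing_mm)):
        raise ValueError("all arguments must have the same length")
    if any(s <= 0 for s in spacing_mm) or any(t <= 0 for t in target_spacing_mm):
        raise ValueError("spacings must be positive")
    # Physical extent along each axis is n * spacing; divide by the new spacing.
    return tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(shape, spacing_mm, target_spacing_mm)
    )


if __name__ == "__main__":
    # Hypothetical volume of 343 slices with a 512 x 512 matrix.
    original_shape = (343, 512, 512)
    original_spacing = (2.0, 1.367, 1.367)  # mm, from the protocol text
    print(resampled_shape(original_shape, original_spacing, (1.0, 1.0, 1.0)))
```

The interpolation itself (for example linear or spline) is a separate choice and is handled by the repository code referenced above.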

Usage Notes


The dataset presented in this repository includes raw DICOM files with all acquisition parameters stored as DICOM tags, without any specific pre-processing. The data is organized into folders, one per scanner, with IDs from A1 to H2 representing 8 institutions. Each scanner folder contains all image series reconstructed with the different reconstruction methods, and each image series includes a folder containing the mask for the various regions of interest in the liver tissue.
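When iterating over the scanner folders, it can help to group them by institution. The sketch below assumes that the letter in each scanner ID (A to H) identifies the institution and the digit identifies a scanner within it; this reading is inferred from the "A1-H2" range and the 8 institutions, and is not confirmed by the text. The example IDs are a hypothetical subset, not the actual list of 13 scanners.

```python
import re
from collections import defaultdict

# ASSUMPTION: scanner IDs are one letter A-H (institution) followed by one
# digit (scanner index within that institution).
SCANNER_ID = re.compile(r"([A-H])([1-9])")


def parse_scanner_id(folder_name):
    """Split a scanner folder name such as 'A1' into ('A', 1)."""
    match = SCANNER_ID.fullmatch(folder_name)
    if match is None:
        raise ValueError(f"unexpected scanner folder name: {folder_name!r}")
    return match.group(1), int(match.group(2))


def group_by_institution(folder_names):
    """Map each institution letter to the sorted scanner indices found."""
    groups = defaultdict(list)
    for name in folder_names:
        institution, index = parse_scanner_id(name)
        groups[institution].append(index)
    return {inst: sorted(idx) for inst, idx in sorted(groups.items())}


if __name__ == "__main__":
    example_folders = ["A2", "A1", "H2"]  # hypothetical subset
    print(group_by_institution(example_folders))
```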

Data Access

Version 1: Updated

Title: Images and Segmentations
Data Type: CT, SEG
Format: DICOM
Access Points: Download requires NBIA Data Retriever
Subjects: 1
Studies: 1,378
Series: 2,756
Images: 467,224
License: CC BY 4.0

Additional Resources for this Dataset

The code to preprocess and load the data from raw DICOM files is provided in the following Git repository: https://github.com/QA4IQI/qa4iqi.github.io
The code for data preparation, unifying the voxel spacing and performing registration, is provided here: https://github.com/medgift/Harmonization-Dataset

Citations & Data Usage Policy

Data Citation Required: Users must abide by the TCIA Data Usage Policy and Restrictions. Attribution must include the following citation, including the Digital Object Identifier:

Data Citation

Amirian, M., Bach, M., Jimenez del Toro, O. A., Aberle, C., Schaer, R., Andrearczyk, V., Maestrati, J.-F., Flouris, K., Obmann, M., Dromain, C., Dufour, B., Poletti, P.-A., von Tengg-Kobligk, H., Alkadhi, H., Konukoglu, E., Müller, H., Stieltjes, B., & Depeursinge, A. (2025). A Multi-Centric Anthropomorphic 3D CT Phantom-Based Benchmark Dataset for Harmonization (CT4Harmonization-Multicentric) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/M0PB-BH69

Acknowledgements

This work was partly supported by the Swiss Personalized Health Network (SPHN) with the QA4IQI Quality assessment for interoperable quantitative computed tomography imaging project DMS2445 and the IMAGINE project.

Funding Sources

This work was also partially supported by the Swiss National Science Foundation (SNSF, grants 325230_197477 and 205320_219430).

Related Publications

The authors recommended the following as the best source of additional information about this dataset:

No other publications were recommended by dataset authors.

Research Community Publications

TCIA maintains a list of publications that leveraged this dataset. If you have a manuscript you’d like to add please contact TCIA’s Helpdesk.
