The Cancer Imaging Archive

MIDI-B-Test-MIDI-B-Validation | Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test)

DOI: 10.7937/cf2p-aw56 | Data Citation Required | Image Collection

Location: Various
Species: Human
Subjects: 538
Data Types: Other, PT, CT, MR, SR, DX, MG, CR, US
Cancer Types: Various
Size: 71.07 GB
Supporting Data: Software/Source Code
Status: Public, Complete
Updated: 2025/05/02

Summary

Abstract


These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.

This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.

Introduction


Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).

These resources complement our earlier work (Pseudo-PHI-DICOM-Data) hosted on The Cancer Imaging Archive (TCIA). As an example of its use, we also provide a version curated by the TCIA curation team. This resource builds upon best practices emphasized by the MIDI Task Group, which underscores the importance of transparency, documentation, and reproducibility in de-identification workflows. These themes also featured at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).

This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.

Methods


Subject Inclusion and Exclusion Criteria

The source data were selected from imaging already hosted in de-identified form on TCIA. Images containing faces were excluded, and no new human studies were performed for this project.

Data Acquisition 

To build the synthetic dataset, image series were selected from TCIA’s curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (including GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described under Data Analysis.

Data Analysis

Synthetic pools of PHI, such as subject and scanning-institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These values were inserted into the DICOM metadata of selected imaging files using a system of inheritable rule-based templates that define re-identification functions for data insertion and logging for answer key creation. Text was also burned into the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework supports transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.
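The template-driven insertion and logging described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: a plain dict stands in for the DICOM header, a seeded pick from a small hard-coded pool stands in for Faker, and the rule layout, the `inject` function, the action label, and the answer-key fields are all hypothetical.

```python
import random

# Hypothetical stand-in for a Faker-generated pool of synthetic names.
NAME_POOL = ["DOE^JANE", "ROE^RICHARD", "SMITH^ALEX"]

# Hypothetical base template: header keyword -> (value generator, expected action).
BASE_RULES = {
    "PatientName": (lambda rng: rng.choice(NAME_POOL), "remove"),
    "InstitutionName": (lambda rng: "Synthetic General Hospital", "remove"),
}

# A more specific template inherits the base rules and adds one of its own.
CT_RULES = {
    **BASE_RULES,
    "OperatorsName": (lambda rng: rng.choice(NAME_POOL), "remove"),
}


def inject(header, rules, seed=0):
    """Return a copy of `header` with synthetic values inserted, plus a log.

    The log plays the role of an answer key: one row per inserted value,
    recording what was inserted and what a de-identifier should do with it.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    infused = dict(header)     # leave the caller's header untouched
    answer_key = []
    for keyword, (generate, action) in rules.items():
        value = generate(rng)
        infused[keyword] = value
        answer_key.append(
            {"keyword": keyword, "inserted_value": value, "expected_action": action}
        )
    return infused, answer_key


clean_header = {"Modality": "CT", "PatientName": "ANON"}
infused_header, answer_key = inject(clean_header, CT_RULES)
for row in answer_key:
    print(row["keyword"], "->", row["inserted_value"])
```

Logging each insertion at the moment it happens is what makes the answer key trustworthy: the key is a by-product of the same rule that wrote the value, so the two cannot drift apart.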

Usage Notes


This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII-infused DICOM collection, accompanied by a validation script and answer keys for testing, refining, and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide and as an example of medical image curation best practices. For the purposes of the de-identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, one for Validation and the other for Testing.

To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in SQLite database format is provided. These components are for use with the Python validation script listed under Additional Resources (item 4). By combining these components, a user developing or evaluating de-identification methods can check that their output meets a specification for successfully de-identifying medical image data.
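How a mapping file ties curated output back to the answer key can be sketched as below. This is only an illustration of the linkage, assuming a two-column CSV and a single answer-key table: the column names (`id_old`, `id_new`), the table name (`answer_data`), its fields, and the UIDs are hypothetical and are not taken from the distributed files or from the validation script.

```python
import csv
import io
import sqlite3

# Hypothetical UID mapping: original synthetic UID -> UID after curation.
MAPPING_CSV = """id_old,id_new
1.2.3.100,9.8.7.100
1.2.3.200,9.8.7.200
"""


def load_mapping(text):
    """Return {curated UID: original UID} from mapping CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id_new"]: row["id_old"] for row in reader}


# In-memory SQLite database standing in for an answer key file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answer_data (series_uid TEXT, tag_name TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO answer_data VALUES (?, ?, ?)",
    [
        ("1.2.3.100", "PatientName", "remove"),
        ("1.2.3.200", "InstitutionName", "remove"),
    ],
)


def expected_actions(conn, mapping, curated_uid):
    """Look up answer-key rows for a curated series via its original UID."""
    original_uid = mapping[curated_uid]
    return conn.execute(
        "SELECT tag_name, action FROM answer_data"
        " WHERE series_uid = ? ORDER BY tag_name",
        (original_uid,),
    ).fetchall()


uid_map = load_mapping(MAPPING_CSV)
print(expected_actions(conn, uid_map, "9.8.7.100"))  # [('PatientName', 'remove')]
```

The mapping is needed because de-identification normally replaces UIDs, so without it the evaluator has no way to tell which answer-key rows apply to which output series.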

Data Access

Version 1: Updated 2025/05/02

Title | Data Type | Format | Access Points | Subjects | Studies | Series | Images | License
Synthetic Validation Images (MIDI-B-Synthetic-Validation) | PT, CT, MR, US, DX, SR, MG, CR | DICOM | Download requires NBIA Data Retriever | 216 | 241 | 280 | 23,921 | CC BY 4.0
Synthetic Test Images (MIDI-B-Synthetic-Test) | PT, CT, MR, SR, DX, MG, CR, US | DICOM | Download requires NBIA Data Retriever | 322 | 364 | 428 | 29,660 | CC BY 4.0
Validation Answer Key | Other | SQLITE and ZIP | Download requires IBM-Aspera-Connect plugin | - | - | - | - | CC BY 4.0
Test Answer Key | Other | SQLITE and ZIP | Download requires IBM-Aspera-Connect plugin | - | - | - | - | CC BY 4.0
Curated Validation Images (MIDI-B-Curated-Validation) | PT, CT, MR, SR, MG, DX, US, CR | DICOM | Download requires NBIA Data Retriever | 216 | 241 | 280 | 23,921 | CC BY 4.0
Curated Test Images (MIDI-B-Curated-Test) | PT, CT, MR, SR, CR, DX, US, MG | DICOM | Download requires NBIA Data Retriever | 322 | 364 | 428 | 29,656 | CC BY 4.0
Validation Patient Mapping | Other | CSV | - | - | - | - | - | CC BY 4.0
Validation UID Mapping | Other | CSV | - | - | - | - | - | CC BY 4.0
Test Patient Mapping | Other | CSV | - | - | - | - | - | CC BY 4.0
Test UID Mapping | Other | CSV | - | - | - | - | - | CC BY 4.0
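As a quick consistency check on the figures in the Data Access table, the short sketch below tallies the listed counts. The numbers are copied from the table; the dictionary layout and variable names are only illustrative. It confirms that the Validation and Test subjects add up to the 538 subjects stated in the abstract, and it surfaces a four-image difference between the synthetic and curated Test sets, for which this page gives no explanation.

```python
# Counts copied from the Data Access table: (subjects, studies, series, images).
COUNTS = {
    "Synthetic Validation": (216, 241, 280, 23921),
    "Synthetic Test": (322, 364, 428, 29660),
    "Curated Validation": (216, 241, 280, 23921),
    "Curated Test": (322, 364, 428, 29656),
}

# Validation and Test subjects together should match the stated 538.
total_subjects = COUNTS["Synthetic Validation"][0] + COUNTS["Synthetic Test"][0]
print(total_subjects)  # 538

# Compare image counts between each synthetic subset and its curated version.
for subset in ("Validation", "Test"):
    gap = COUNTS[f"Synthetic {subset}"][3] - COUNTS[f"Curated {subset}"][3]
    print(subset, "image difference:", gap)
```

A check like this is worth running after download as well, since a mismatch between local file counts and the published counts usually points to an incomplete transfer.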
Related Datasets
Pseudo-PHI-DICOM-Data

Additional Resources for this Dataset

1. Previous dataset submission: https://www.cancerimagingarchive.net/collection/pseudo-phi-dicom-data/

2. Challenge website: https://www.synapse.org/Synapse:syn53065760/wiki/625274

3. Workshop website: https://wiki.nci.nih.gov/display/MIDI/2024+MIDI-B+Challenge+Workshop

4. Validation Script Code: https://github.com/CBIIT/MIDI_validation_script

Citations & Data Usage Policy

Data Citation Required: Users must abide by the TCIA Data Usage Policy and Restrictions. Attribution must include the following citation, including the Digital Object Identifier:

Data Citation

Rutherford, M. W., Nolan, T., Pei, L., Wagner, U., Pan, Q., Farmer, P., Smith, K., Kopchick, B., Opsahl-Ong, L., Sutton, G., Clunie, D. A., Farahani, K., & Prior, F. (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/cf2p-aw56

Acknowledgements

We would like to acknowledge the National Cancer Institute for funding and actively participating in the project that generated the synthetic datasets being published here and the TCIA curation team, led by Tracy Nolan, MSc., who curated this data. The original data came from multiple institutions and multiple TCIA image collections.

Related Publications

The authors recommended the following as the best source of additional information about this dataset:

No other publications were recommended by dataset authors.

Research Community Publications

TCIA maintains a list of publications that leveraged this dataset. If you have a manuscript you’d like to add please contact TCIA’s Helpdesk.