BreastDCEDL-ISPY2

The Cancer Imaging Archive

BreastDCEDL-ISPY2 | Curated, Segmented, and Deep Learning-Optimized I-SPY 2 MRI Dataset for Prediction of pCR, HR, and HER2 Status

DOI: 10.7937/42wq-th78 | Data Citation Required | Analysis Result

Cancer Types Location Subjects Related Collections Size External Resources
Breast Cancer Breast 985 54GB Clinical

Abstract

The BreastDCEDL_ISPY2 dataset is a curated, deep learning–ready resource that integrates pretreatment 3D Dynamic Contrast-Enhanced MRI (DCE-MRI) scans from 982 breast cancer patients enrolled in the I-SPY2 TRIAL, sourced from The Cancer Imaging Archive (TCIA). Imaging data has been standardized from raw DICOM to 3D NIfTI volumes, preserving signal integrity and spatial resolution.

The dataset includes extensive non-imaging supporting data, such as tumor annotations, DICOM metadata, and demographics.

To facilitate reproducible research, fixed benchmark train/validation/test splits are provided, stratified by biomarker subtypes and response outcomes.

This dataset enables diverse research applications, including the development of deep learning models for predicting treatment response, radiomics-based analyses, and hormone receptor (HR) and HER2 status classification. It also facilitates benchmarking of advanced architectures such as Vision Transformers, and supports clinical translation efforts in precision oncology.

Introduction

Breast cancer remains one of the most prevalent causes of cancer-related mortality worldwide, and early detection coupled with accurate treatment response monitoring is essential for improving outcomes. Dynamic Contrast-Enhanced MRI (DCE-MRI) is a cornerstone modality for breast cancer imaging, offering unique insights into tumor vascularity, morphology, and treatment response. Despite its clinical importance, progress in computational and deep learning–based analysis of DCE-MRI has been hindered by the lack of large, standardized, and publicly available datasets.

The BreastDCEDL_ISPY2 dataset was created to address this gap by consolidating and harmonizing imaging and clinical data from the I-SPY2 TRIAL. With 982 patients across more than 22 institutions, it represents one of the largest publicly accessible collections of pre-treatment DCE-MRI scans for breast cancer. Importantly, the dataset includes standardized 3D NIfTI volumes, tumor annotations, voxel-based tumor volumes, and harmonized clinicopathologic metadata such as hormone receptor status, HER2 status, and pathologic complete response outcomes.

What makes BreastDCEDL_ISPY2 unique is its deep learning–ready structure and benchmark design. By providing consistent preprocessing, unified annotations, and predefined training/validation/test splits, the dataset enables reproducible research and direct comparison of computational methods. It lowers the technical barriers to working with heterogeneous MRI data, facilitates the development and validation of advanced machine learning models—including transformer-based architectures—and supports clinically relevant investigations into treatment response prediction and personalized therapy planning.

The dataset includes extensive non-imaging supporting data:

  • Tumor annotations include both segmentation masks and region-of-interest (ROI) delineations.
  • Accompanying DICOM metadata encompasses voxel dimensions, signal enhancement ratio (SER) time points, and contrast agent injection timestamps.
  • Clinical metadata provides comprehensive patient information, including demographic variables (age, race, menopausal status), hormone receptor (HR) and HER2 receptor status, as well as treatment outcomes, specifically pathologic complete response (pCR).

Methods

Subject Inclusion and Exclusion Criteria

The BreastDCEDL_ISPY2 dataset integrates patient data from the I-SPY2 TRIAL (2010–2016), yielding 985 patients with pretreatment DCE-MRI scans. Inclusion required at least three acquisitions (pre-contrast, early post-contrast, late post-contrast). Patients with incomplete imaging or missing essential metadata were excluded (3 cases), leaving 982 patients.

The cohort reflects a clinically diverse population, with a mean age of ~50 years, a predominantly White racial composition (~17% Black; other groups underrepresented), and tumor subtypes spanning HR+/HER2−, HER2+, and triple-negative cancers. pCR status is available for the majority of patients. Treatment histories reflect standardized neoadjuvant chemotherapy protocols.

While the dataset includes multicenter acquisitions (22+ institutions), potential biases include predominance of U.S.-based populations, underrepresentation of some ethnic groups, and the trial setting, which may differ from community practice.

Data Acquisition

  • MRI Acquisition: Pretreatment 3D DCE-MRI acquired on 1.5T and 3T scanners. Protocols varied across institutions but consistently included pre-contrast, early post-contrast, and late post-contrast acquisitions after gadolinium administration. Key technical parameters (TR, TE, slice thickness, voxel size, FOV) are preserved in metadata.
  • Clinical Data: Captured through electronic trial databases. Variables include demographics (age, race, menopausal status), receptor status (HR, HER2), tumor volume, and treatment outcome (pCR).
  • Other Data: Signal Enhancement Ratio (SER) maps and voxel-based tumor volumes are provided.
  • Missing Data: 3 patients were excluded due to incomplete imaging or metadata.
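As a point of reference for the SER maps mentioned above, the signal enhancement ratio is conventionally defined voxel-wise as (early − pre) / (late − pre). The sketch below assumes that conventional definition; the page does not state the exact formula, thresholds, or masking used to produce the distributed SER maps, so the function name, the `eps` guard, and the toy values are illustrative only.

```python
import numpy as np


def signal_enhancement_ratio(pre, early, late, eps=1e-6):
    """Voxel-wise SER = (early - pre) / (late - pre).

    Voxels whose late enhancement is (near) zero are set to 0 rather
    than divided, to avoid infinities. This guard is an assumption of
    this sketch, not a documented step of the dataset's pipeline.
    """
    pre = np.asarray(pre, dtype=np.float64)
    early = np.asarray(early, dtype=np.float64)
    late = np.asarray(late, dtype=np.float64)

    early_enh = early - pre
    late_enh = late - pre

    ser = np.zeros_like(pre)
    valid = np.abs(late_enh) > eps
    np.divide(early_enh, late_enh, out=ser, where=valid)
    return ser


# Toy intensities for three voxels (hypothetical values).
pre = np.array([100.0, 100.0, 100.0])
early = np.array([200.0, 150.0, 100.0])
late = np.array([150.0, 200.0, 100.0])
ser = signal_enhancement_ratio(pre, early, late)
```

In this toy case the first voxel washes out (SER above 1), the second keeps enhancing (SER below 1), and the third does not enhance at all and is left at 0.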

Data Analysis

  • File Format Conversions: Raw DICOM images were converted into standardized 3D NIfTI volumes using a custom pipeline. The conversion preserves the original 16-bit dynamic range by storing voxel values as 64-bit floating-point data.
  • Manual Annotation and Segmentation Protocols: Tumor segmentations and ROI delineations were provided by I-SPY2 radiologists and converted to binary 3D masks aligned to the imaging volumes. When multiple tumors were present, only the primary tumor was annotated.
  • Quality Control and Validation: Tumor annotations were reviewed for alignment with MRI volumes. Consistency checks ensured tumor masks aligned across temporal phases. Patients with fewer than three valid acquisitions were excluded.
  • Scripts, Code, and Software Versions: Pipelines and Vision Transformer implementation are available on GitHub: https://github.com/naomifridman/BreastDCEDL

Usage Notes

Data Organization and Naming Conventions
All imaging data are provided in standardized 3D NIfTI format, converted from original DICOM files while preserving full signal integrity. File names follow the structure:
<patientID>_acq<acquisitionNumber>.nii.gz
where patientID corresponds to the anonymized subject identifier in the I-SPY2 trial and acquisitionNumber reflects temporal ordering (e.g., pre-contrast, early post-contrast, late post-contrast). The dataset’s structure and organization are illustrated in Figure 1. To facilitate validation and exploration of the MRI data, we provide a comprehensive DICOM tag dictionary and, in the project’s Git repository, a mapping table linking the raw DICOM slices to the 3D NIfTI planes.
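The naming pattern above can be parsed with a short regular expression. The following is a minimal sketch, assuming file names match `<patientID>_acq<acquisitionNumber>.nii.gz` exactly; the patient IDs and file names in the example are made up, and the sketch makes no assumption about whether acquisition numbering starts at 0 or 1. It also flags patients with fewer than three acquisitions, mirroring the inclusion criterion described under Methods.

```python
import re
from collections import defaultdict

# Pattern taken from the naming convention above.
ACQ_PATTERN = re.compile(r"^(?P<patient_id>.+)_acq(?P<acq>\d+)\.nii\.gz$")


def group_acquisitions(filenames):
    """Map patient ID -> file names sorted by acquisition number.

    Names that do not match the pattern are ignored.
    """
    by_patient = defaultdict(list)
    for name in filenames:
        match = ACQ_PATTERN.match(name)
        if match is None:
            continue
        acq_number = int(match.group("acq"))
        by_patient[match.group("patient_id")].append((acq_number, name))
    return {
        pid: [name for _, name in sorted(items)]
        for pid, items in by_patient.items()
    }


# Hypothetical file names, for illustration only.
example_files = [
    "PATIENT_A_acq2.nii.gz",
    "PATIENT_A_acq0.nii.gz",
    "PATIENT_A_acq1.nii.gz",
    "PATIENT_B_acq0.nii.gz",
    "notes.txt",
]
grouped = group_acquisitions(example_files)

# Patients with fewer than three acquisitions in the listing.
incomplete = sorted(pid for pid, files in grouped.items() if len(files) < 3)
```

Sorting by the integer acquisition number rather than by file name keeps the temporal order correct even if a patient has ten or more acquisitions.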

Training, Validation, and Test Groupings
The dataset includes fixed benchmark partitions to facilitate reproducible research. These consist of:

  • Training set: 784 patients (32.1% pCR rate).
  • Validation set: 99 patients (32.3% pCR rate).
  • Test set: 99 patients (32.3% pCR rate).

Partitioning was stratified by biomarker status (HR, HER2) and pCR outcomes to ensure balanced distributions across subsets. Users are encouraged to adopt these predefined splits when developing predictive models to enable fair comparisons across studies.

Clinical Data Files
Clinical and pathologic metadata are distributed in standardized TSV format. Variables include demographics (age, race, menopausal status), biomarker status (HR, HER2), tumor volume, and pCR outcomes. TSV files can be opened with standard spreadsheet software (e.g., Microsoft Excel, LibreOffice Calc) or programmatically accessed using Python (pandas) or R.
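One quick sanity check when adopting the predefined splits is to recompute the pCR rate per partition from the clinical TSV. The sketch below uses only the Python standard library and an in-memory stand-in for the file; the column names `pid`, `split`, and `pCR` are assumptions made for illustration and should be replaced with the names given in the dataset's data dictionary. Rows with a missing pCR value are skipped, since pCR is available for most but not all patients.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for a clinical TSV. Column names are hypothetical.
EXAMPLE_TSV = (
    "pid\tsplit\tpCR\n"
    "P1\ttrain\t1\n"
    "P2\ttrain\t0\n"
    "P3\ttrain\t0\n"
    "P4\ttrain\t\n"
    "P5\ttest\t1\n"
    "P6\ttest\t0\n"
)


def pcr_rate_by_split(tsv_file, split_col="split", pcr_col="pCR"):
    """Return {split name: fraction of patients with pCR == 1}.

    Patients with an empty pCR field are excluded from the denominator.
    """
    counts = defaultdict(lambda: [0, 0])  # split -> [n_pcr, n_known]
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        value = row[pcr_col].strip()
        if value == "":
            continue
        counts[row[split_col]][0] += int(float(value))
        counts[row[split_col]][1] += 1
    return {
        split: n_pcr / n_known
        for split, (n_pcr, n_known) in counts.items()
    }


rates = pcr_rate_by_split(io.StringIO(EXAMPLE_TSV))
```

With a real file, pass an open file handle instead of the `io.StringIO` object. pandas users can load the same table with `pandas.read_csv(path, sep="\t")` and group by the split column.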

Software Recommendations

  • NIfTI images: Compatible with common medical imaging platforms such as 3D Slicer, ITK-SNAP, and FSL. Python users may rely on nibabel for loading and handling imaging volumes.
  • Segmentation masks: Provided as binary 3D NIfTI volumes (1 = tumor, 0 = background), directly loadable in the same software.
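As a minimal sketch of how a volume and its binary mask might be used together in Python, the example below loads a NIfTI file with nibabel and derives a voxel-based tumor volume from a mask. It assumes the header's voxel sizes are in millimetres and that the mask uses 1 for tumor, as stated above; the function names and the toy mask are illustrative, and this is not necessarily how the tumor volumes shipped with the dataset were computed.

```python
import numpy as np


def load_nifti(path):
    """Load a NIfTI file; return (float array, voxel sizes for 3 axes)."""
    import nibabel as nib  # imported lazily so the rest runs without it

    img = nib.load(path)
    data = img.get_fdata()
    voxel_size = img.header.get_zooms()[:3]
    return data, voxel_size


def tumor_volume_mm3(mask, voxel_size_mm):
    """Count tumor voxels (value 1) and multiply by the voxel volume."""
    n_voxels = int(np.count_nonzero(np.asarray(mask) == 1))
    return n_voxels * float(np.prod(voxel_size_mm))


# Toy mask: a 2x2x2 tumor block inside a 4x4x4 volume (hypothetical).
mask = np.zeros((4, 4, 4), dtype=np.uint8)
mask[1:3, 1:3, 1:3] = 1
volume_mm3 = tumor_volume_mm3(mask, voxel_size_mm=(0.5, 0.5, 2.0))
```

For real data, call `load_nifti` on a mask file and pass the returned array and voxel sizes to `tumor_volume_mm3`; the same mask array can index the matching acquisition to restrict analysis to tumor voxels.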

Potential Sources of Error or Variability

  • Inter-cohort heterogeneity: Imaging protocols (field strength, TR/TE, slice thickness) varied across centers, potentially introducing site effects.
  • Only the largest lesion was annotated for multifocal disease.
  • Population bias (predominantly U.S., underrepresentation of minorities).

External Resources

The source code for converting MRI data from DICOM to NIfTI format, along with usage examples, is available in the project’s GitHub repository: https://github.com/naomifridman/BreastDCEDL.

Data Access

Version 1: Updated

Title Data Type Format Access Points Subjects Studies Series Images License Metadata
Images and Segmentations MR, Segmentation NIFTI Download requires IBM-Aspera-Connect plugin 982 8,021 CC BY 4.0
Clinical Data Molecular Test, Measurement, Demographic, Diagnosis, Treatment TSV 982 CC BY 4.0
Medical Data Diagnosis, Follow-Up, Demographic, Treatment, Molecular Test TSV 982 CC BY 4.0
File Sizes Other TSV 982 CC BY 4.0
Data Dictionary Other TSV 982 CC BY 4.0

Collections Used In This Analysis Result

Title Data Type Format Access Points Subjects Studies Series Images License Metadata
Images I-SPY2 Imaging Cohort 1 dataset, (719 I-SPY2 cases plus 266 ACRIN-6698 cases) MR, SEG DICOM 985 3,677 43,356 7,575,549 CC BY 4.0 View

Related Collections
Related Datasets
ISPY2

Citations & Data Usage Policy

Data Citation Required: Users must abide by the TCIA Data Usage Policy and Restrictions. Attribution must include the following citation, including the Digital Object Identifier:

Data Citation

Fridman, N., & Goldstein, A. (2025). Curated, Segmented, and Deep Learning-Optimized I-SPY 2 MRI Dataset for Prediction of pCR, HR, and HER2 Status (BreastDCEDL-ISPY2) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/42WQ-TH78

 

Acknowledgements

The research for which the dataset was created was funded by the Israeli Ministry of Science & Technology (No. 5463).

Related Publications

The authors recommended the following as the best source of additional information about this dataset:

Publication Citation

Fridman, N., Solway, B., Fridman, T., Barnea, I., & Goldstein, A. (2025). BreastDCEDL: A comprehensive breast cancer DCE-MRI dataset and transformer implementation for treatment response prediction. arXiv preprint arXiv:2506.12190. https://arxiv.org/abs/2506.12190

No other publications were recommended by dataset authors.

Research Community Publications

TCIA maintains a list of publications that leveraged this dataset. If you have a manuscript you’d like to add, please contact TCIA’s Helpdesk.

Publication Citation

Fridman, N., Solway, B., Fridman, T., Barnea, I., & Goldstein, A. (2025). BreastDCEDL: A comprehensive breast cancer DCE-MRI dataset and transformer implementation for treatment response prediction. arXiv preprint arXiv:2506.12190. https://arxiv.org/abs/2506.12190