BreastDCEDL-ISPY2 | Curated, Segmented, and Deep Learning-Optimized I-SPY 2 MRI Dataset for Prediction of pCR, HR, and HER2 Status
DOI: 10.7937/42wq-th78 | Data Citation Required | Analysis Result
| Cancer Types | Location | Subjects | Size | Supporting Data |
|---|---|---|---|---|
| Breast Cancer | Breast | 985 | | Clinical |
Abstract
The BreastDCEDL_ISPY2 dataset is a curated, deep learning–ready resource that integrates pretreatment 3D Dynamic Contrast-Enhanced MRI (DCE-MRI) scans from 982 breast cancer patients enrolled in the I-SPY2 TRIAL, sourced from The Cancer Imaging Archive (TCIA). Imaging data has been standardized from raw DICOM to 3D NIfTI volumes, preserving signal integrity and spatial resolution. The dataset includes extensive non-imaging supporting data, such as tumor annotations, DICOM metadata, and demographics. To facilitate reproducible research, fixed benchmark train/validation/test splits are provided, stratified by biomarker subtypes and response outcomes. This dataset enables diverse research applications, including the development of deep learning models for predicting treatment response, radiomics-based analyses, and hormone receptor (HR) and HER2 status classification. It also facilitates benchmarking of advanced architectures such as Vision Transformers, and supports clinical translation efforts in the field of precision oncology.
Introduction
Breast cancer remains one of the most prevalent causes of cancer-related mortality worldwide, and early detection coupled with accurate treatment response monitoring is essential for improving outcomes. Dynamic Contrast-Enhanced MRI (DCE-MRI) is a cornerstone modality for breast cancer imaging, offering unique insights into tumor vascularity, morphology, and treatment response. Despite its clinical importance, progress in computational and deep learning–based analysis of DCE-MRI has been hindered by the lack of large, standardized, and publicly available datasets.
The BreastDCEDL_ISPY2 dataset was created to address this gap by consolidating and harmonizing imaging and clinical data from the I-SPY2 TRIAL. With 982 patients across more than 22 institutions, it represents one of the largest publicly accessible collections of pretreatment DCE-MRI scans for breast cancer. Importantly, the dataset includes standardized 3D NIfTI volumes, tumor annotations, voxel-based tumor volumes, and harmonized clinicopathologic metadata such as hormone receptor status, HER2 status, and pathologic complete response (pCR) outcomes.
What makes BreastDCEDL_ISPY2 unique is its deep learning–ready structure and benchmark design. By providing consistent preprocessing, unified annotations, and predefined training/validation/test splits, the dataset enables reproducible research and direct comparison of computational methods. It lowers the technical barriers to working with heterogeneous MRI data, facilitates the development and validation of advanced machine learning models, including transformer-based architectures, and supports clinically relevant investigations into treatment response prediction and personalized therapy planning.
Methods
The dataset includes extensive non-imaging supporting data.
Subject Inclusion and Exclusion Criteria
The BreastDCEDL_ISPY2 dataset integrates patient data from the I-SPY2 TRIAL (2010–2016), yielding 985 patients with pretreatment DCE-MRI scans. Inclusion required at least three acquisitions (pre-contrast, early post-contrast, late post-contrast). Patients with incomplete imaging or missing essential metadata were excluded (3 cases), leaving 982 patients.
The cohort is clinically diverse: the mean age is ~50 years, the majority of patients are White (~17% are Black, and other groups are underrepresented), and tumor subtypes span HR+/HER2−, HER2+, and triple-negative cancers. pCR status is available for the majority of patients. Treatment histories reflect standardized neoadjuvant chemotherapy protocols. Although the dataset includes multicenter acquisitions (22+ institutions), potential biases include the predominance of U.S.-based populations, the underrepresentation of some ethnic groups, and the trial setting, which may differ from community practice.
Data Acquisition
Data Analysis
Usage Notes
Data Organization and Naming Conventions
All imaging data are provided in standardized 3D NIfTI format, converted from the original DICOM files while preserving full signal integrity. File names follow the structure:
<patientID>_acq<acquisitionNumber>.nii.gz
where patientID is the anonymized subject identifier from the I-SPY2 TRIAL and acquisitionNumber reflects temporal ordering (e.g., pre-contrast, early post-contrast, late post-contrast). The dataset’s structure and organization are illustrated in Figure 1. To facilitate validation and exploration of the MRI data, the project’s Git repository provides a mapping table linking the raw DICOM slices to the 3D NIfTI planes, together with a comprehensive DICOM tag dictionary.
Training, Validation, and Test Groupings
The dataset includes fixed benchmark partitions to facilitate reproducible research. These consist of training, validation, and test sets. Partitioning was stratified by biomarker status (HR, HER2) and pCR outcomes to ensure balanced distributions across subsets. Users are encouraged to adopt these predefined splits when developing predictive models to enable fair comparisons across studies.
Clinical Data Files
Clinical and pathologic metadata are distributed in standardized TSV format. Variables include demographics (age, race, menopausal status), biomarker status (HR, HER2), tumor volume, and pCR outcomes. TSV files can be opened with standard spreadsheet software (e.g., Microsoft Excel, LibreOffice Calc) or accessed programmatically using Python (pandas) or R.
Software Recommendations
Potential Sources of Error or Variability
External Resources
The source code for converting MRI data from DICOM to NIfTI format, along with usage examples, is available in the project’s GitHub repository: https://github.com/naomifridman/BreastDCEDL.
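As a rough illustration of the file-naming convention and TSV metadata described in the Usage Notes, the sketch below groups NIfTI file names by patient, orders them by acquisition number, and looks up the matching clinical row. It uses only the Python standard library. The file names, the TSV content, and the column names (`pid`, `HR`, `HER2`, `pCR`) are hypothetical placeholders, not taken from the dataset; the real column names should be checked against the dataset's data dictionary.

```python
import csv
import io
import re
from collections import defaultdict

# Pattern for the documented structure <patientID>_acq<acquisitionNumber>.nii.gz.
# The patient ID may itself contain underscores, so it is matched greedily.
NAME_RE = re.compile(r"^(?P<patient_id>.+)_acq(?P<acq>\d+)\.nii\.gz$")


def parse_name(filename):
    """Return (patient_id, acquisition_number), or None if the name does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return match.group("patient_id"), int(match.group("acq"))


def group_by_patient(filenames):
    """Map each patient ID to its file names, sorted by acquisition number."""
    grouped = defaultdict(list)
    for name in filenames:
        parsed = parse_name(name)
        if parsed is not None:  # skip files that do not follow the convention
            patient_id, acq = parsed
            grouped[patient_id].append((acq, name))
    return {pid: [n for _, n in sorted(items)] for pid, items in grouped.items()}


def read_clinical(tsv_file, id_column="pid"):
    """Index TSV rows by patient ID. The column name 'pid' is an assumption."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row[id_column]: row for row in reader}


# Hypothetical file names and TSV content, for illustration only.
files = [
    "P001_acq1.nii.gz",
    "P001_acq0.nii.gz",
    "P001_acq2.nii.gz",
    "P002_acq0.nii.gz",
    "notes.txt",
]
tsv_text = "pid\tHR\tHER2\tpCR\nP001\t1\t0\t0\nP002\t0\t1\t1\n"

volumes = group_by_patient(files)
clinical = read_clinical(io.StringIO(tsv_text))

for pid, names in volumes.items():
    print(pid, names, clinical[pid]["pCR"])
```

In practice the TSV would be read from disk rather than from an in-memory string, and `pandas.read_csv(path, sep="\t")` is a common alternative to the `csv` module when the metadata is needed as a table.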
Data Access
Version 1: Updated
| Title | Data Type | Format | Access Points | Subjects | | License | Metadata |
|---|---|---|---|---|---|---|---|
| Images and Segmentations | MR, Segmentation | NIfTI | Download requires IBM-Aspera-Connect plugin | 982 | 8,021 | CC BY 4.0 | — |
| Clinical Data | Molecular Test, Measurement, Demographic, Diagnosis, Treatment | TSV | | 982 | | CC BY 4.0 | — |
| Medical Data | Diagnosis, Follow-Up, Demographic, Treatment, Molecular Test | TSV | | 982 | | CC BY 4.0 | — |
| File Sizes | Other | TSV | | 982 | | CC BY 4.0 | — |
| Data Dictionary | Other | TSV | | 982 | | CC BY 4.0 | — |
Collections Used In This Analysis Result
| Title | Data Type | Format | Access Points | Subjects | Studies | Series | Images | License | Metadata |
|---|---|---|---|---|---|---|---|---|---|
| Images: I-SPY2 Imaging Cohort 1 dataset (719 I-SPY2 cases plus 266 ACRIN-6698 cases) | MR, SEG | DICOM | Requires NBIA Data Retriever | 985 | 3,677 | 43,356 | 7,575,549 | CC BY 4.0 | View |
Citations & Data Usage Policy
Data Citation Required: Users must abide by the TCIA Data Usage Policy and Restrictions. Attribution must include the following citation, including the Digital Object Identifier:
Data Citation
Fridman, N., & Goldstein, A. (2025). Curated, Segmented, and Deep Learning-Optimized I-SPY 2 MRI Dataset for Prediction of pCR, HR, and HER2 Status (BreastDCEDL-ISPY2) (Version 1) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/42WQ-TH78
Acknowledgements
The research for which the dataset was created was funded by the Israeli Ministry of Science & Technology (No. 5463).
Related Publications
Publications by the Dataset Authors
The authors recommended the following as the best source of additional information about this dataset:
Publication Citation
Fridman, N., Solway, B., Fridman, T., Barnea, I., & Goldstein, A. (2025). BreastDCEDL: A comprehensive breast cancer DCE-MRI dataset and transformer implementation for treatment response prediction. arXiv preprint arXiv:2506.12190. https://arxiv.org/abs/2506.12190
Research Community Publications
TCIA maintains a list of publications that leveraged this dataset. If you have a manuscript you’d like to add, please contact TCIA’s Helpdesk.
