Agreement Across 10 Artificial Intelligence Models in Assessing Human Epidermal Growth Factor Receptor 2 (HER2) Expression in Breast Cancer Whole-Slide Images
- Author(s)
- McKelvey, B; Torres-Saavedra, PA; Li, J; Broeckx, G; Deman, F; Ali, S; Andrews, HS; Arslan, S; Azulay, M; Balasubramanian, S; Barrett, JC; Caie, P; Chen, M; Cohen, D; Dasgupta, T; Fahrer, D; Green, G; Gustavson, M; Hersey, S; Hidalgo-Sastre, A; Jiwani, S; Joseph, E; Jung, W; Kulig, K; Kushnarev, V; Lennerz, JK; Li, X; Lodge, M; Mancuso, J; Montalto, M; Mukhopadhyay, S; Ntelemis, F; Oberley, M; Pandya, P; Puig, O; Richardson, ET; Sarachakov, A; Stewart, M; McShane, LM; Salgado, R; Allen, J
- Journal Title
- Modern Pathology
- Publication Type
- Research article
- Abstract
- Historically, eligibility for human epidermal growth factor receptor 2 (HER2)-targeted therapies was limited to HER2-positive tumors (IHC 3+ or ISH-amplified), but recent advances in antibody-drug conjugates (ADCs) have expanded these criteria to include HER2-low and HER2-ultralow expression. This evolving therapeutic landscape underscores the need for precise and reproducible HER2 assessment. Digital and computational pathology tools may help address these needs, but their measurement variability must be evaluated to inform research and clinical use. We evaluated HER2 scoring variability across 10 independently developed computational pathology artificial intelligence (AI) models applied to 1,124 whole-slide images from 733 patients with breast cancer. Analyses included ASCO/CAP categorical scores (0, 1+, 2+, 3+), H-scores, tumor cell staining percentages, and counts of total and stained invasive carcinoma cells. Agreement among models and three pathologists was assessed using pairwise overall percent agreement (OPA), Cohen's kappa, and hierarchical clustering. Median pairwise OPA among models for categorical HER2 scores was 65.1% (kappa 0.51). Agreement was highest for HER2 3+ versus not 3+ (OPA 97.3%, kappa 0.86) and lowest for HER2-low cases, reflecting existing measurement challenges. For HER2 0 (negative) versus not 0 (positive) scoring, the average negative agreement (ANA) was 65.3%, compared with an average positive agreement (APA) of 91.3%, indicating greater agreement on non-HER2 0 scores. H-score and cell count analyses indicated that scoring differences were related more to staining interpretation than to tumor cell detection. Pathologists showed numerically higher concordance than models, but interobserver variability persisted. In exploratory analyses, sample type, staining artifacts, and heterogeneous HER2 expression appeared to be associated with discrepancies. AI-based HER2 scoring demonstrated high agreement in identifying HER2 3+ cases. Variability was most pronounced in borderline HER2 categories, particularly HER2-low, underscoring the need for continued tool refinement in handling low-intensity staining. Standardized training datasets, validation frameworks, and regulatory alignment are important to improve reproducibility. Developing reference standards and benchmarking datasets is critical to evaluate performance, support regulatory decision-making, and ensure real-world applicability.
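The pairwise agreement statistics named in the abstract (overall percent agreement and Cohen's kappa) can be illustrated with a minimal sketch. This is not the study's analysis code; the model names and HER2 scores below are hypothetical, and the functions implement only the standard two-rater definitions of OPA and chance-corrected kappa.

```python
from collections import Counter
from itertools import combinations

def overall_percent_agreement(a, b):
    """Fraction of cases scored identically by two raters (OPA)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    po = overall_percent_agreement(a, b)          # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Expected agreement under independence of the two raters' marginals
    pe = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Hypothetical categorical HER2 scores from three models on six cases
scores = {
    "model_A": ["0", "1+", "2+", "3+", "1+", "0"],
    "model_B": ["0", "1+", "1+", "3+", "2+", "0"],
    "model_C": ["1+", "1+", "2+", "3+", "1+", "0"],
}
for m1, m2 in combinations(scores, 2):
    opa = overall_percent_agreement(scores[m1], scores[m2])
    kappa = cohens_kappa(scores[m1], scores[m2])
    print(f"{m1} vs {m2}: OPA={opa:.2f}, kappa={kappa:.2f}")
```

The same pairwise loop extends naturally to a full model-by-model agreement matrix, which is the kind of input hierarchical clustering (as mentioned in the abstract) would operate on.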
- Keywords
- HER2 scoring; artificial intelligence; breast cancer; computational pathology; whole slide imaging
- Department(s)
- Laboratory Research
- Publisher's Version
- https://doi.org/10.1016/j.modpat.2025.100944
- Terms of Use/Rights Notice
- Refer to copyright notice on published article.
Creation Date: 2026-01-23 11:58:21
Last Modified: 2026-01-23 12:00:54