AMIA NLP Working Group · Dataset Standardization Initiative

Clinical & Biomedical NLP Datasets
Landscape Overview

60 Datasets Catalogued
33 Metadata Fields
10 Domain Categories
2011–2025 Publication Span
PART I
Corpus Landscape
Dataset Domain Categories
Classified from free-text domain field (n = 60)
Access & Use Conditions
Proportion by access tier (n = 60)
Hosting Repository Distribution
Where datasets are hosted or mirrored (n = 60)
Dataset Release / Last Update by Year
Notable growth post-2022; 2025 reflects active curation
PART II
Task & Modality Structure
Task Composition within Each Data Modality
100% stacked bar showing primary task breakdown per data modality. Radiology skews toward NER + RE; EHR text covers the broadest task range.
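The normalization behind a 100% stacked bar can be sketched as follows. This is a minimal illustration, assuming each dataset contributes one (modality, primary task) pair; the function name `task_shares_by_modality` and the sample pairs are hypothetical and do not come from the catalog.

```python
from collections import Counter, defaultdict


def task_shares_by_modality(records):
    """Return {modality: {task: percent}} where each modality sums to 100."""
    counts = defaultdict(Counter)
    for modality, task in records:
        counts[modality][task] += 1

    shares = {}
    for modality, task_counts in counts.items():
        total = sum(task_counts.values())
        # Each segment height is that task's share of the modality's datasets.
        shares[modality] = {
            task: 100.0 * n / total for task, n in task_counts.items()
        }
    return shares


# Hypothetical toy input, not the catalog's actual records.
sample = [
    ("Radiology", "NER"),
    ("Radiology", "RE"),
    ("EHR text", "NER"),
    ("EHR text", "NER"),
    ("EHR text", "RE"),
    ("EHR text", "QA"),
]
shares = task_shares_by_modality(sample)
```

Because every bar is rescaled to 100%, the chart compares task mix across modalities, not the number of datasets per modality.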
Primary NLP Task Distribution
Task type extracted from annotation field; multi-task datasets classified as "Multi-task / Other" (n = 60)
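One way to picture the classification rule is the sketch below. It assumes a keyword match over the free-text annotation field; the patterns, the task labels other than "Multi-task / Other", and the choice to send unmatched entries to the same bucket are all illustrative assumptions, not the working group's actual procedure.

```python
import re

# Hypothetical keyword patterns; the real extraction rules are not given.
TASK_PATTERNS = {
    "NER": re.compile(r"\b(ner|named entity|entity recognition)\b"),
    "RE": re.compile(r"\b(relation extraction|relations?)\b"),
    "Classification": re.compile(r"\b(classification|document labels?)\b"),
}


def classify_task(annotation):
    """Map a free-text annotation description to one primary task label."""
    text = (annotation or "").lower()
    matched = [task for task, pat in TASK_PATTERNS.items() if pat.search(text)]
    # Exactly one matching task is treated as the primary task. Several
    # matches (multi-task) or none (assumed "other") share one bucket.
    return matched[0] if len(matched) == 1 else "Multi-task / Other"
```

For example, `classify_task("NER and relation extraction")` falls into "Multi-task / Other" because two task patterns match.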
PART III
Language & Coverage
Language Coverage
English dominates; only 8 datasets include non-English text, exposing a multilingual gap in clinical NLP (n = 60)
LLM Training Data Leakage Risk
Classification based on known or suspected use in LLM pre-training corpora; affects benchmark validity (n = 60)
PART III
Task · Modality · Granularity Flow
Task → Modality → Annotation Granularity
Flow diagram showing how dataset NLP tasks (left) map to data modalities (center) and annotation granularity levels (right). Width of each band is proportional to number of datasets (n = 60).
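The band widths in such a flow diagram reduce to link counts between adjacent columns. The sketch below assumes each dataset record carries one task, one modality, and one granularity value; the field names and sample records are hypothetical.

```python
from collections import Counter


def flow_links(records):
    """Count datasets on each task->modality and modality->granularity link."""
    task_to_modality = Counter((r["task"], r["modality"]) for r in records)
    modality_to_gran = Counter((r["modality"], r["granularity"]) for r in records)
    return task_to_modality, modality_to_gran


# Hypothetical toy records, not taken from the catalog.
sample = [
    {"task": "NER", "modality": "Radiology", "granularity": "Span"},
    {"task": "NER", "modality": "EHR text", "granularity": "Span"},
    {"task": "RE", "modality": "Radiology", "granularity": "Span"},
    {"task": "QA", "modality": "EHR text", "granularity": "Document"},
]
left, right = flow_links(sample)
```

Each counter value is the width of one band, so the widths on either side of the diagram both sum to the number of datasets.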
PART IV
FAIR Metadata Compliance
Dataset × Metadata Field Compliance Matrix
60 datasets (rows, sorted by overall FAIR score ↓) × 16 metadata fields (columns, grouped by FAIR principle). Each cell = heuristic quality score (0–1): full credit for quantified or standardized values, partial for vague entries, zero for missing.
Complete (1.0)
Good (0.75)
Partial (0.5)
Weak (0.25)
Missing (0)
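The cell-scoring heuristic described for the matrix might look roughly like the sketch below. It is an assumption-laden illustration: the missing-value tokens, the controlled vocabulary, the "contains a digit" test for quantified values, and the use of a simple mean as the overall FAIR score are all hypothetical. The source does not say how the intermediate 0.75 and 0.25 tiers are assigned, so this sketch emits only 0, 0.5, and 1.0.

```python
import re

# Hypothetical placeholders treated as missing.
MISSING_TOKENS = {"", "n/a", "na", "unknown", "none", "not reported"}

# Hypothetical controlled vocabulary standing in for "standardized" values.
STANDARD_VALUES = {"cc-by-4.0", "physionet credentialed", "en"}


def score_field(value):
    """Score one metadata cell: 0 missing, 0.5 vague, 1.0 quantified/standard."""
    if value is None:
        return 0.0
    text = str(value).strip().lower()
    if text in MISSING_TOKENS:
        return 0.0
    if text in STANDARD_VALUES or re.search(r"\d", text):
        return 1.0
    return 0.5


def overall_score(record, fields):
    """Assumed aggregate: unweighted mean of the per-field scores."""
    return sum(score_field(record.get(f)) for f in fields) / len(fields)


def rank_datasets(records, fields):
    """Sort rows by overall score, highest first, as in the matrix."""
    return sorted(records, key=lambda r: overall_score(r, fields), reverse=True)
```

A record such as `{"size": "large", "license": None}` scored over `["size", "license"]` would average 0.25 under these assumptions: one vague entry and one missing entry.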
AMIA NLP Working Group · Clinical & Biomedical NLP Dataset Catalog · FAIR-Aligned Metadata Framework · 2025