PART I
Corpus Landscape
Fig. 1 · Domain
Dataset Domain Categories
Classified from free-text domain field (n = 60)
Fig. 2 · Access
Access & Use Conditions
Proportion by access tier (n = 60)
Fig. 3 · Repository
Hosting Repository Distribution
Where datasets are hosted or mirrored (n = 60)
Fig. 4 · Timeline
Dataset Release / Last Update by Year
Notable growth after 2022; 2025 counts reflect active curation
PART II
Task & Modality Structure
Fig. 5 · Task × Modality
Task Composition within Each Data Modality
100% stacked bar showing primary task breakdown per data modality. Radiology skews toward NER + RE; EHR text covers the broadest task range.
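A 100% stacked bar normalizes task counts within each modality so that every bar sums to 1. The sketch below shows one way that normalization could be computed; the function name, the (modality, task) record layout, and the example rows are assumptions made for illustration and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def task_shares_by_modality(records):
    """Return {modality: {task: share}}; shares within a modality sum to 1."""
    counts = defaultdict(Counter)
    for modality, task in records:
        counts[modality][task] += 1

    shares = {}
    for modality, task_counts in counts.items():
        total = sum(task_counts.values())
        shares[modality] = {task: n / total for task, n in task_counts.items()}
    return shares


# Hypothetical rows, for illustration only (not the real 60 datasets).
example = [
    ("Radiology", "NER"),
    ("Radiology", "NER"),
    ("Radiology", "RE"),
    ("EHR text", "Classification"),
    ("EHR text", "QA"),
]
shares = task_shares_by_modality(example)
```

Normalizing per modality makes bars comparable even when modalities contain very different numbers of datasets, at the cost of hiding the absolute counts.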
Fig. 6 · NLP Tasks
Primary NLP Task Distribution
Task type extracted from annotation field; multi-task datasets classified as "Multi-task / Other" (n = 60)
PART III
Language & Coverage
Fig. 7 · Language
Language Coverage
English dominates; only 8 datasets include non-English text, exposing a multilingual gap in clinical NLP (n = 60)
Fig. 8 · LLM Leakage
LLM Training Data Leakage Risk
Classification based on known or suspected use in LLM pre-training corpora; affects benchmark validity (n = 60)
PART IV
Task · Modality · Granularity Flow
Fig. 9 · Sankey
Task → Modality → Annotation Granularity
Flow diagram showing how dataset NLP tasks (left) map to data modalities (center) and annotation granularity levels (right). Width of each band is proportional to number of datasets (n = 60).
PART V
FAIR Metadata Compliance
Fig. 10 · FAIR Heatmap
Dataset × Metadata Field Compliance Matrix
60 datasets (rows, sorted by overall FAIR score ↓) × 16 metadata fields (columns, grouped by FAIR principle). Each cell = heuristic quality score (0–1): full credit for quantified or standardized values, partial for vague entries, zero for missing.
Complete (1.0)
Good (0.75)
Partial (0.5)
Weak (0.25)
Missing (0)
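As a rough illustration of how the matrix rows could be ordered and labeled, the sketch below maps a cell score to the nearest legend level and sorts datasets by an overall score. The aggregation rule is an assumption: the caption does not say how the overall FAIR score is derived, so an unweighted mean over fields is used here, with absent fields counted as 0. The field names and dataset names are hypothetical.

```python
# Legend levels as listed above.
LEGEND = [
    (1.0, "Complete"),
    (0.75, "Good"),
    (0.5, "Partial"),
    (0.25, "Weak"),
    (0.0, "Missing"),
]


def legend_label(score):
    """Map a cell score in [0, 1] to the nearest legend level."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    _, label = min(LEGEND, key=lambda pair: abs(pair[0] - score))
    return label


def overall_score(field_scores, fields):
    """Assumed aggregation: unweighted mean; absent fields count as 0."""
    return sum(field_scores.get(f, 0.0) for f in fields) / len(fields)


def sort_rows(datasets, fields):
    """Dataset names ordered by overall score, highest first."""
    return sorted(
        datasets,
        key=lambda name: overall_score(datasets[name], fields),
        reverse=True,
    )


# Hypothetical fields and datasets, for illustration only.
fields = ["license", "language", "size"]
datasets = {
    "dataset_a": {"license": 1.0, "language": 0.5},
    "dataset_b": {"license": 1.0, "language": 1.0, "size": 0.75},
}
order = sort_rows(datasets, fields)
```

A weighted mean or a per-principle average would be equally plausible aggregations; the sort order can change depending on which one is used.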