Machine Learning–Ready Dataset

Curated Repository Of Well-resolved Non-covalent Interactions

C · R · O · W · N

153,005 protein–ligand complexes, rigorously preprocessed and energy-minimized. Bridging the gap between structural quality and chemical diversity for next-generation ML models.

153k curated complexes
12,352 unique proteins
26,746 unique ligands
3,209 species represented

Motivation

Existing databases force a trade-off between quality and scale

Machine learning models for protein–ligand interactions need both structural reliability and broad chemical coverage. Current resources offer one or the other — never both.

01

Curated but narrow

Databases like PDBBind and HiQBind offer carefully checked structures but cover only a fraction of the PDB. With a few tens of thousands of entries each, they under-represent the true diversity of known protein–ligand interactions, limiting the generalization of models trained on them.

02

Broad but noisy

Large-scale resources like PLInder catalogue nearly 650,000 systems across the entire PDB, but apply minimal quality filtering. Unresolved atoms, steric clashes, incorrect bond assignments, and missing quality annotations introduce systematic noise into training data.

03

Affinity-centric bias

Many benchmarks are organized around experimentally measured binding affinities, which exist for only a subset of structures and are measured under heterogeneous assay conditions. Datasets built this way exclude thousands of informative structures for which no affinity data exist.

Solution

CROWN bridges this gap

A fully automated preprocessing pipeline retains PLInder's broad coverage while enforcing rigorous structural standards — including constrained energy minimization, a step absent from all other available datasets.

Pipeline

From 649,915 systems to 153,005 curated complexes

CROWN applies five quality filters interleaved with two structural processing stages, each targeting a specific aspect of structural reliability — from crystallographic resolution to post-minimization stability.

[Figure: CROWN preprocessing pipeline attrition funnel]

Preprocessing pipeline for CROWN, illustrated as a staged attrition funnel. Starting from 649,915 protein–ligand interaction systems sourced from PLInder, the pipeline applies five quality filters (beige) addressing structure, ligand, pocket, interaction, and stability quality, and two structural processing stages (blue) performing automated corrections and constrained energy minimization at pH 7.4.

Key Features

What distinguishes CROWN

Every complex has been quality-filtered, structurally repaired, protonated at physiological pH (7.4), and energy-minimized with custom restraints, producing a structurally uniform collection ready for machine learning.

QUALITY

Verified electron density

Every entry has validated electron-density fit scores: a real-space R value (RSR) below 0.3 and a real-space correlation coefficient (RSCC) above 0.8. No structure lacks these quality annotations, a blind spot present in all other datasets, including PDBBind and HiQBind.
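The density criterion can be illustrated with a minimal Python sketch. Only the two thresholds come from the text; the `DensityScores` record and the function name are hypothetical, and the sketch assumes an entry with a missing score is rejected rather than passed through.

```python
from dataclasses import dataclass
from typing import Optional

RSR_MAX = 0.3   # from the text: RSR must be below 0.3
RSCC_MIN = 0.8  # from the text: RSCC must be above 0.8


@dataclass(frozen=True)
class DensityScores:
    """Hypothetical record holding the density-fit scores of one entry."""
    rsr: Optional[float]
    rscc: Optional[float]


def passes_density_filter(scores: DensityScores) -> bool:
    """Return True only if both scores are present and within the thresholds."""
    if scores.rsr is None or scores.rscc is None:
        # Assumption: a missing annotation counts as a failure.
        return False
    return scores.rsr < RSR_MAX and scores.rscc > RSCC_MIN


if __name__ == "__main__":
    print(passes_density_filter(DensityScores(rsr=0.12, rscc=0.95)))
    print(passes_density_filter(DensityScores(rsr=None, rscc=0.95)))
```

Treating a missing score as a failure is what prevents unannotated structures from slipping into the dataset.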

QUALITY

Drug-like, non-covalent only

Ions, crystallization artifacts, and covalently bound ligands are removed. Retained ligands have 10–100 heavy atoms and more than 10 protein contacts, so only meaningfully engaged binding poses remain.
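As a rough illustration, the ligand criteria above could be expressed as a single predicate. This is a sketch, not the CROWN pipeline code: the `LigandRecord` fields are hypothetical, and the 10–100 heavy-atom range is assumed to be inclusive at both ends.

```python
from dataclasses import dataclass

MIN_HEAVY_ATOMS = 10   # from the text (assumed inclusive)
MAX_HEAVY_ATOMS = 100  # from the text (assumed inclusive)
MIN_CONTACTS = 10      # from the text: strictly more than 10 contacts


@dataclass(frozen=True)
class LigandRecord:
    """Hypothetical per-ligand annotations; field names are illustrative."""
    heavy_atoms: int
    protein_contacts: int
    is_ion: bool = False
    is_artifact: bool = False
    is_covalent: bool = False


def is_retained(ligand: LigandRecord) -> bool:
    """Apply the ligand-level criteria described in the text."""
    if ligand.is_ion or ligand.is_artifact or ligand.is_covalent:
        return False
    size_ok = MIN_HEAVY_ATOMS <= ligand.heavy_atoms <= MAX_HEAVY_ATOMS
    return size_ok and ligand.protein_contacts > MIN_CONTACTS


if __name__ == "__main__":
    print(is_retained(LigandRecord(heavy_atoms=28, protein_contacts=35)))
    print(is_retained(LigandRecord(heavy_atoms=28, protein_contacts=35, is_covalent=True)))
```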

PROCESSING

Automated structural repair

Missing atoms rebuilt, alternate conformers resolved, steric clashes removed, and broken bonds repaired — all with a 99.99% success rate across nearly 190,000 systems.

PROCESSING

Constrained energy minimization

A custom flat-bottomed tethering potential allows atoms to relax within crystallographic uncertainty (0.25 Å) while preserving experimental geometry. This step is unique to CROWN among all available datasets.
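To make the idea of a flat-bottomed tether concrete, here is a minimal Python sketch of one common form: zero energy inside the tolerance radius and a harmonic penalty beyond it. Only the 0.25 Å width comes from the text. The harmonic wall, the force constant `k`, and the function name are assumptions for illustration; the actual functional form used by CROWN is not specified here.

```python
import math
from typing import Sequence

FLAT_WIDTH = 0.25  # Å, tolerance radius quoted in the text


def flat_bottom_energy(
    position: Sequence[float],
    reference: Sequence[float],
    k: float = 1000.0,          # hypothetical force constant, arbitrary units
    width: float = FLAT_WIDTH,
) -> float:
    """Restraint energy for one atom tethered to its crystallographic position.

    The energy is zero while the atom stays within `width` of the reference
    and grows quadratically with the excess displacement beyond it
    (assumed harmonic wall).
    """
    displacement = math.dist(position, reference)
    excess = max(0.0, displacement - width)
    return 0.5 * k * excess ** 2


if __name__ == "__main__":
    ref = (0.0, 0.0, 0.0)
    print(flat_bottom_energy((0.20, 0.0, 0.0), ref))  # inside the flat region
    print(flat_bottom_energy((0.35, 0.0, 0.0), ref))  # outside, penalized
```

With this shape, the minimizer is free to relieve strain through small moves within the tolerance, while larger departures from the experimental coordinates are increasingly penalized.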

DESIGN

Geometry-centric philosophy

CROWN treats the 3D interaction geometry as the primary source of information — not binding affinities. This avoids affinity-label bias and includes thousands of structures with no reported Kd or IC50.

DESIGN

Broad chemical coverage

Roughly four times as many unique proteins and species as PDBBind or HiQBind, with 26,746 unique ligand types and 13,523 Murcko scaffolds, including PROTACs, macrocycles, and larger drug-like molecules.

Comparison

CROWN in the landscape of protein–ligand databases

A side-by-side comparison of dataset scope, ligand diversity, and structural quality across five protein–ligand resources: PDBBind, HiQBind, BioLiP2, PLInder, and CROWN.

Property                          PDBBind  HiQBind  BioLiP2  PLInder    CROWN
Dataset scope
  Total entries                    19,449   31,573   86,458  649,915  153,005
  Unique PDB-CCD pairs             17,758   17,247   22,720  201,836   64,502
  Unique PDB IDs                   17,758   17,088   17,701  111,867   55,208
  Unique UniProt IDs                3,354    2,642   14,933   22,243   12,352
  Unique CATH IDs                     583      443    1,017    1,565      976
  Unique species                      861      715    3,466    4,882    3,209
  Affinity annotations
Ligand diversity
  Unique CCD IDs                   13,956   12,428    6,519   47,300   26,746
  Unique Murcko scaffolds           8,251    7,660    3,213   22,743   13,523
  Oligo ligands                     2,632      676    2,711   36,379   10,572
  Ion ligands                           0        0    1,805   22,728        0
  Covalent ligands                    870       24    1,708   32,276        0
  Artifact ligands                    153       45       23   18,626        0
Structure issues
  Missing bonds                       990      114    2,104   37,422        0
  Steric overlaps                     323      313      656    5,670        0
  Unresolved ligand atoms             463        1    1,564   18,815        0
  Unresolved pocket atoms           1,102      955    1,579   12,457        0
  Non-standard pocket residues        307      245      972    4,473        0
  Structure corrections
  Protonation
  Energy minimization

Comparison of dataset scope, ligand diversity, and structural quality filtering across five protein–ligand interaction databases. CROWN uniquely combines zero structure issues with both protonation and energy minimization. Values for CROWN are reported for the corrected structures.
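The headline ratios can be checked directly from the table. The short Python sketch below uses only the Total entries, Unique UniProt IDs, and Unique species rows; the variable and function names are illustrative.

```python
# Values copied from the comparison table (Total entries, Unique UniProt IDs,
# Unique species). Names below are illustrative, not part of any CROWN tooling.
PLINDER_TOTAL = 649_915
CROWN_TOTAL = 153_005

unique_uniprot = {"PDBBind": 3_354, "HiQBind": 2_642, "CROWN": 12_352}
unique_species = {"PDBBind": 861, "HiQBind": 715, "CROWN": 3_209}


def fold_increase(counts: dict, reference: str, target: str = "CROWN") -> float:
    """Ratio of the target dataset's count to the reference dataset's count."""
    return counts[target] / counts[reference]


# Fraction of PLInder systems that survive the CROWN pipeline.
retention = CROWN_TOTAL / PLINDER_TOTAL

if __name__ == "__main__":
    print(f"Retained from PLInder: {retention:.1%}")
    for ref in ("PDBBind", "HiQBind"):
        proteins = fold_increase(unique_uniprot, ref)
        species = fold_increase(unique_species, ref)
        print(f"vs {ref}: {proteins:.1f}x proteins, {species:.1f}x species")
```

Under these numbers, a little under a quarter of the PLInder systems are retained, and the protein and species ratios against PDBBind and HiQBind fall between roughly 3.7 and 4.7.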

Access

Freely available under CC BY 4.0

Browse, search, and download individual entries or the complete dataset. The full preprocessing pipeline is open-source.