TY - JOUR
T1 - MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery
AU - Siebenmorgen, Till
AU - Menezes, Filipe
AU - Benassou, Sabrina
AU - Merdivan, Erinc
AU - Didi, Kieran
AU - Mourão, André Santos Dias
AU - Kitel, Radosław
AU - Liò, Pietro
AU - Kesselheim, Stefan
AU - Piraud, Marie
AU - Theis, Fabian J.
AU - Sattler, Michael
AU - Popowicz, Grzegorz M.
N1 - Publisher Copyright: © The Author(s) 2024.
PY - 2024/5
Y1 - 2024/5
N2 - Large language models have greatly enhanced our ability to understand biology and chemistry, yet robust methods for structure-based drug discovery, quantum chemistry and structural biology are still sparse. Precise biomolecule–ligand interaction datasets are urgently needed for large language models. To address this, we present MISATO, a dataset that combines quantum mechanical properties of small molecules and associated molecular dynamics simulations of ~20,000 experimental protein–ligand complexes with extensive validation of experimental data. Starting from the existing experimental structures, semi-empirical quantum mechanics was used to systematically refine these structures. A large collection of molecular dynamics traces of protein–ligand complexes in explicit water is included, accumulating over 170 μs. We give examples of machine learning (ML) baseline models proving an improvement of accuracy by employing our data. An easy entry point for ML experts is provided to enable the next generation of drug discovery artificial intelligence models.
UR - http://www.scopus.com/inward/record.url?scp=85192858770&partnerID=8YFLogxK
U2 - 10.1038/s43588-024-00627-2
DO - 10.1038/s43588-024-00627-2
M3 - Article
AN - SCOPUS:85192858770
SN - 2662-8457
VL - 4
SP - 367
EP - 378
JO - Nature Computational Science
JF - Nature Computational Science
IS - 5
ER -