Status |
Public on Jul 17, 2024 |
Title |
Interpretably deep learning amyloid nucleation by massive experimental quantification of random sequences |
Organism |
Saccharomyces cerevisiae |
Experiment type |
Other
|
Summary |
More than 50 human diseases are characterized by the deposition of specific protein aggregates in the form of insoluble amyloid fibrils. However, only a very small number of proteins are known to form amyloids with high propensity, limiting our ability to understand, predict and engineer amyloid aggregation from sequence. Here we use a massively parallel assay to quantify the amyloid nucleation propensity of >100,000 random 20 amino acid sequences. Approximately 5% of assayed random sequences nucleate the formation of aggregates, generating a very large and diverse training dataset from which to train models to predict amyloid nucleation. We use this dataset to train CANYA, a convolution-attention hybrid neural network that predicts the propensity of any primary sequence to form amyloids. CANYA outperforms previous predictors of protein aggregation on additional random sequences and out-of-sample datasets including human disease-causing amyloids, with very stable performance across diverse prediction tasks. We adapt and extend recent advances in interpretability of genomic neural networks to elucidate CANYA’s decision-making process and learned grammar and to provide mechanistic insights into amyloid formation. Our results demonstrate the power of massive experimental random sequence-space exploration and provide an interpretable and robust neural network model for understanding, predicting and designing amyloid-forming proteins.
|
|
|
Overall design |
Systematic measurement of the nucleation of random 20-mer peptides
|
|
|
Contributor(s) |
Lehner B, Bolognesi B, Thompson M, Martín M |
Citation(s) |
39071305 |
|
Submission date |
May 23, 2024 |
Last update date |
Aug 16, 2024 |
Contact name |
Mariano Martín |
E-mail(s) |
mmartin@ibecbarcelona.eu
|
Organization name |
IBEC
|
Street address |
c/ Baldiri Reixac 10-12
|
City |
Barcelona |
ZIP/Postal code |
08028 |
Country |
Spain |
|
|
Platforms (1) |
GPL19756 |
Illumina NextSeq 500 (Saccharomyces cerevisiae) |
|
Samples (22)
|
|
Relations |
BioProject |
PRJNA1115911 |
Supplementary file | Size | Download | File type/resource |
GSE268261_MT_MM_TSM_BB_BL_processed_data.xlsx | 6.9 Mb | (ftp)(http) | XLSX |
SRA Run Selector |
Raw data are available in SRA |