Status |
Public on Jun 30, 2023 |
Title |
Twins: A deep learning method for replicate-based conformation contact map analysis. |
Sample organism |
Mus musculus |
Experiment type |
Third-party reanalysis; Other
|
Summary |
The organisation of the genome in nuclear space is an important frontier of biology. Chromosome conformation capture methods such as Hi-C and Micro-C produce genome-wide chromatin contact maps that provide rich data containing quantitative and qualitative information about genome architecture. Most conventional approaches to genome-wide chromosome conformation capture data are limited to the analysis of pre-defined features, and may therefore miss important biological information. One constraint is that biologically important features can be masked by high levels of technical noise in the data. Here we introduce Twins, a replicate-based method for deep learning from chromatin conformation contact maps. Using a Siamese network configuration, Twins learns to distinguish technical noise from biological variation and outperforms image similarity metrics across a range of biological systems. Features extracted by Twins from Hi-C maps after perturbation of cohesin and CTCF reflect the distinct biological functions of cohesin and CTCF in the formation of domains and boundaries, respectively. Twins distance metrics are biologically meaningful, as they mirror the density of cohesin and CTCF binding. Taken together, these properties make Twins a powerful tool for the exploration of chromosome conformation capture data, such as Hi-C, capture Hi-C, and Micro-C.
|
|
|
Overall design |
This submission contains a reprocessing of four chromatin conformation datasets, together with the fully trained convolutional Siamese networks trained on images extracted from these data. Hi-C replicate files from hepatocytes processed in GSE93431 and GSE122157, normalised using ICE, are compared across wildtype (WT), control (TAM), Nipbl-/- (NIPBL) and Ncaph2-/- mice in two Siamese convolutional neural networks (SCNNs), both at 10kb resolution in 2.56Mb windows: one trained only on the WT, TAM and Nipbl-/- conditions (liver_NIPBLKO) and the other also including Ncaph2-/- (Liver_all_10kb). Processed Hi-C files are also extracted as individual replicates from GSE94452 for mouse neural progenitor cells depleted of CTCF (aux) and in control (ctl) conditions, and an SCNN is trained to compare KR-normalised windows along the diagonal at 10kb resolution in 2.56Mb windows (NPC_CTCFdegron_10kb). Mouse CD69-CD4+CD8+ (DP) T-cells extracted from processed Hi-C files at GSE199059 were compared to CD69-CD4+ (CD4 SP) T-cells from GSE222211 at various resolutions (2kb, 5kb, 10kb, 25kb) with various normalisations (VC, ICE, VC_SQRT, KR). Trained SCNNs were produced comparing high (R3 and R4) and low (R1 and R2) resolution DP T-cells, and across T-cell differentiation (DP R1 and R2 to CD4 SP R1 and R2), for each normalisation and resolution.

Pre-processed Hi-C files from each of the respective studies were transformed into PyTorch datasets compatible with the Twins code base (ea409/twins_hic) and a Siamese network training protocol. Samples were split by chromosome into validation (chr18), test (chr2) and train (all other chromosomes) sets. For each Hi-C map, overlapping windows were extracted along the diagonal of the normalised map, each covering 256 bins at resolution R (a genomic span of 256xR, e.g. 2.56Mb at 10kb). In each window, NaN values are set to 0 and the window is re-normalised by dividing by its maximum value. Windows in which positions containing no information (i.e. 0 values) exceed 10% of the window are filtered out to avoid training on empty data.

The remaining windows are then fed into Siamese convolutional neural networks with LeNet architectures, adapted to use GELU activations and to fit 256x256 images, and trained using contrastive loss with an early-stopping criterion based on a 10% increase in validation loss from its minimum value.

.mlhic files are processed datasets containing mapped Hi-C data in a format compatible with the Twins software. Each file contains images taken along the diagonal of the Hi-C map and processed into a format suitable for machine learning. .ckpt files are trained models, produced using the Siamese network method described in the Twins publication and in the Twins code.
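
The windowing, filtering, chromosome split and early-stopping rule described above can be sketched roughly as follows. This is a minimal illustration in plain Python and is not taken from the Twins code base: the function names (extract_windows, split_for, should_stop) are hypothetical, the stride between overlapping windows is an assumption (the record does not state it), and reading the early-stop rule as "current validation loss more than 10% above the best loss so far" is one plausible interpretation. A real pipeline would more likely operate on NumPy or PyTorch arrays.

```python
import math


def extract_windows(matrix, size=256, step=None, max_empty_frac=0.10):
    """Return (start_bin, window) pairs taken along the diagonal of a square
    contact matrix given as a list of lists.

    `step` is the stride between windows; half a window is an ASSUMED default,
    since the record only says that windows overlap.
    """
    n = len(matrix)
    if step is None:
        step = max(1, size // 2)
    windows = []
    for start in range(0, n - size + 1, step):
        rows = range(start, start + size)
        # Square sub-matrix on the diagonal, with NaN values set to 0.
        win = [
            [0.0 if math.isnan(matrix[i][j]) else float(matrix[i][j]) for j in rows]
            for i in rows
        ]
        # Drop windows in which more than 10% of positions carry no information.
        empty = sum(1 for row in win for v in row if v == 0.0)
        if empty > max_empty_frac * size * size:
            continue
        peak = max(max(row) for row in win)
        if peak <= 0:
            continue  # nothing to normalise by
        # Re-normalise by the window maximum so values lie in [0, 1].
        windows.append((start, [[v / peak for v in row] for row in win]))
    return windows


def split_for(chrom):
    """Chromosome-level split stated in the record."""
    if chrom == "chr18":
        return "validation"
    if chrom == "chr2":
        return "test"
    return "train"


def should_stop(val_losses, tolerance=0.10):
    """ASSUMED reading of the early-stop rule: stop once the latest validation
    loss exceeds the minimum seen so far by more than `tolerance` (10%)."""
    best = min(val_losses)
    return val_losses[-1] > best * (1.0 + tolerance)
```

Filtering before dividing by the maximum gives the same result as the order stated in the record, because zero entries are unchanged by the division, and it avoids dividing an all-zero window by zero.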
|
|
|
Contributor(s) |
Al-jibury E, King JW, Guo Y, Fisher AG, Merkenschlager M, Rueckert D |
Citation(s) |
PMID: 37591842 |
|
Submission date |
May 24, 2023 |
Last update date |
Sep 14, 2023 |
Contact name |
Ediem Al-jibury |
E-mail(s) |
ealjibur@ic.ac.uk
|
Organization name |
MRC LMS
|
Street address |
Hammersmith Hospital
|
City |
London |
ZIP/Postal code |
W12 0HS |
Country |
United Kingdom |
|
|
Relations |
Reanalysis of |
GSM4386027 |
Reanalysis of |
GSM4386026 |
Reanalysis of |
GSM4386029 |
Reanalysis of |
GSM4386028 |
Reanalysis of |
GSM2453279 |
Reanalysis of |
GSM2453280 |
Reanalysis of |
GSM2453281 |
Reanalysis of |
GSM2453282 |
Reanalysis of |
GSM2453283 |
Reanalysis of |
GSM2453284 |
Reanalysis of |
GSM3457018 |
Reanalysis of |
GSM3457019 |
Reanalysis of |
GSM5963426 |
Reanalysis of |
GSM5963427 |
Reanalysis of |
GSM5963425 |
Reanalysis of |
GSM5963424 |
Reanalysis of |
GSM6918008 |
Reanalysis of |
GSM6918007 |
Supplementary file | Size | File type/resource |
GSE233377_MicroC.tar.gz | 40.1 Gb | TAR |
GSE233377_NPC_data.tar.gz | 9.4 Gb | TAR |
GSE233377_liver_data.tar.gz | 6.3 Gb | TAR |
GSE233377_sample_file_associations.txt.gz | 825 b | TXT |
GSE233377_tcell_data.tar.gz | 44.9 Gb | TAR |
Processed data are available on Series record |