CIO Insider

These Two Deep Learning Models Could Predict the 3D Structure & Function of Your DNA

The arrival of software in biology creates sensational headlines. Technology makes biology more understandable and helps to determine the structure of biomolecules. The last decade was revolutionary for the genomic sciences. Researchers once believed that full genome sequencing of extinct species such as the woolly mammoth and Neanderthals was impossible, because the best available technology at the time was incredibly demanding in terms of fossil material, experimental workload, and cost. Each piece of target genomic DNA had to be amplified several times via PCR. Then, ideally, the PCR amplicons had to be cloned in bacterial vectors, and a number of clones had to be sequenced before a consensus sequence devoid of sequencing errors could be generated.

By the time the first draft of the mammoth genome was characterized, new sequencing technologies with higher throughput were available. A new technology platform called Illumina Genome Analyzer II could generate 180 million sequence reads per run. This massive sequencing throughput, combined with the high endogenous DNA content of hair and preservation in a cold environment, made sequencing the first ancient human genome possible.

One of the biggest mysteries to which artificial intelligence has been applied is the protein folding problem, the question of how a protein takes on its 3D structure, which remained unsolved for the last 50 years. The innovative programs AlphaFold2 and RoseTTAFold help predict the structure of a protein from its amino acid sequence. Previously, researchers used X-ray crystallography to determine the structure of a protein, a process that could take years and was expensive.

AI in Genomic Science
In the past few years, machine learning (ML) has driven the genomic sciences. Over recent decades, ML has been widely used in many areas of genomics, especially those characterized by the production of large amounts of data or by complex mechanisms governed by the synergistic participation of different factors. Important applications include prediction of DNA regulatory regions; discovery of cell morphology and spatial organization; identification of associations between phenotypes and genotypes; classification of DNA methylation and histone modifications; biomarker discovery; detection of transcriptional enhancers; cancer diagnosis; and analysis of evolutionary mechanisms.

The first attempts to apply supervised training techniques to the genomic sciences date back to the 1980s. In 1982, Stormo et al. used the Perceptron algorithm to distinguish E. coli translational initiation sites from all other sites in a library of over 78,000 nucleotides of mRNA sequence. In 1993, researchers implemented a neural network to predict protein secondary structure. Deep learning techniques began to be used on a large scale in functional genomics only in the second decade of the 2000s, thanks to improved computing performance and the collapse of genome sequencing costs.
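To make the Perceptron idea concrete, here is a minimal sketch in Python. It is not the code or data used by Stormo et al.; the one-hot encoding, the function names, and the toy sequences are all assumptions chosen for illustration. Each base becomes four 0/1 features, and the weights change only when the current model misclassifies a training sequence.

```python
NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a sequence as a flat list with four 0/1 features per base."""
    return [1.0 if base == n else 0.0 for base in seq for n in NUCLEOTIDES]

def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0

def train_perceptron(examples, max_epochs=500):
    """Classic perceptron rule: adjust the weights only after a mistake."""
    weights = [0.0] * (4 * len(examples[0][0]))
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for seq, label in examples:
            features = one_hot(seq)
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            if error != 0:
                mistakes += 1
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
        if mistakes == 0:  # every training example is now classified correctly
            break
    return weights, bias

# Made-up toy windows (NOT real E. coli data). Label 1 means the start codon
# "ATG" sits at 0-based positions 3 to 5; label 0 means one of its bases differs.
TOY_EXAMPLES = [
    ("GGAATGCC", 1), ("TCAATGGA", 1), ("CAGATGTT", 1),
    ("GGACTGCC", 0), ("TCAAAGGA", 0), ("CAGATCTT", 0),
]

weights, bias = train_perceptron(TOY_EXAMPLES)
predictions = [predict(weights, bias, one_hot(seq)) for seq, _ in TOY_EXAMPLES]
```

Because the toy classes are linearly separable, training stops once every example is classified correctly. Real initiation-site data is far noisier, which is one reason later work moved to multi-layer networks.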

In 2015, two important deep architectures were implemented and applied to functional genomics, producing results of great scientific impact. DeepBind is a fully automatic stand-alone software tool for predicting the sequence specificities of DNA- and RNA-binding proteins. DeepSEA (deep learning-based sequence analyzer) predicts the chromatin effects of sequence alterations with single-nucleotide resolution by learning regulatory sequences from large-scale chromatin-profiling data. Both methods, based on deep architectures, overcame many challenges, such as processing millions of sequences, generalizing between data from different technologies, tolerating noise and missing data, and learning end to end, fully automatically, without the need for hand-tuning. These approaches outperformed other state-of-the-art methods and encouraged many scientists to follow similar paths.
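The core building block behind sequence models of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the published DeepBind or DeepSEA architecture: the filter below is written by hand (a trained model learns its filters from data), and the names `TATA_FILTER` and `motif_score` and the threshold value are assumptions. The sketch one-hot encodes a sequence, slides a filter along it (convolution), zeroes out weak responses (rectification), and keeps the strongest response (max pooling).

```python
NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a sequence as one row of four 0/1 values (A, C, G, T) per base."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]

# Hypothetical hand-written filter that rewards the motif "TATA":
# +1 for the expected base at each position, -1 for any other base.
TATA_FILTER = [
    [-1.0, -1.0, -1.0, 1.0],   # position 0 expects T
    [1.0, -1.0, -1.0, -1.0],   # position 1 expects A
    [-1.0, -1.0, -1.0, 1.0],   # position 2 expects T
    [1.0, -1.0, -1.0, -1.0],   # position 3 expects A
]

def convolve(encoded, kernel):
    """Slide the kernel along the sequence and return one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores

def rectify(scores, threshold):
    """Keep only the part of each score that exceeds the threshold."""
    return [max(0.0, s - threshold) for s in scores]

def motif_score(seq, kernel, threshold=2.0):
    """Convolution, rectification, then max pooling over all windows."""
    return max(rectify(convolve(one_hot(seq), kernel), threshold))

with_motif = motif_score("GGCTATAAGC", TATA_FILTER)
without_motif = motif_score("GGCCGGCCGG", TATA_FILTER)
```

Here the first sequence contains one exact "TATA" match, so it scores above zero, while the second scores zero. In a full model, many such filters feed further layers that combine their outputs into a prediction.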

Early this year, researchers from Stanford University, NVIDIA, Oxford Nanopore Technologies, Google, Baylor College of Medicine, and the University of California at Santa Cruz developed a method that sequenced DNA in five hours and two minutes, setting a Guinness World Record for the fastest DNA sequencing technique. They used AI to speed up the end-to-end process, from collecting a blood sample to sequencing the whole genome and identifying variants linked to diseases. Within a few hours, the researchers diagnosed a three-month-old infant suffering from a rare seizure-causing genetic disorder. Traditional gene panel analysis takes as long as two weeks to return results.

Likewise, by using artificial intelligence (AI), researchers are able to predict the three-dimensional structure of DNA and the activity of its regulatory elements. This helps in the detection of diseases caused by mutations. The technology indicates precisely how changes in the DNA sequence, including those in non-coding regions, affect regulation. Researchers say that only one percent of the DNA in the human body codes for proteins; the rest holds regulatory elements such as promoters, enhancers, silencers, and insulators that control how the coding DNA is expressed.

Sei Could Detect 3D Structure
Researchers created a deep learning model called Sei to understand the regulatory elements better. The model sorts the non-coding part of the DNA into 40 sequence classes, for example a sequence that acts as an enhancer in brain cells or in stem cells. The 40 sequence classes were derived from nearly 22,000 data sets from previous studies of genome regulation, and together they cover more than 97 percent of the human genome. Moreover, Sei can score any sequence by its predicted activity in each of the 40 sequence classes and predict how mutations affect those activities.
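The idea of scoring a mutation by comparing predictions for the reference and mutated sequence can be sketched as follows. This is not Sei or its API: `toy_class_scores` is a hypothetical stand-in for a trained model, and the two class names are invented for illustration (Sei itself uses 40 learned sequence classes). Only the pattern is the point: score the reference, score the mutated sequence, and report the difference per class.

```python
def apply_snv(seq, position, alt):
    """Return a copy of seq with the base at a 0-based position replaced by alt."""
    if seq[position] == alt:
        raise ValueError("alternate base is identical to the reference base")
    return seq[:position] + alt + seq[position + 1:]

def toy_class_scores(seq):
    """Hypothetical stand-in for a trained model (NOT Sei).

    Maps a sequence to one score per invented class name.
    """
    return {
        "gc_rich": (seq.count("G") + seq.count("C")) / len(seq),
        "tata_like": float(seq.count("TATA")),
    }

def variant_effect(seq, position, alt, scorer):
    """Per-class change in predicted activity: mutated score minus reference score."""
    ref_scores = scorer(seq)
    alt_scores = scorer(apply_snv(seq, position, alt))
    return {name: alt_scores[name] - ref_scores[name] for name in ref_scores}

# A single-base change (A to G at position 4) that destroys the "TATA" motif.
effect = variant_effect("GGCTATAAGC", 4, "G", toy_class_scores)
```

A negative value for a class means the mutation is predicted to reduce that activity, and a positive value means it is predicted to increase it. Swapping in a real model only requires a `scorer` function with the same input and output shape.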

Using Sei, researchers could characterize the regulatory architecture of 47 traits and diseases and interpret how mutations contribute to them. Such capabilities would help build a more complete picture of how the genome works. Previously, the researchers developed a model called Orca that predicts the 3D structure of DNA from its sequence. That includes segments carrying mutations that lead to diseases such as leukemia and limb malformation. The model also helps to explain how the DNA sequence controls its local and large-scale 3D structure. Together, Sei and Orca give a better understanding of the DNA's 3D structure and function.
