An Efficient Automated Machine Learning Framework for Genomics and Proteomics Sequence Analysis

Genomics and Proteomics sequence analyses are the scientific study of the language of deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein biomolecules, with the objective of controlling the production of proteins and understanding their core functionalities. They help to detect chronic diseases at early stages and to identify root causes of clinical changes, key genetic targets for pharmaceutical development, and ways to optimize therapeutics for various age groups. Most Genomics and Proteomics sequence analysis work is performed using conventional wet lab experimental approaches that rely on different genetic diagnostic technologies. However, these approaches are costly, time-consuming, and skill- and labor-intensive. They therefore slow down the development of an efficient and economical sequence analysis landscape, which is essential to demystify a variety of cellular processes and the functioning of biomolecules in living organisms. To support manual, wet lab driven research, many machine learning based approaches have been developed in recent years. However, these approaches cannot be used in practical settings due to their limited performance. Given the sensitive and inherently demanding nature of Genomics and Proteomics sequence analysis, where misdiagnosis can have serious and far-reaching repercussions, the main objective of this research is to develop an efficient automated computational framework for Genomics and Proteomics sequence analysis that uses the predictive and prescriptive analytical power of Artificial Intelligence (AI) to significantly improve healthcare operations. The proposed framework comprises three main components: sequence encoding, feature engineering, and a discrete or continuous value predictor.
The sequence encoding module is equipped with a variety of existing and newly developed sequence encoding algorithms that are capable of generating a rich statistical representation of raw DNA, RNA and protein sequences. The feature engineering module offers diverse feature selection and dimensionality reduction approaches that can be used to generate the most effective feature space. The discrete and/or continuous value predictor module contains a wide range of existing machine learning and newly developed deep learning regressors and classifiers. To evaluate the integrity and generalizability of the proposed framework, we performed large-scale experiments over diverse types of Genomics and Proteomics sequence analysis tasks (i.e., DNA, RNA and proteins). In Genomics analysis, epigenetic modification detection is a key component. It helps clinical researchers and practitioners to distinguish normal cellular activities from malfunctioning ones, which can lead to diverse genetic disorders such as metabolic disorders and cancers. To support this analysis, the proposed framework is used to solve the problem of DNA and histone modification prediction, where it achieved state-of-the-art performance on 27 publicly available benchmark datasets of 17 different species with a best accuracy of 97%. RNA sequence analysis is another vital component of Genomics sequence analysis, where the identification of different coding and non-coding RNAs and their subcellular localization patterns helps to demystify the functions of diverse RNAs, identify root causes of clinical changes, develop precision medicine and optimize therapeutics. To support this analysis, the proposed framework is used for non-coding RNA classification and multi-compartment RNA subcellular localization prediction, where it achieved state-of-the-art performance on 10 publicly available benchmark datasets of Homo sapiens and Mus musculus species with a best accuracy of 98%.
Proteomics sequence analysis is essential to demystify virus pathogenesis, host immune responses, the way proteins affect or are affected by cell processes, and their structure and core functionalities. To support this analysis, the proposed framework is used for host protein-protein and virus-host protein-protein interaction prediction. It achieved state-of-the-art performance on 2 publicly available protein-protein interaction datasets of Homo sapiens and Mus musculus species with a best accuracy of 96%, and on 7 virus-host protein-protein interaction datasets of multiple hosts and viruses with a best accuracy of 94%. Considering the performance and practical significance of the proposed framework, we believe it will help researchers develop cutting-edge practical applications for diverse Genomics and Proteomics sequence analysis tasks (i.e., DNA, RNA and proteins).
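The three-stage structure described in the abstract (sequence encoding, feature engineering, value prediction) can be illustrated with a minimal sketch. This is not the thesis's implementation: the k-mer frequency encoder, the variance-based feature ranking, and the nearest-centroid classifier are simple stand-ins chosen for illustration, and all function names and the toy sequences are hypothetical.

```python
from itertools import product
from statistics import pvariance


def kmer_encode(seq, k=2, alphabet="ACGT"):
    """Stage 1 (assumed encoder): normalized k-mer frequencies of a sequence."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # ignore windows containing symbols outside the alphabet
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero for very short sequences
    return [counts[m] / total for m in kmers]


def select_by_variance(rows, top_n):
    """Stage 2 (assumed selector): indices of the top_n highest-variance features."""
    variances = [pvariance(col) for col in zip(*rows)]
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:top_n])


def project(row, keep):
    """Restrict a feature vector to the selected feature indices."""
    return [row[j] for j in keep]


def fit_centroids(rows, labels):
    """Stage 3 (assumed predictor): mean feature vector per class."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in groups.items()
    }


def predict(centroids, row):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], row))


# Toy training data (hypothetical, for illustration only).
train_seqs = ["ATATATATAT", "TATATATATA", "GCGCGCGCGC", "CGCGCGCGCG"]
train_labels = ["AT-rich", "AT-rich", "GC-rich", "GC-rich"]

encoded = [kmer_encode(s) for s in train_seqs]
keep = select_by_variance(encoded, top_n=4)
centroids = fit_centroids([project(r, keep) for r in encoded], train_labels)

for query in ["ATATTATA", "GCGGCGCC"]:
    print(query, predict(centroids, project(kmer_encode(query), keep)))
```

Keeping each stage as an independent function mirrors the modular design the abstract describes: any encoder, selector, or predictor with the same input and output shape could be swapped in without changing the other two stages.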
Metadata
Author:Muhammad Nabeel Asim
URN:urn:nbn:de:hbz:386-kluedo-71759
DOI:https://doi.org/10.26204/KLUEDO/7175
Advisor:Andreas Dengel
Document Type:Doctoral Thesis
Language of publication:English
Date of Publication (online):2023/02/21
Date of first Publication:2023/02/21
Publishing Institution:Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Granting Institution:Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Acceptance Date of the Thesis:2022/12/19
Date of the Publication (Server):2023/02/21
Page Number:XXIII, 199, 40, 1
Faculties / Organisational entities:Kaiserslautern - Fachbereich Informatik
DDC-Classification:6 Technology, medicine, applied sciences / 600 Technology
Licence (German):Creative Commons 4.0 - Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND 4.0)