gms | German Medical Science

64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

08. - 11.09.2019, Dortmund

Comparison of pathway guided random forests approaches for the integration of biological knowledge

Meeting Abstract

  • Stephan Seifert - Christian-Albrechts-Universität zu Kiel, Kiel, Germany
  • Sven Gundlach - Christian-Albrechts-Universität zu Kiel, Kiel, Germany
  • Olaf Junge - Christian-Albrechts-Universität zu Kiel, Kiel, Germany
  • Silke Szymczak - Christian-Albrechts-Universität zu Kiel, Kiel, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 281

doi: 10.3205/19gmds061, urn:nbn:de:0183-19gmds0615

Published: September 6, 2019

© 2019 Seifert et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

High-throughput technologies including microarrays and next generation sequencing allow comprehensive characterization of individuals on many molecular levels. However, training prediction models on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships between molecules into the model building process.

Several random forest (RF)-based approaches that enable the selection of important pathways have been proposed in the literature. Two methods, PE (prediction error) [1] and SF (synthetic features) [2], train a separate RF for each pathway: PE selects pathways with low prediction error, and SF selects pathways based on the importance of the predicted pathway-specific outcome. The LeFE algorithm [3] compares the importance of genes within and outside of each pathway, and the Hunting approach [4] estimates pathway-specific importances.
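The per-pathway idea behind PE can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's RandomForestClassifier, uses the out-of-bag (OOB) error as the prediction error measure, and simply ranks pathways instead of applying a formal selection rule. The function name pathway_oob_errors, the toy data and the two pathway definitions are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def pathway_oob_errors(X, y, pathways, n_trees=200, seed=0):
    """Train one random forest per pathway and return its out-of-bag error.

    X        : (n_samples, n_genes) expression matrix
    y        : binary outcome labels
    pathways : dict mapping pathway name -> list of gene column indices
    """
    errors = {}
    for name, gene_idx in pathways.items():
        rf = RandomForestClassifier(
            n_estimators=n_trees, oob_score=True, random_state=seed
        )
        # Only the genes belonging to this pathway are used as predictors.
        rf.fit(X[:, gene_idx], y)
        errors[name] = 1.0 - rf.oob_score_
    return errors


# Hypothetical toy data: 120 samples, 30 genes, genes 0-4 shifted in cases.
rng = np.random.default_rng(1)
n_samples, n_genes = 120, 30
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :5] += 1.5

# Hypothetical pathways: one containing the shifted genes, one without signal.
pathways = {
    "signal_pathway": list(range(0, 10)),
    "null_pathway": list(range(10, 20)),
}

errors = pathway_oob_errors(X, y, pathways)
ranked = sorted(errors, key=errors.get)  # lowest prediction error first
print(ranked, errors)
```

In this toy setting the pathway containing the shifted genes should rank first. How a cutoff on the prediction error is chosen in practice is not described in the abstract and is left open here.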

In our simulation study, PE and SF demonstrated high empirical power across a range of pathway sizes, degrees of association, and correlation patterns, whereas the two other methods, Hunting and LeFE, were only able to detect large pathways with a strong signal. In the complete null scenario with no differentially expressed genes, Hunting and LeFE falsely detected pathways with strong pairwise correlation, while SF had increased false discovery rates for all pathways.
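The abstract does not define how these quantities were computed. As a rough sketch under common conventions, empirical power can be taken as the fraction of simulation replicates in which a truly associated pathway is selected, and the empirical false discovery rate as the mean proportion of selected pathways that are not truly associated (counted as 0 when nothing is selected). The function names and the toy replicates below are hypothetical.

```python
def empirical_power(selections, pathway):
    """Fraction of replicates in which `pathway` was selected."""
    return sum(pathway in selected for selected in selections) / len(selections)


def false_discovery_proportion(selected, true_pathways):
    """Share of selected pathways that are not truly associated."""
    if not selected:
        return 0.0  # convention: no selections means no false discoveries
    return len(set(selected) - set(true_pathways)) / len(selected)


def empirical_fdr(selections, true_pathways):
    """Mean false discovery proportion over all replicates."""
    fdps = [false_discovery_proportion(s, true_pathways) for s in selections]
    return sum(fdps) / len(fdps)


# Hypothetical outcome of four simulation replicates; only P1 carries signal.
true_pathways = {"P1"}
selections = [{"P1", "P2"}, {"P1"}, {"P3"}, set()]

power_p1 = empirical_power(selections, "P1")
fdr = empirical_fdr(selections, true_pathways)
print(power_p1, fdr)
```

In a complete null scenario the set of truly associated pathways is empty, so every selection counts as a false discovery under these definitions.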

For the comparison on experimental data sets, we downloaded publicly available studies from the NCBI GEO database [5] that had been used to define the hallmark gene sets of the Molecular Signatures Database (MSigDB) [6]. Again, PE and SF identified many pathways, which usually included the hallmark gene set, whereas Hunting and LeFE detected a much smaller number of pathways and rarely the hallmark gene set.

In conclusion, PE is a sensitive machine learning approach to detect pathways associated with the outcome even if only a small number of genes is differentially expressed.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Pang H, Lin A, Holford M, Enerson BE, Lu B, Lawton MP, et al. Pathway analysis using random forests classification and regression. Bioinformatics. 2006;22(16):2028-36.
2. Pan Q, Hu T, Malley JD, Andrew AS, Karagas MR, Moore JH. A System-Level Pathway-Phenotype Association Analysis Using Synthetic Feature Random Forest. Genet Epidemiol. 2014;38(3):209-19.
3. Eichler GS, Reimers M, Kane D, Weinstein JN. The LeFE algorithm: embracing the complexity of gene expression in the interpretation of microarray data. Genome Biol. 2007;8(9):R187.
4. Chen X, Ishwaran H. Pathway hunting by random survival forests. Bioinformatics. 2013;29(1):99-105.
5. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets-update. Nucleic Acids Res. 2013;41(Database issue):D991-5.
6. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 2015;1(6):417-25.