AUTOMATED BIOINFORMATICS PIPELINES ON SUPERCOMPUTERS: CHALLENGES AND EMERGING SOLUTIONS

Published 2026-06-30
PHYSICS-MATHEMATICS Vol. 84 No. 2 (2026)
Authors:
  • ASHIMGALIYEV M.
  • MUSSABEK M.
  • MATKARIMOV B.
  • ZHUMADILLAYEVA A.K.

High-throughput biological data generation has driven the adoption of automated bioinformatics pipelines on high-performance computing (HPC) systems and supercomputers. This systematic review synthesizes 101 studies published between 2018 and 2025, following PRISMA guidelines, to examine workflow management systems (WfMSs) deployed in HPC environments across genomics, transcriptomics, proteomics, and metagenomics domains. We analyzed prominent frameworks including Nextflow, Snakemake, WDL, and CWL, documenting their implementation challenges and emerging solutions. Key challenges identified include scheduler saturation from massive parallelism, I/O bottlenecks on shared file systems, heterogeneous resource allocation, and reproducibility across diverse computing environments. Containerization through Docker and Singularity has emerged as the dominant solution for ensuring portability and reproducibility. Community-driven initiatives like nf-core have accelerated adoption by providing curated, best-practice pipelines. Advanced solutions include HPC-aware scheduling strategies, hybrid cloud-HPC architectures, and GPU integration for machine learning-augmented analyses. While significant progress has been made in automating complex multi-step analyses, continued co-evolution of workflow systems and HPC infrastructure remains essential for handling exascale data volumes and achieving fully reproducible computational biology at scale.

ASHIMGALIYEV M.

PhD, lecturer, Department of Computer and Software Engineering, Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Astana, Kazakhstan.

E-mail: ashimgaliyev.medet@gmail.com, https://orcid.org/0009-0003-9829-6187

MUSSABEK M.

Senior lecturer, School of Artificial Intelligence and Data Science, Astana IT University, Astana, Kazakhstan.

E-mail: miras.k@astanait.edu.kz, https://orcid.org/0009-0009-2353-3524

MATKARIMOV B.

Doctor of technical sciences, professor, lecturer and researcher, Department of Artificial Intelligence Technology, Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Astana, Kazakhstan.

E-mail: bakhyt.matkarimov@gmail.com, https://orcid.org/0000-0003-0775-7324

ZHUMADILLAYEVA A.K.

Candidate of technical sciences, associate professor, Department of Computer and Software Engineering, Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Astana, Kazakhstan.

E-mail: Ainur.Zhumadillayeva@astanait.edu.kz, https://orcid.org/0000-0003-1042-0415

Keywords: bioinformatics pipelines, containerization, high-performance computing, reproducibility, scalability, workflow management systems

How to Cite

AUTOMATED BIOINFORMATICS PIPELINES ON SUPERCOMPUTERS: CHALLENGES AND EMERGING SOLUTIONS. (2026). Scientific Journal "Bulletin of the K. Zhubanov Aktobe Regional University", 84(2), 106-123. https://doi.org/10.70239/arsu.2026.t84.n2.12