Global Microbial smORFs Catalogue v1.0

The Global Microbial smORFs Catalogue (GMSC) is an integrated, consistently processed catalogue of small open reading frames (smORFs) from the microbial world, combining publicly available metagenomes and high-quality microbial isolate genomes. In total, ~965 million non-redundant 100AA smORFs were predicted from 63,410 metagenomes spanning global habitats (from the SPIRE database) and 87,920 high-quality microbial isolate genomes (from the ProGenomes2 database). These smORFs were then clustered at 90% amino acid identity, yielding ~288 million 90AA smORF families. Here, 100AA and 90AA refer to catalogue identity thresholds rather than peptide length: 100AA is the non-redundant catalogue obtained by collapsing exact amino acid duplicates, while 90AA groups related smORFs into family-level clusters at 90% amino acid identity.
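
The difference between the two levels can be illustrated with a minimal Python sketch. This is a toy illustration only, not the GMSC pipeline: the function names (`collapse_exact`, `naive_identity`, `greedy_cluster`), the example peptides, and the ungapped identity measure are all assumptions made for this example, and a catalogue of this size would need a dedicated sequence clustering tool.

```python
def collapse_exact(seqs):
    """Drop exact amino acid duplicates, keeping first-seen order (the "100AA" idea)."""
    return list(dict.fromkeys(seqs))


def naive_identity(a, b):
    """Toy identity: matching positions in an ungapped, left-aligned comparison,
    divided by the longer length. Real tools use proper alignments."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.9):
    """Greedy clustering (the "90AA" idea): process longest sequences first and
    attach each one to the first representative it matches at >= threshold."""
    clusters = {}  # representative -> list of member sequences
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if naive_identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]  # no match: s starts a new family
    return clusters


# Hypothetical peptides, invented for this example.
peptides = [
    "MKKLLAVAALLGLAGCSSTQ",
    "MKKLLAVAALLGLAGCSSTQ",  # exact duplicate of the first
    "MKKLLAVAALLGLAGCSSTE",  # one substitution (95% identity to the first)
    "MSRIEELVKKHGHWVNDDPR",  # unrelated sequence
]

unique = collapse_exact(peptides)   # 100AA-style non-redundant set
families = greedy_cluster(unique)   # 90AA-style families
print(len(peptides), len(unique), len(families))
```

In this sketch the exact duplicate disappears at the 100AA step, while the single-substitution variant survives as its own 100AA entry and is only merged with its neighbour at the 90AA step.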

Reference: Duan, Y., Santos-Júnior, C. D., Schmidt, T. S., Fullam, A., de Almeida, B. L. S., Zhu, C., Kuhn, M., Zhao, X.-M., Bork, P. & Coelho, L. P. A catalog of small proteins from the global microbiome. Nature Communications 15, 7563 (2024).

  • The GMSC annotation includes:
    • taxonomy classification
    • habitat assignment
    • quality assessment
    • conserved domain annotation
    • cellular localization prediction

For more information, see Duan et al. (2024).

Copyright (c) 2023-2026 GMSC authors. All rights reserved.