The Global Microbial smORFs Catalogue (GMSC) is an integrated, consistently processed catalogue of small open reading frames (smORFs) from the microbial world. It combines 63,410 publicly available metagenomes from the SPIRE database with 87,920 high-quality isolate microbial genomes from the ProGenomes v2 database.
A total of 4.5 billion smORFs were predicted to build the catalogue. Removing redundancy at 100% amino acid identity yielded the non-redundant 100AA catalogue of 964,970,496 sequences. These sequences were then clustered at 90% amino acid identity, yielding the 90AA catalogue of 287,926,875 smORF families.
In GMSC, 100AA and 90AA refer to catalogue identity thresholds rather than peptide length. 100AA is the non-redundant catalogue after collapsing exact amino acid duplicates, while 90AA groups related smORFs into family-level clusters at 90% amino acid identity.
For more details about the GMSC, please see:
Duan, Y., Santos-Júnior, C. D., Schmidt, T. S., Fullam, A., de Almeida, B. L. S., Zhu, C., Kuhn, M., Zhao, X.-M., Bork, P. & Coelho, L. P. A catalog of small proteins from the global microbiome. Nature Communications 15, 7563 (2024). DOI:10.1038/s41467-024-51894-6
Additionally, if you use GMSC in your research, please cite the above paper.
Integration:
Main purpose of GMSC:
smORFs in the catalogue are identified with the scheme GMSC10.100AA.XXX_XXX_XXX or GMSC10.90AA.XXX_XXX_XXX. The initial GMSC10 indicates the version of the catalogue (Global Microbial smORFs Catalogue 1.0). The 100AA or 90AA indicates the amino acid identity threshold of the catalogue. The XXX_XXX_XXX is a unique numerical identifier (starting at zero). Numbers were assigned in order of increasing number of copies, so the greater the number, the more copies of that peptide were present in the raw data.
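The identifier scheme can be handled programmatically. The following is a minimal Python sketch, not part of GMSC itself: the function names parse_gmsc_id and format_gmsc_id are hypothetical, and it assumes that each XXX group is exactly three decimal digits, as the XXX_XXX_XXX pattern suggests.

```python
import re
from typing import NamedTuple

# Assumption: each XXX group is exactly three decimal digits.
GMSC_ID = re.compile(r"^GMSC(\d)(\d)\.(100AA|90AA)\.(\d{3})_(\d{3})_(\d{3})$")


class GmscId(NamedTuple):
    version: str  # e.g. "1.0" for the GMSC10 prefix
    level: str    # "100AA" or "90AA"
    index: int    # numerical identifier, starting at zero


def parse_gmsc_id(identifier: str) -> GmscId:
    """Split an identifier into version, catalogue level and numeric index."""
    match = GMSC_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a GMSC identifier: {identifier!r}")
    major, minor, level, a, b, c = match.groups()
    return GmscId(f"{major}.{minor}", level, int(a + b + c))


def format_gmsc_id(level: str, index: int) -> str:
    """Build a GMSC 1.0 identifier from a catalogue level and numeric index."""
    if level not in ("100AA", "90AA"):
        raise ValueError(f"unknown catalogue level: {level!r}")
    if not 0 <= index < 10**9:
        raise ValueError("index does not fit in XXX_XXX_XXX")
    digits = f"{index:09d}"
    return f"GMSC10.{level}.{digits[:3]}_{digits[3:6]}_{digits[6:]}"
```

Because numbers were assigned in order of increasing copy number, comparing the index field of two identifiers from the same catalogue level indicates which peptide had more copies in the raw data.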
On the 100AA Sequence page, the following information is displayed for each non-redundant smORF accession.
On the 90AA Cluster page, the following information is displayed for each family-level cluster. The 100AA members of the cluster can be displayed by pressing the show button.
GMSC-mapper is provided as a search tool for querying sequences. Users can provide contigs or protein sequences, and it will return a set of smORFs with complete annotations that match the 90AA smORF families in GMSC.
The search will take ~15 minutes. A search ID will be provided for each query. Search IDs are of the form #-xxxx, where # is an incrementing index and xxxx is a random string.
Users can wait for results on the Mapper page or look them up later from the Home page using the search ID.
GMSC-mapper can also be downloaded and run locally; see details on the GitHub page.

Users can browse by habitats and taxonomy. For example, searching for marine will match entries such as freshwater,marine,human gut. Multiple habitats can be selected.
The results are 90AA smORF families spanning the selected habitats and taxonomy. Each row represents a family-level cluster rather than an individual non-redundant 100AA sequence.
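The habitat matching described above can be illustrated with a short Python sketch. This is not the website's implementation: the function names are hypothetical, and it assumes that habitat annotations are comma-separated, that a search term must equal one whole habitat name (ignoring case), and that selecting several habitats requires all of them to be present.

```python
from typing import Iterable, List


def split_habitats(field: str) -> List[str]:
    """Split a comma-separated habitat annotation into individual habitats."""
    return [part.strip() for part in field.split(",") if part.strip()]


def matches_habitats(field: str, selected: Iterable[str]) -> bool:
    """Return True if every selected habitat appears in the annotation.

    Assumption: matching is on whole habitat names, case-insensitively,
    and all selected habitats must be present.
    """
    present = {habitat.lower() for habitat in split_habitats(field)}
    return all(term.strip().lower() in present for term in selected)


# "marine" matches an entry annotated with several habitats.
print(matches_habitats("freshwater,marine,human gut", ["marine"]))
```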
63,410 assembled metagenomes from the SPIRE database were used.
87,920 high-quality microbial genomes were downloaded from the ProGenomes v2 database.

Redundancy among all predicted smORFs was removed at 100% amino acid identity. The remaining sequences were then clustered at 90% amino acid identity and 90% coverage using Linclust. As a result, the 100AA catalogue stores non-redundant sequences, while the 90AA catalogue stores family-level clusters.
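The first step, collapsing exact amino acid duplicates, can be sketched in a few lines of Python. This is an illustration only, with a hypothetical function name; it is not the pipeline used to build GMSC and it does not reproduce the Linclust clustering step. It also records how many copies of each sequence were seen and orders the result by increasing copy number.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def collapse_exact_duplicates(sequences: Iterable[str]) -> List[Tuple[str, int]]:
    """Collapse identical peptide sequences and count their copies.

    Returns (sequence, copies) pairs sorted by increasing copy number;
    ties are broken alphabetically so the order is deterministic.
    """
    counts = Counter(sequences)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Toy peptide strings, invented for illustration.
peptides = ["MKT", "MAA", "MKT", "MKT", "MAA", "MLL"]
for sequence, copies in collapse_exact_duplicates(peptides):
    print(sequence, copies)
```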
The representative sequences of the 90AA smORF families were searched against the NCBI CDD database using RPS-BLAST. Hits with an e-value of at most 0.01 and covering at least 80% of the PSSM length were retained and considered significant.
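The two thresholds can be expressed as a simple filter. The Python sketch below is an assumption-laden illustration rather than the authors' code: the CddHit record and its field names are hypothetical, and coverage is taken to be the aligned span on the PSSM divided by the PSSM length.

```python
from dataclasses import dataclass


@dataclass
class CddHit:
    """Hypothetical record for one RPS-BLAST hit against a CDD PSSM."""
    query_id: str
    pssm_id: str
    evalue: float
    pssm_start: int   # 1-based start of the alignment on the PSSM
    pssm_end: int     # inclusive end of the alignment on the PSSM
    pssm_length: int  # full length of the PSSM


def pssm_coverage(hit: CddHit) -> float:
    """Fraction of the PSSM length covered by the alignment."""
    return (hit.pssm_end - hit.pssm_start + 1) / hit.pssm_length


def is_significant(hit: CddHit, max_evalue: float = 0.01,
                   min_coverage: float = 0.8) -> bool:
    """Apply the e-value and PSSM coverage thresholds stated in the text."""
    return hit.evalue <= max_evalue and pssm_coverage(hit) >= min_coverage
```

For example, a hit with an e-value of 1e-5 that aligns to positions 1 to 90 of a 100-position PSSM has 90% coverage and would be retained, whereas the same alignment with an e-value of 0.5 would be discarded.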
-org gram+, -org gram-, and -org arch modes.