See also the Zenodo repository for files and details.
We provide a catalogue of 964,970,496 non-redundant smORFs at 100% amino acid identity and 287,926,875 smORF families at 90% amino acid identity.
FASTA files of 100AA / 90AA protein sequences.
100AA smORF catalogue: GMSC10.100AA.faa.xz
90AA smORF families: GMSC10.90AA.faa.xz
FASTA files of 100AA / 90AA nucleotide sequences.
100AA smORF catalogue: GMSC10.100AA.fna.xz
90AA smORF families: GMSC10.90AA.fna.xz
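The FASTA files are xz-compressed and large, so it can help to stream them rather than decompress them to disk. The following is a minimal sketch using only the Python standard library. The helper names `open_xz_text` and `iter_fasta` are hypothetical, and the sketch assumes each record header is a single `>` line.

```python
import lzma


def open_xz_text(path_or_file):
    """Open an .xz file (path or binary file object) as a text stream."""
    return lzma.open(path_or_file, mode="rt", encoding="ascii")


def iter_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream.

    Assumption: one '>' header line per record; sequence lines that
    follow are concatenated until the next header.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

With a downloaded file, `with open_xz_text("GMSC10.90AA.faa.xz") as fh:` followed by a loop over `iter_fasta(fh)` would visit records one at a time without loading the whole file into memory.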
TSV table relating each 100AA smORF accession to its hierarchically obtained family at 90% amino acid identity (families represent sequences with the same function).
Columns:
100AA smORF accession
90AA smORF accession
Protein clustering table: GMSC10.cluster.tsv.xz
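As an illustration of how the clustering table can be used, the sketch below streams the two-column table and collects the 100AA members of a few chosen 90AA families. The function name `members_of` is hypothetical. The sketch assumes the column order listed above (100AA accession first, 90AA accession second) and no header row. It keeps only the requested families because the full table is far too large to hold in memory as a dictionary.

```python
from collections import defaultdict


def members_of(handle, wanted_families):
    """Return {90AA accession: [100AA accessions]} for the wanted families.

    Assumption: each line is '<100AA accession>\t<90AA accession>'.
    """
    wanted = set(wanted_families)
    members = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        member, family = line.split("\t")
        if family in wanted:
            members[family].append(member)
    return dict(members)
```

The `handle` argument can be any text stream, for example one returned by `lzma.open("GMSC10.cluster.tsv.xz", "rt")`.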
TSV table containing the annotations of the 100AA smORF catalogue. The position in the file indicates which smORF the annotation refers to (e.g., the line with index 1,621 refers to GMSC10.100AA.000_001_621; indices start at 0, i.e. GMSC10.100AA.000_000_000). The table contains two columns:
Columns:
Habitat annotation: separated by comma
Taxonomic annotation: separated by semicolon
100AA smORF catalogue: GMSC10.100AA.annotation.tsv.xz
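The position-based indexing described above can be sketched as a pair of conversion helpers plus a line lookup. The function names are hypothetical. The nine-digit, underscore-grouped number format is inferred from the example accession GMSC10.100AA.000_001_621, and `row_at` assumes the first line of the file is index 0 (no header row).

```python
from itertools import islice


def index_to_accession(index, level="100AA"):
    """Convert a 0-based line index to an accession such as GMSC10.100AA.000_001_621."""
    if level not in ("100AA", "90AA"):
        raise ValueError(f"unknown level: {level!r}")
    if not 0 <= index < 10**9:
        raise ValueError(f"index out of range: {index}")
    digits = f"{index:09d}"
    return f"GMSC10.{level}.{digits[0:3]}_{digits[3:6]}_{digits[6:9]}"


def accession_to_index(accession):
    """Convert an accession back to its 0-based line index."""
    return int(accession.rsplit(".", 1)[1].replace("_", ""))


def row_at(handle, index):
    """Return the line at a 0-based index from a text stream, without the newline."""
    row = next(islice(handle, index, index + 1), None)
    if row is None:
        raise IndexError(f"no line with index {index}")
    return row.rstrip("\n")
```

Because `row_at` reads sequentially, looking up many smORFs is cheaper if the indices are sorted and the file is traversed once.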
TSV table containing the annotations of the 90AA smORF families. The position in the file indicates which smORF family the annotation refers to (e.g., the line with index 1,621 refers to GMSC10.90AA.000_001_621; indices start at 0, i.e. GMSC10.90AA.000_000_000). The table contains five columns:
Columns:
Habitat annotation: separated by comma
Taxonomic annotation: separated by semicolon
Conserved domain annotation: identifiers in CDD database, separated by comma
TMHMM prediction: the number of predicted transmembrane helices and the topology predicted, separated by semicolon
SignalP prediction
90AA smORF families: GMSC10.90AA.annotation.tsv.xz
TSV tables containing the quality assessment of the 100AA smORF catalogue and the 90AA smORF families. As above, the position in the file indicates which smORF the row refers to (e.g., the line with index 1,621 refers to GMSC10.100AA.000_001_621 / GMSC10.90AA.000_001_621; indices start at 0). Six quality assessment metrics are provided:
Columns:
AntiFam: 'T' means the smORF does not belong to any AntiFam family; 'F' means it does.
Terminal checking: 'T' means there is an in-frame STOP codon upstream of the smORF, which rules out the possibility that the smORF is part of a gene broken by contig fragmentation; 'F' means there is none. 'NA' means the check was not performed (smORFs from the Progenomes2 database).
RNAcode: P-value from RNAcode. 'NA' means the check was not performed on smORF families (8 members) or that RNAcode reported no result.
MetaTranscriptome: the number of samples in which the smORF is mapped.
Ribo-Seq: the number of samples in which the smORF is mapped.
MetaProteome: the total k-mer coverage of the smORF by peptides.
100AA smORF catalogue: GMSC10.100AA.quality_test.tsv.xz
90AA smORF families: GMSC10.90AA.quality_test.tsv.xz
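A row of the quality table could be parsed into typed values along the following lines. This is a sketch under stated assumptions, not a description of the exact file layout: it assumes the six columns appear in the order listed above with no extra accession column, that 'NA' only occurs in the terminal checking and RNAcode columns, and that the two sample counts are written as integers. The names `Quality` and `parse_quality_row` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quality:
    antifam: bool                 # True when the column holds 'T'
    terminal: Optional[bool]      # None when the column holds 'NA'
    rnacode_p: Optional[float]    # None when the column holds 'NA'
    metatranscriptome: int
    riboseq: int
    metaproteome: float


def _flag(value):
    """Map 'T'/'F'/'NA' to True/False/None."""
    return None if value == "NA" else value == "T"


def parse_quality_row(line):
    """Parse one tab-separated row (assumed column order as listed above)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    antifam, terminal, rnacode, metat, ribo, metap = fields
    return Quality(
        antifam=(antifam == "T"),
        terminal=_flag(terminal),
        rnacode_p=None if rnacode == "NA" else float(rnacode),
        metatranscriptome=int(metat),
        riboseq=int(ribo),
        metaproteome=float(metap),
    )
```

Keeping 'NA' as `None` rather than `False` or `0` preserves the difference between a failed check and a check that was never run.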
TSV table relating the 100AA smORF catalogue to the samples used in GMSC.
As above, the position in the file indicates which smORF each row refers to.
Columns:
samples: separated by comma
100AA smORF catalogue: GMSC10.100AA.sample.tsv.xz
TSV table containing the metadata of the samples used in GMSC.
Columns:
sample_accession
ena_ers_sample_id
database
access_status
study
study_accession
publications
collection_date
aliases
microontology
environment_biome
environment_feature
environment_material
geographic_location
latitude
longitude
host_common_name
host_scientific_name
host_tax_id
general_envo_name
higher_environment
Metadata: GMSC10.metadata.tsv
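As a small example of working with the sample metadata, the sketch below tallies samples by the value of one column. It assumes the file starts with a header row that uses the column names listed above, which is not stated explicitly here. The function name `count_by_column` is hypothetical.

```python
import csv
from collections import Counter


def count_by_column(handle, column="higher_environment"):
    """Count rows per distinct value of a column in a tab-separated file.

    Assumption: the first line is a header row naming the columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column not found: {column!r}")
    return Counter(row[column] for row in reader)
```

For the real file, `with open("GMSC10.metadata.tsv", newline="") as fh:` followed by `count_by_column(fh)` would give the number of samples per `higher_environment` value.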