kmerDB
Jan 2024
Overview
kmerDB is a database that consolidates genomic and proteomic k-mer sequence information across all species in GenBank and UniProt. It supports rapid species identification, comparative genomic studies, and evolutionary analysis.
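For readers new to the term, a k-mer is simply a substring of length k taken from a biological sequence. The following is a minimal sketch of how k-mers can be enumerated and counted for both DNA and protein sequences. The function names `kmers` and `kmer_counts` and the example sequences are illustrative assumptions, not part of kmerDB.

```python
from collections import Counter


def kmers(sequence: str, k: int):
    """Yield every overlapping substring of length k from the sequence."""
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))


# Toy inputs (hypothetical): one DNA fragment and one peptide fragment.
dna_counts = kmer_counts("ACGTACGT", 4)
peptide_counts = kmer_counts("MKTAYIAK", 3)
```

The same routine works for nucleotide and amino acid alphabets because it treats the sequence as a plain string; only the alphabet size differs.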
Features
- Comprehensive Coverage: Encompasses k-mer data from all species in GenBank and UniProt
- Dual Coverage: Includes both genomic (DNA) and proteomic (amino acid) sequences
- Fast Queries: Optimized data structures enable rapid k-mer lookups
- Species Identification: Enables efficient molecular diagnostics and species authentication
- 100-fold Compression: Novel compression procedures reduce data storage requirements dramatically
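To illustrate how k-mer lookups can support species identification, here is a small sketch of an inverted index that maps each k-mer to the species containing it and then ranks species by the number of k-mers shared with a query. This is an assumption-laden toy, not kmerDB's actual data structure or query interface: the names `build_index` and `rank_species` and the reference sequences are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_set(sequence: str, k: int) -> set:
    """Return the set of distinct k-mers in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: dict, k: int) -> dict:
    """Map each k-mer to the set of species whose sequence contains it."""
    index = defaultdict(set)
    for species, seq in references.items():
        for kmer in kmer_set(seq, k):
            index[kmer].add(species)
    return index


def rank_species(query: str, index: dict, k: int) -> list:
    """Rank species by the number of distinct query k-mers they contain."""
    hits = Counter()
    for kmer in kmer_set(query, k):
        # .get avoids creating empty entries for unseen k-mers.
        for species in index.get(kmer, ()):
            hits[species] += 1
    return hits.most_common()


# Hypothetical toy references; real genomes are orders of magnitude larger.
references = {
    "species_A": "ACGTACGTGGCA",
    "species_B": "TTGACCATGGCA",
}
index = build_index(references, k=4)
ranking = rank_species("CGTACGTG", index, k=4)
```

In this toy case every k-mer of the query occurs only in `species_A`, so it is the sole entry in `ranking`. A k-mer such as `GGCA`, present in both references, would not discriminate between them, which is why species-specific k-mers are the useful ones for authentication.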
Technical Implementation
The database was built with compression procedures that achieve a 100-fold reduction in data size while maintaining query performance. This makes it practical to store and analyze k-mer information from across the entire tree of life.
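The page does not describe the compression procedure itself, so the sketch below is not kmerDB's method. It only shows one generic, widely used way to store DNA k-mers compactly: packing each nucleotide into 2 bits so that a k-mer becomes a single integer. The helper names `pack_kmer` and `unpack_kmer` are hypothetical, and this technique alone gives at most a 4-fold saving over one byte per base, far short of the 100-fold figure reported above.

```python
# 2-bit code for the four unambiguous nucleotides.
_ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_DECODE = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a DNA k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer.upper():
        # Shift earlier bases left and append the new base's 2-bit code.
        value = (value << 2) | _ENCODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(_DECODE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


packed = pack_kmer("ACGT")
restored = unpack_kmer(packed, 4)
```

The length k must be stored or known separately, because leading `A` bases encode as zero bits and would otherwise be lost on decoding. Ambiguous bases such as `N` raise a `KeyError` in this sketch and would need separate handling.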
Applications
- Species identification and authentication
- Comparative genomics
- Evolutionary studies
- Molecular diagnostics
- Environmental monitoring
- Food authentication
Publications
Mouratidis, I., Baltoumas, F. A., Chantzi, N., et al. (2024). kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Computational and Structural Biotechnology Journal, 23.

Authors
Ioannis Mouratidis
(he/him)
Research Engineer
Research engineer focused on AI safety, alignment, and AIxBio security. I study how
capabilities and values emerge during model training and build scalable interventions to align
frontier models, drawing on deep expertise in biosecurity and biological foundation models.
38 publications (12 first or senior author) and 3 patents in large-scale data analysis and ML
for biology; co-founded an AI-driven cancer-diagnostics startup and authored grants securing
$4M+ in competitive funding.