kmerDB

Jan 2024

Overview

kmerDB is a database that consolidates genomic and proteomic k-mer sequence information for each species in GenBank and UniProt. It supports rapid species identification, comparative genomic studies, and evolutionary analysis.

Features

  • Comprehensive Coverage: Encompasses k-mer data from all species in GenBank and UniProt
  • Dual Coverage: Includes both genomic (DNA) and proteomic (amino acid) sequences
  • Fast Queries: Optimized data structures enable rapid k-mer lookups
  • Species Identification: Enables efficient molecular diagnostics and species authentication
  • 100-fold Compression: Novel compression procedures reduce data storage requirements by a factor of 100
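
To make the lookup and species identification features concrete, here is a minimal Python sketch of a k-mer index. It is an illustration only, not kmerDB's implementation: the function names (`kmers`, `build_index`, `identify`), the in-memory dictionary, and the toy sequences are all assumptions made for this example.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of species whose sequence contains it."""
    index = defaultdict(set)
    for species, sequence in references.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(species)
    return index


def identify(query, index, k):
    """Count, per species, how many distinct query k-mers it contains."""
    hits = defaultdict(int)
    for kmer in set(kmers(query, k)):
        # .get avoids creating empty entries for unseen k-mers.
        for species in index.get(kmer, ()):
            hits[species] += 1
    return dict(hits)


# Toy sequences, invented for illustration only.
references = {
    "species_a": "ACGTACGTGA",
    "species_b": "TTGACCAGTA",
}
index = build_index(references, k=4)
print(identify("CGTGA", index, k=4))  # {'species_a': 2}
```

The same idea applies to proteomic data by using amino acid strings instead of nucleotides. A production system would replace the dictionary with a compact on-disk structure.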

Technical Implementation

The database is built with compression procedures that achieve a 100-fold reduction in data size while maintaining query performance. This makes it practical to store and analyze k-mer information from across the entire tree of life.
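
This page does not describe the compression procedures themselves, so the following is only a generic illustration of how k-mer data can be stored compactly, not the method behind the 100-fold figure. It sketches 2-bit packing of DNA k-mers, a common technique that stores each base in 2 bits instead of one 8-bit character (a 4-fold saving on its own). The names `encode_kmer` and `decode_kmer` are hypothetical.

```python
# Each nucleotide fits in 2 bits: A=00, C=01, G=10, T=11.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode_kmer(kmer):
    """Pack a DNA k-mer into a single integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value


def decode_kmer(value, k):
    """Recover the k-mer string from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 3])  # read the lowest 2 bits
        value >>= 2
    return "".join(reversed(bases))


packed = encode_kmer("ACGT")
print(packed)                  # 27
print(decode_kmer(packed, 4))  # ACGT
```

The length `k` has to be stored or known separately, because leading `A` bases encode as zero bits and would otherwise be lost on decoding.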

Applications

  • Species identification and authentication
  • Comparative genomics
  • Evolutionary studies
  • Molecular diagnostics
  • Environmental monitoring
  • Food authentication

Publications

Mouratidis, I., Baltoumas, F. A., Chantzi, N., et al. (2024). kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Computational and Structural Biotechnology Journal, 23.

Authors

Ioannis Mouratidis
Senior Research Engineer/Scientist Associate
Machine learning and genomics researcher with 35 publications (10 as first or senior author). Co-founded an AI-driven cancer biomarker startup, authored grants securing $4M+ in competitive funding, and currently leads a 5-member team focused on developing novel computational methods and testing the capabilities and safety profiles of biological foundation models.