kmerDB

Jan 2024

Overview

kmerDB is a database that consolidates genomic and proteomic k-mer sequence information for every species in GenBank and UniProt. It supports rapid species identification, comparative genomic studies, and evolutionary analysis.

Features

  • Comprehensive Coverage: Encompasses k-mer data from all species in GenBank and UniProt
  • Dual Coverage: Includes both genomic (DNA) and proteomic (amino acid) sequences
  • Fast Queries: Optimized data structures enable rapid k-mer lookups
  • Species Identification: Enables efficient molecular diagnostics and species authentication
  • 100-fold Compression: Novel compression procedures reduce data storage requirements dramatically
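
To make the lookup and species-identification features concrete, here is a minimal sketch of the general idea: break each reference sequence into overlapping k-mers, index them by species, and score a query by how many of its k-mers hit each species. This is an illustration under stated assumptions, not kmerDB's actual data structures or API; the function names (`kmers`, `build_index`, `identify`) and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of species whose sequence contains it."""
    index = defaultdict(set)
    for species, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(species)
    return index


def identify(query, index, k):
    """Count, per species, how many k-mers of the query appear in the index."""
    hits = Counter()
    for kmer in kmers(query, k):
        # .get avoids creating empty entries in the defaultdict for misses.
        for species in index.get(kmer, ()):
            hits[species] += 1
    return hits


# Toy sequences invented for illustration; real references would be genomes
# or proteomes.
references = {
    "species_a": "ACGTACGTGA",
    "species_b": "TTGACCATGA",
}
index = build_index(references, k=4)
hits = identify("GACCATG", index, k=4)
print(hits.most_common(1))
```

In this toy case all four 4-mers of the query occur only in `species_b`, so it is the top-scoring candidate. The same approach applies to amino acid sequences, since nothing in the sketch depends on the alphabet.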

Technical Implementation

The database is built with compression procedures that reduce the data volume 100-fold while maintaining query performance. This reduction makes it practical to store and analyze k-mer information for every species covered by the source databases.
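
The page does not describe the compression procedure itself. As a generic illustration of why k-mer data compresses well, the sketch below packs a DNA k-mer into an integer using 2 bits per base, a common technique that is 4 times smaller than 8-bit text. It is an assumption chosen for illustration, not the method used by kmerDB, and on its own it does not reach a 100-fold reduction. The names `pack` and `unpack` are hypothetical.

```python
# 2 bits are enough to distinguish the four DNA bases.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer.upper():
        value = (value << 2) | ENCODE[base]
    return value


def unpack(value, k):
    """Recover the k-mer of length k from its packed integer."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])  # read the lowest 2 bits
        value >>= 2
    # Bases were read last-to-first, so reverse them.
    return "".join(reversed(bases))


packed = pack("ACGT")
restored = unpack(packed, 4)
print(packed, restored)
```

The length `k` has to be stored or known separately, because leading `A` bases encode as zero bits and would otherwise be lost. Sequences containing ambiguous bases such as `N` raise a `KeyError` in this sketch and would need separate handling.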

Applications

  • Species identification and authentication
  • Comparative genomics
  • Evolutionary studies
  • Molecular diagnostics
  • Environmental monitoring
  • Food authentication

Publications

Mouratidis, I., Baltoumas, F. A., Chantzi, N., et al. (2024). kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Computational and Structural Biotechnology Journal, 23.

Authors

Ioannis Mouratidis
Research Engineer
Research engineer focused on AI safety, alignment, and AIxBio security. I study how capabilities and values emerge during model training and build scalable interventions to align frontier models, drawing on deep expertise in biosecurity and biological foundation models. 38 publications (12 first or senior author) and 3 patents in large-scale data analysis and ML for biology; co-founded an AI-driven cancer-diagnostics startup and authored grants securing $4M+ in competitive funding.