Home Applications DNA-similarity-and-classify


This application is not supported by InterSystems Corporation. Please be notified that you use it at your own risk.
1 reviews
IPM installs
Pull requests
Classify gene family find similar DNAs with Vector Search and ML

What's new in this version

Initial Release

DNA similarity and classify

The DNA Similarity and Classification is a REST API that utilizes InterSystems Vector Search technology. Its primary aim is to identify genetic similarities by analyzing a vector database derived from interpreting a dataset obtained during the implementation of the project: Human DNA Sequences Dataset.



To download the project, use the following command in the terminal:

git clone https://github.com/Davi-Massaru/DNA-similarity-and-classify.git

To run the project, use the command:

docker-compose up -d --build

ℹ️ Attention: The project execution process can take a long time due to the DNA gene vectorization time, between 30 and 40 minutes.

Video Demo

To understand how the project works, please watch the demonstration at this link.


Vector Search

The system is capable of receiving a DNA sequence as input and, through IRIS Vector Search, identifying DNAs with similar genes. This functionality enables rapid identification of patterns and similarities among the provided genetic sequences.

Initially, a long biological sequence is partitioned into “words” of length k with overlap. For instance, when employing “words” of length 6 (referred to as hexamers), the sequence “ATGCATGCA” is segmented into: ‘ATGCAT’, ‘TGCATG’, ‘GCATGC’, ‘CATGCA’. Consequently, the example sequence is divided into 4 hexamer words. While hexamers are utilized here, the word length is arbitrary and can be adjusted to accommodate specific requirements. Both the word length and the degree of overlap must be determined empirically for each application. In genomics, these procedures are commonly known as “k-mer counting,” where the frequencies of each possible k-mer are tallied.

In this project, k-mer counting serves as a foundational technique for processing genetic sequence data, facilitating the comparative analysis of DNA.

Exemplo de imagem

DNA sequences are vectorized based on k-mers, utilizing IRIS vector search for storage and retrieval, enabling the identification of DNAs similar to the received DNA within the database.

Machine Learning

Furthermore, the system provides a solution for processing and analyzing genetic sequence data using machine learning. Definitions are provided for each of the seven classes, along with their respective quantities in human training data. These classes represent functional groups of genetic sequences.

Exemplo de imagem

  • G protein coupled receptors: G protein-coupled receptors are a class of cell membrane proteins that detect signaling molecules and activate signal transduction pathways within the cell.

  • tyrosine kinase: Tyrosine kinases are enzymes that transfer phosphate groups from ATP to tyrosine residues in target proteins, playing important roles in regulating cellular processes such as growth, differentiation, and cell death.

  • tyrosine phosphatase: Tyrosine phosphatases are enzymes that remove phosphate groups from tyrosine residues in proteins, playing a role in regulating cellular processes opposite to tyrosine kinases.

  • synthetase: Synthetases are enzymes that catalyze the synthesis of complex molecules from simpler units, typically utilizing ATP energy.

  • synthase: Similar to synthetases, synthases also catalyze synthesis reactions but typically do not require ATP energy.

  • ion channel: Ion channels are cell membrane proteins that allow selective transport of ions across the membrane, playing a crucial role in regulating membrane potential and transmitting electrical signals in cells.

  • transcription factor: Transcription factors are proteins that bind to DNA and control gene transcription, thereby regulating gene expression and influencing various cellular and developmental processes.

Machine learning model validation

  Predicted   0    1   2    3    4   5    6
  0          99    0   0    0    1   0    2
  1           0  104   0    0    0   0    2
  2           0    0  78    0    0   0    0
  3           0    0   0  124    0   0    1
  4           1    0   0    0  143   0    5
  5           0    0   0    0    0  51    0
  6           1    0   0    1    0   0  263
  accuracy = 0.984 
  precision = 0.984 
  recall = 0.984 
  f1 = 0.984

Sequence Diagram

    actor user
    user ->> API: Sends DNA sequence
API->dc.data.HumanDNA: Sends DNA sequence
dc.data.HumanDNA->Vector Search: Sends DNA k-mer vectors
Vector Search-->API: Returns similar DNAs

API->dc.data.DNAClassification: Sends DNA sequence
dc.data.DNAClassification->ModelML: Sends DNA sequence

ModelML-->API: Provides DNA class classification and probability

API-->user: Returns similar DNAs to the input and their classification

Data Input

The API can be accessed via the following address:


  • dna: Enter the DNA sequence you want to search



Data Output

When making the request, consider the following response format in JSON:

    "DNA_similarity_list": [
            "dnaClass": "tyrosine phosphatase"
            "dnaClass": "tyrosine phosphatase"
            "dnaClass": "tyrosine phosphatase"
            "dnaClass": "tyrosine phosphatase"
            "dnaClass": "tyrosine phosphatase"
    "DNA_classification": {
        "Classification": "tyrosine phosphatase",
        "Probabilities": {
            "G protein coupled receptors": 6.518793490667266e-33,
            "tyrosine kinase": 1.2683030622550353e-41,
            "tyrosine phosphatase": 1.0,
            "synthetase": 1.4080437446499362e-34,
            "synthase": 1.0239922685101042e-40,
            "lon channel": 1.155073717874559e-46,
            "transcription factor": 9.150946071328282e-44

Technologies Used

  • InterSystems IRIS:
    Used for creating the vectorized database and structuring the REST API.
  • Docker Container: Used to create the environment and IRIS application, so that, through a single command, “docker-compose up,” the entire project is ready for use.
  • Embedded Python: Used to create all application scripts, such as: Assembling the vectorized database by inserting VECTOR type data.
  • Vector search: Used as a mechanism to search for DNA sequences, comparing similar genetic chromosomes in terms of genetic semantics.

Team Members

GitHub : Davi Muta

GitHub : Nicole Alves

Read more
Made with
1.0.012 May, 2024
Works with
InterSystems IRISInterSystems IRIS for Health
First published
12 May, 2024