DNA sequence Gene finder

This application is not supported by InterSystems Corporation. Please be notified that you use it at your own risk.
Find certain genes in DNA sequences

What's new in this version

Initial Release

🧬 DNA SEQUENCE ANALYZER

🧠 Overview

The sequencer_handler.py module continuously watches for new files generated by the sequencer. When it detects a new file, it loads the file and computes a vector representation of each DNA segment.
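The monitoring and vectorization steps can be sketched as follows. This is only an illustration, not the code in sequencer_handler.py: the function names `find_new_files` and `kmer_vector` are hypothetical, and the k-mer count vector is a simple stand-in for whatever embedding the module actually computes (the references suggest a transformer model such as DNABERT-2).

```python
import itertools
import os

K = 3  # k-mer length; an illustrative choice, not taken from the project
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def find_new_files(directory, seen):
    """Return paths in `directory` that are not yet in `seen`, and record them."""
    new_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_paths.append(path)
    return new_paths


def kmer_vector(segment):
    """Count overlapping k-mers; k-mers containing non-ACGT characters are skipped.

    Assumption: this stand-in replaces the project's real embedding model.
    """
    vector = [0.0] * len(KMERS)
    segment = segment.upper()
    for i in range(len(segment) - K + 1):
        index = KMER_INDEX.get(segment[i:i + K])
        if index is not None:
            vector[index] += 1.0
    return vector
```

A handler could call `find_new_files` in a loop with a short sleep between iterations and pass each segment of every new file to `kmer_vector` before storing the result.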

Simultaneously, solver.py retrieves the sequences most similar to the gene currently being analyzed. It then runs the Smith–Waterman algorithm on these candidates to determine whether any of them contains the gene.
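For readers unfamiliar with Smith–Waterman, a minimal scoring sketch is shown here. It is not the implementation in solver.py: the scoring values (match 2, mismatch -1, linear gap -2) and the `min_identity` threshold in the hypothetical helper `contains_gene` are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


def contains_gene(sequence, gene, min_identity=0.9, match=2):
    """Heuristic (an assumption): the gene counts as found when the local
    alignment score reaches a fraction of the score of a perfect match."""
    score = smith_waterman_score(sequence, gene, match=match)
    return score >= min_identity * match * len(gene)
```

Only two rows of the scoring matrix are kept in memory, which is enough when the score alone is needed and no traceback of the alignment is required.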

The output of solver.py lists the genes identified in the bacteria specified by barcode.

The iris_database.py file manages the entire vector database and serves as the interface for querying and modifying it. Internally, it uses the InterSystems IRIS vector database, which provides efficient storage, querying, and management of vector data.
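As a rough sketch of what a similarity query against IRIS might look like, the helpers below build a top-k SQL statement using the IRIS SQL functions `VECTOR_DOT_PRODUCT` and `TO_VECTOR`. The helper names, the table name, and the `barcode` and `segment` columns are hypothetical; the actual schema and query in iris_database.py may differ.

```python
def vector_literal(vector):
    """Format numbers as the comma-separated string passed to TO_VECTOR."""
    return ",".join(repr(float(x)) for x in vector)


def build_similarity_query(table, vector_column, top_k):
    """Build a top-k similarity query.

    Assumptions: `table` and `vector_column` come from trusted code (they are
    interpolated directly), and the table has `barcode` and `segment` columns.
    The query vector itself is bound as a parameter (`?`).
    """
    return (
        f"SELECT TOP {int(top_k)} barcode, segment "
        f"FROM {table} "
        f"ORDER BY VECTOR_DOT_PRODUCT({vector_column}, TO_VECTOR(?, DOUBLE)) DESC"
    )
```

With a DB-API style cursor connected to IRIS, the statement would typically be executed as `cursor.execute(sql, [vector_literal(query_vector)])`.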

[Figure: DNA_diagram]

💻 Installation

Install dependencies:

pip install -r requirements

Download and set up the InterSystems IRIS database container: https://github.com/intersystems-community/hackathon-2024/tree/main?tab=readme-ov-file

Run scripts:

python sequencer_handler.py
python solver.py

⏱ Comparison

This setup makes it possible to compare standard sequence alignment on its own against our vector-based solution. Vector representations allow faster, more efficient searches for similar sequences, especially in large datasets. Traditional alignment algorithms are computationally intensive when run against every sequence in a large collection. The vector search instead narrows the collection to a small set of likely matches, which are then verified with a more precise alignment (such as the Smith–Waterman algorithm).
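The two-stage idea described above (a cheap vector prefilter followed by precise verification) can be sketched in plain Python. The function names are hypothetical, and the in-memory ranking by cosine similarity stands in for the database query; `verify` is any callable, for example a Smith–Waterman based check.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def two_stage_search(query_vector, candidates, verify, top_k=10):
    """Rank (sequence, vector) pairs by similarity to the query vector, then
    run the expensive `verify` check only on the top_k most similar sequences."""
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [sequence for sequence, _ in ranked[:top_k] if verify(sequence)]
```

The cost of the precise alignment then depends on `top_k` rather than on the size of the whole collection, which is where the expected speed-up comes from.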

Times will be added here.

📘 References

Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195-7. doi: 10.1016/0022-2836(81)90087-5. PMID: 7265238. [https://pubmed.ncbi.nlm.nih.gov/7265238/]

Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, & Han Liu. (2024). DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. [https://arxiv.org/abs/2306.15006]

Pavan Holur, K. C. Enevoldsen, Shreyas Rajesh, Lajoyce Mboning, Thalia Georgiou, Louis-S. Bouchard, Matteo Pellegrini, and Vwani Roychowdhury. (2024). Embed-Search-Align: DNA Sequence Alignment using Transformer Models. [https://arxiv.org/abs/2309.11087]

Version: 1.0.0 (10 Nov, 2024)
Category: Technology Example
Works with: InterSystems IRIS
First published: 10 Nov, 2024