DNA sequence Gene finder

InterSystems does not provide technical support for this project. Please contact its developer for technical assistance.

Find certain genes in DNA sequences

What's new in this version

Initial Release

🧬 DNA SEQUENCE ANALYZER

🧠 Overview

The sequencer_handler.py module continuously watches for new files generated by the sequencer. When it detects a new file, it loads the file and computes a vector representation (embedding) of each DNA segment.
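The watch-and-vectorize loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the polling approach, and the input layout (one segment per line, with optional FASTA-style ">" headers) are hypothetical, and the normalized 3-mer count vector is a simple stand-in for whatever embedding sequencer_handler.py actually computes.

```python
import itertools
import time
from pathlib import Path

# Assumption: a normalized 3-mer count vector stands in for the real embedding.
K = 3
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_vector(segment: str) -> list[float]:
    """Return the normalized k-mer frequency vector of a DNA segment."""
    counts = [0.0] * len(KMERS)
    segment = segment.upper()
    for i in range(len(segment) - K + 1):
        idx = KMER_INDEX.get(segment[i:i + K])
        if idx is not None:  # skip k-mers containing N or other symbols
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def poll_new_files(directory: Path, seen: set[Path]) -> list[Path]:
    """Return files in `directory` not seen before, and mark them as seen."""
    new_files = sorted(
        p for p in directory.iterdir() if p.is_file() and p not in seen
    )
    seen.update(new_files)
    return new_files


def watch(directory: Path, interval_s: float = 1.0) -> None:
    """Poll forever, vectorizing every segment line of each new file."""
    seen: set[Path] = set()
    while True:
        for path in poll_new_files(directory, seen):
            for line in path.read_text().splitlines():
                if line and not line.startswith(">"):  # skip header lines
                    vector = kmer_vector(line)
                    print(path.name, len(vector))
        time.sleep(interval_s)
```

Calling watch(Path("incoming")) would start the loop on a hypothetical "incoming" directory. Polling keeps the sketch dependency-free; a file-system event library would be a reasonable alternative.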

Simultaneously, solver.py retrieves the stored sequences whose vectors are most similar to the gene currently being analyzed. It then runs the Smith–Waterman algorithm on these candidates to determine whether any of them contains the gene.
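For readers unfamiliar with the verification step, here is a minimal Smith–Waterman scoring sketch. The scoring parameters (match, mismatch, linear gap penalty) and the contains_gene threshold are illustrative assumptions, not values taken from solver.py.

```python
def smith_waterman(query: str, target: str,
                   match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score using a linear gap penalty."""
    prev = [0] * (len(target) + 1)  # previous row of the scoring matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always 0
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # Local alignment: a cell never drops below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def contains_gene(gene: str, sequence: str,
                  min_fraction: float = 0.9, match: int = 2) -> bool:
    """Hypothetical decision rule: the gene counts as found when the local
    score reaches a fraction of the score of a perfect full-length match."""
    perfect = match * len(gene)
    return smith_waterman(gene, sequence, match=match) >= min_fraction * perfect
```

Only two rows of the matrix are kept in memory, which is enough when just the score is needed. Recovering the aligned region itself would require the full matrix and a traceback.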

The output from solver.py indicates which genes were identified in the specified bacteria (selected by barcode).

The iris_database.py file manages the entire vector database and serves as the interface for querying and modifying it. Internally, it uses the InterSystems IRIS vector database, which provides efficient storage, querying, and management of vector data.
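As a rough idea of what such an interface might send to the database, the sketch below only builds SQL strings and does not open a connection. The table name, column names, and helper functions are hypothetical and not taken from iris_database.py. The VECTOR column type and the TO_VECTOR and VECTOR_COSINE functions come from IRIS SQL vector search, but the exact syntax should be checked against the IRIS version in use.

```python
from typing import Sequence


def create_table_sql(table: str, dim: int) -> str:
    """DDL for a hypothetical table holding one embedding per DNA segment."""
    return (
        f"CREATE TABLE {table} ("
        "barcode VARCHAR(64), "
        "segment VARCHAR(4000), "
        f"embedding VECTOR(DOUBLE, {int(dim)}))"
    )


def insert_sql(table: str) -> str:
    """Parameterized insert; the vector is passed as a comma-separated string."""
    return (
        f"INSERT INTO {table} (barcode, segment, embedding) "
        "VALUES (?, ?, TO_VECTOR(?, DOUBLE))"
    )


def top_k_sql(table: str, k: int) -> str:
    """Query for the k stored segments most similar to a query vector."""
    return (
        f"SELECT TOP {int(k)} barcode, segment FROM {table} "
        "ORDER BY VECTOR_COSINE(embedding, TO_VECTOR(?, DOUBLE)) DESC"
    )


def vector_literal(vector: Sequence[float]) -> str:
    """Serialize a vector as the comma-separated string TO_VECTOR expects."""
    return ",".join(str(x) for x in vector)
```

These statements would be executed through whichever Python database driver the project uses. The table name is interpolated directly into the SQL, so it must come from trusted code rather than user input.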

💻 Installation

Install dependencies:

pip install -r requirements

Download and set up the InterSystems IRIS database container: https://github.com/intersystems-community/hackathon-2024/tree/main?tab=readme-ov-file

Run scripts:

python sequencer_handler.py
python solver.py

⏱ Comparison

This approach makes it possible to compare standard sequence alignment on its own against our vector-based solution. Traditional alignment algorithms are computationally intensive and become slow for large-scale comparisons. With vector representations, we first run a fast similarity search to identify potential matches, even in large datasets, and then apply a more precise alignment (such as the Smith–Waterman algorithm) only to those candidates for verification.
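The two-stage idea can be illustrated with a small in-memory sketch. Everything here is hypothetical: the function names, the candidate format, and the verify callback, which stands in for the alignment check. In the project itself the ranking is done by the vector database rather than by sorting a Python list.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prefilter_then_verify(
    query_vec: Sequence[float],
    candidates: list[tuple[str, Sequence[float]]],
    k: int,
    verify: Callable[[str], bool],
) -> list[str]:
    """Rank candidates by vector similarity, then run the expensive
    `verify` check only on the k most similar sequences."""
    ranked = sorted(
        candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True
    )
    return [seq for seq, _ in ranked[:k] if verify(seq)]
```

With N stored sequences, the alignment-only baseline runs the expensive check N times, while this scheme runs it at most k times. The trade-off is that a true match ranked below position k by the embedding is missed.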

Times will be added here.

📘 References

Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195-7. doi: 10.1016/0022-2836(81)90087-5. PMID: 7265238. [https://pubmed.ncbi.nlm.nih.gov/7265238/]

Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, & Han Liu. (2024). DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. [https://arxiv.org/abs/2306.15006]

Pavan Holur, K. C. Enevoldsen, Shreyas Rajesh, Lajoyce Mboning, Thalia Georgiou, Louis-S. Bouchard, Matteo Pellegrini, and Vwani Roychowdhury. (2024). Embed-Search-Align: DNA Sequence Alignment using Transformer Models. [https://arxiv.org/abs/2309.11087]

Made with

Python

Version

1.0.0, 10 Nov, 2024
