Initial Release
The sequencer_handler.py module continuously monitors for new files generated by the sequencer. When a new file is detected, sequencer_handler.py loads it and calculates the vector representation of each DNA segment.
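The monitoring and embedding steps above can be sketched as follows. This is a minimal illustration and not the actual contents of sequencer_handler.py. It assumes a hypothetical input format of one DNA segment per line, polls the directory rather than using file-system events, and uses a normalized k-mer frequency vector as a simple stand-in for whatever embedding model the project uses. The names `embed_segment`, `find_new_files`, and `watch` are invented for this example.

```python
import itertools
import time
from pathlib import Path
from typing import Optional

K = 3  # k-mer length; an illustrative choice, not taken from the project
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def embed_segment(segment: str) -> list:
    """Return a normalized k-mer frequency vector (stand-in for a learned embedding)."""
    counts = [0.0] * len(KMERS)
    seq = segment.upper()
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip k-mers containing N or other symbols
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def find_new_files(directory: Path, seen: set) -> list:
    """Return files in `directory` that are not in `seen`, and mark them as seen."""
    new_files = sorted(p for p in directory.iterdir() if p.is_file() and p not in seen)
    seen.update(new_files)
    return new_files


def watch(directory: Path, interval_s: float = 1.0, max_polls: Optional[int] = None):
    """Poll `directory` and yield (path, segment, vector) for each line of each new file."""
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_files(directory, seen):
            # Assumption: one DNA segment per line; real sequencer output may differ.
            for line in path.read_text().splitlines():
                segment = line.strip()
                if segment:
                    yield path, segment, embed_segment(segment)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_s)
```

Calling `watch(Path("some_directory"))` without `max_polls` loops indefinitely, which matches the "continuously monitors" behavior described above. The `max_polls` argument exists only to make the sketch easy to try out.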
Simultaneously, solver.py retrieves the sequences most similar to the gene currently being analyzed. It then runs the Smith–Waterman algorithm on these candidate sequences to determine whether any of them contains the gene.
The output from solver.py indicates which genes have been identified in the specified bacteria (by barcode).
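The Smith–Waterman verification step can be illustrated with a small scoring function. This is a sketch under stated assumptions, not the code in solver.py. The scoring parameters (match 2, mismatch -1, linear gap -2) and the `contains_gene` threshold are illustrative choices, and the rule for deciding that a sequence "contains" a gene is a hypothetical heuristic.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def contains_gene(sequence: str, gene: str,
                  min_fraction: float = 0.9, match: int = 2) -> bool:
    """Hypothetical rule: the gene is present if the local score reaches
    `min_fraction` of the score a perfect match of the whole gene would get."""
    perfect = match * len(gene)
    return smith_waterman_score(sequence, gene, match=match) >= min_fraction * perfect


print(smith_waterman_score("TTACGTAA", "ACGT"))  # exact occurrence: 4 matches * 2 = 8
print(contains_gene("TTACGTAA", "ACGT"))         # True
```

Keeping only two rows of the matrix is enough when only the score is needed. Recovering the aligned region itself would require storing the full matrix and tracing back from the best cell.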
The entire vector database is controlled through the iris_database.py file, which acts as the interface for querying and manipulating it. Internally, it uses the InterSystems IRIS vector database solution for efficient storage, querying, and management of vector data.
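To show the kind of operations such an interface exposes, here is a small in-memory stand-in. It does not call InterSystems IRIS and does not reflect the real API of iris_database.py. The class name `InMemoryVectorIndex` and its methods are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorIndex:
    """Hypothetical stand-in for a vector database: store vectors, query by similarity."""

    def __init__(self):
        self._items = []  # list of (sequence_id, vector)

    def add(self, sequence_id: str, vector) -> None:
        self._items.append((sequence_id, list(vector)))

    def most_similar(self, query, top_k: int = 5):
        """Return up to `top_k` (sequence_id, similarity) pairs, most similar first."""
        scored = [(cosine_similarity(query, vec), sid) for sid, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(sid, score) for score, sid in scored[:top_k]]


index = InMemoryVectorIndex()
index.add("seq_a", [1.0, 0.0])
index.add("seq_b", [0.0, 1.0])
index.add("seq_c", [1.0, 1.0])
print([sid for sid, _ in index.most_similar([1.0, 0.1], top_k=2)])  # ['seq_a', 'seq_c']
```

This brute-force search compares the query against every stored vector. A dedicated vector database is used in the project so that storage and similarity queries are handled by the database instead.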
Install dependencies:
pip install -r requirements
Download and set up the InterSystems IRIS database container: https://github.com/intersystems-community/hackathon-2024/tree/main?tab=readme-ov-file
Run scripts:
python sequencer_handler.py
python solver.py
This approach enables a comparison between using only standard sequence alignment algorithms and our vector-based solution. Traditional alignment algorithms can be computationally intensive and slow for large-scale comparisons. Our vector-based solution instead uses vector representations to search large datasets for similar sequences quickly, identifying potential matches first and applying a more precise alignment (such as the Smith–Waterman algorithm) only to those matches for verification.
Times will be added here.
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195-7. doi: 10.1016/0022-2836(81)90087-5. PMID: 7265238. [https://pubmed.ncbi.nlm.nih.gov/7265238/]
Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, & Han Liu. (2024). DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. [https://arxiv.org/abs/2306.15006]
Pavan Holur, K. C. Enevoldsen, Shreyas Rajesh, Lajoyce Mboning, Thalia Georgiou, Louis-S. Bouchard, Matteo Pellegrini, and Vwani Roychowdhury. (2024). Embed-Search-Align: DNA Sequence Alignment using Transformer Models. [https://arxiv.org/abs/2309.11087]