oncorag

InterSystems does not provide technical support for this project. Please contact its developer for technical assistance.
IRIS-integrated RAG pipeline for oncology data curation

What's new in this version

Initial Release

Oncorag2 – Hybrid Clinical Feature Extraction with RAG + Rule-Based Reasoning

Challenge submission for the InterSystems AI Contest:
🧠 Hybrid approach for clinical data curation: combining RAG and rule-based methods

Oncorag2 is a hybrid system designed to extract and curate oncology-related clinical features by combining rule-based regular expressions with retrieval-augmented generation (RAG). It integrates structured and unstructured information, powered by LangChain, LLMs, and the InterSystems IRIS Vector Store.

License: MIT
Python 3.8+


🧠 Overview

Oncorag2 addresses a central challenge in clinical informatics:
How to extract accurate, structured data from complex and variable clinical documentation.

It employs a hybrid methodology that combines:

  • 🧠 Rule-Based Logic: Curated regular expressions provide deterministic and rapid extraction of key attributes (e.g., staging, mutations, treatment) across known sections.
  • 🔍 Retrieval-Augmented Generation (RAG): For ambiguous or unstructured data, relevant document chunks are retrieved using IRIS Vector Store and processed with an LLM for context-aware reasoning.
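
The two-tier approach above can be sketched as a regex pass with an optional fallback. This is a minimal illustration, not the project's implementation: the patterns in `PATTERNS`, the `extract_feature` function, and the `rag_fallback` callable are all hypothetical names, and the fallback stands in for whatever retrieval plus LLM step the project actually performs.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical patterns for illustration only; the project's curated
# regular expressions are not shown in this README.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "stage": re.compile(r"\bstage\s+(IV|III|II|I)([ABC])?\b", re.IGNORECASE),
    "egfr_mutation": re.compile(r"\bEGFR\s+(L858R|T790M)\b", re.IGNORECASE),
}


def extract_feature(
    name: str,
    text: str,
    rag_fallback: Optional[Callable[[str, str], Optional[str]]] = None,
) -> Optional[str]:
    """Try the deterministic regex first; defer to the fallback only on a miss."""
    match = PATTERNS[name].search(text)
    if match:
        return match.group(0)
    if rag_fallback is not None:
        # Assumed hook: in a real pipeline this would retrieve relevant
        # chunks and ask an LLM; here it is any callable supplied by the caller.
        return rag_fallback(name, text)
    return None


note = "Diagnosis: Stage IIIA adenocarcinoma of the lung. EGFR L858R mutation detected."
stage = extract_feature("stage", note)
mutation = extract_feature("egfr_mutation", note)
```

Trying the regex first keeps the common case fast and deterministic, and only notes that the patterns cannot handle incur the cost of a model call.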

Benefits:

  • 🧮 Efficient Inference: Targeted retrieval reduces token usage.
  • 🎯 Higher Accuracy: Answers are grounded in patient-specific context.
  • 💬 Natural-Language Interface: Supports conversational queries over both structured and unstructured data.

Key Capabilities:

  • 🧬 Generates LLM-defined clinical feature templates
  • 📑 Applies regex for deterministic extraction from clinical texts
  • 🤖 Combines rule-based and RAG outputs for robust coverage
  • 🧠 Uses IRIS Vector Store for fast, relevant context retrieval
  • 📉 Reduces hallucination risk by grounding responses
  • 🩺 Enables rich, explainable chat-based access to patient data

Use Cases:

  • Oncology cohort curation
  • Digitization of clinical notes
  • NLP research on hospital records
  • LLM-based clinical QA pipelines

⚙️ Installation

Note: The project includes both pip and Git-based dependencies. Use the provided setup scripts for a smooth experience.


🔧 Local Setup

python setup_local.py

This script will:

  • Create a virtual environment
  • Install dependencies
  • Prompt you to create a .env file if one is missing

Activate the environment:

  • macOS/Linux: source venv/bin/activate
  • Windows: .\venv\Scripts\activate

🐳 Docker Setup

python setup_docker.py

This builds and launches Docker containers for:

  • Clinical feature extraction backend
  • IRIS Vector Store
  • (Optional) Jupyter Notebook Server

Once running:

  1. Access the notebook server at http://localhost:8888.
  2. Run scripts inside the container:
    docker exec -it oncorag2-app bash
    python scripts/run_feature_generation.py
    
  3. Monitor logs:
    docker logs oncorag2-iris-1
    
  4. Shut down services:
    docker compose down
    

🛠 Manual Setup

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

🔁 DEMO: Script-Based Workflow

1️⃣ run_feature_generation.py

  • Guides definition of clinical features for your use case (e.g., NSCLC, breast cancer)
  • Uses smolagent to generate 5 novel features per round
  • Validates and stores feature schema (JSON in config/)
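
As a rough sketch of the validate-and-store step, the following checks a list of feature definitions before writing them to JSON. The schema layout is an assumption: the README does not specify the fields, so `REQUIRED_KEYS`, `validate_features`, `save_features`, and the `{"features": [...]}` wrapper are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Assumed minimal schema; the real feature schema may require other fields.
REQUIRED_KEYS = {"name", "description"}


def validate_features(features: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    seen = set()
    for i, feat in enumerate(features):
        missing = REQUIRED_KEYS - feat.keys()
        if missing:
            errors.append(f"feature {i}: missing {sorted(missing)}")
        name = feat.get("name")
        if name is not None:
            if name in seen:
                errors.append(f"feature {i}: duplicate name {name!r}")
            seen.add(name)
    return errors


def save_features(features: List[Dict[str, Any]], path: str) -> None:
    """Write the features as JSON, refusing to store an invalid schema."""
    errors = validate_features(features)
    if errors:
        raise ValueError("; ".join(errors))
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps({"features": features}, indent=2), encoding="utf-8")
```

Collecting all problems before raising lets a user fix a generated feature set in one pass instead of one error at a time.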

2️⃣ run_data_extraction.py

  • Uses the saved config to extract features from clinical notes
  • Converts and anonymizes PDFs
  • Applies rule-based extraction with fallbacks
  • Outputs both a CSV file and a vector store for downstream use
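
For the CSV side of the output, a minimal sketch might look like the following. The column layout (a `patient_id` column followed by one column per feature) and the `write_rows` helper are assumptions for illustration; the actual file produced by the script may be organized differently.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional, TextIO


def write_rows(
    rows: Iterable[Dict[str, Optional[str]]],
    feature_names: List[str],
    handle: TextIO,
) -> None:
    """Write one CSV row per patient; features that were not found stay empty."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["patient_id", *feature_names],
        restval="",             # features absent from a row become empty cells
        extrasaction="ignore",  # ignore keys that are not part of the schema
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_rows(
    [
        {"patient_id": "p1", "stage": "Stage II"},
        {"patient_id": "p2", "stage": None},  # extraction found nothing
    ],
    ["stage"],
    buffer,
)
lines = buffer.getvalue().splitlines()
```

Fixing the column order from the feature list keeps the CSV stable across runs even when individual notes yield different subsets of features.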

3️⃣ run_chatbot.py

  • Enables natural language querying of extracted data
  • Integrates structured CSV + contextual search via IRIS
  • Synthesizes grounded answers using the LLM
  • Supports optional verbose output for retrieved context

python scripts/run_chatbot.py --extracted-data output/extracted_data.csv --verbose
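
To make the contextual-search step concrete, here is a minimal in-memory sketch of top-k retrieval by cosine similarity. It is a stand-in only: the project performs this lookup through the IRIS Vector Store, whose API is not shown here, and the `cosine` and `top_k` helpers and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy embeddings; real ones would come from an embedding model.
chunks = [
    ("Pathology: adenocarcinoma", [1.0, 0.0]),
    ("Social history: non-smoker", [0.0, 1.0]),
    ("Biopsy confirms adenocarcinoma", [0.9, 0.1]),
]
context = top_k([1.0, 0.0], chunks, k=2)
```

Only the retrieved chunks would then be passed to the LLM together with the matching CSV rows, which is what keeps the answer grounded in patient-specific context.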

🔐 Environment Variables

Create a .env file in the root directory with the following:

OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=...
GROQ_API_KEY=...
HUGGINGFACE_API_KEY=...
COHERE_API_KEY=...
LOG_LEVEL=INFO
LOG_FILE=oncorag.log
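
A quick way to check a .env file before running the scripts is sketched below using only the standard library. Which keys are actually mandatory depends on the LLM provider you use, so the `required` list is an assumption, and `parse_env` and `missing_keys` are hypothetical helpers rather than part of the project.

```python
from typing import Dict, Iterable, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


sample = "OPENAI_API_KEY=abc\n# logging\nLOG_LEVEL=INFO\n"
env = parse_env(sample)
# Assumed requirement for illustration: one provider key plus the log level.
problems = missing_keys(env, ["OPENAI_API_KEY", "GROQ_API_KEY", "LOG_LEVEL"])
```

In practice you would read the file contents with `open(".env").read()` and pass them to `parse_env`.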

🪪 License

MIT License. See the LICENSE file: https://github.com/pgsalome/oncorag/blob/main/LICENSE

Version: 1.0.0 (30 Mar, 2025)
Category: Solutions
Works with: InterSystems Vector Search
First published: 30 Mar, 2025
Last edited: 30 Mar, 2025
Last checked by moderator: 14 Nov, 2025 (Doesn't work)