
HackUPC24_Klìnic

This application is not supported by InterSystems Corporation. Please be notified that you use it at your own risk.
Symptoms Clinical Trial Search Tool using Knowledge Graphs

What's new in this version

idea added

This document describes Klìnic, a project developed during the HackUPC 2024 hackathon.

Biomedical research is hard, but we can help. Klìnic is an integrated platform to get insights on clinical trial trends in a scoped domain, helping design new experiments and analyze past failures.

The idea is to help clinicians and researchers easily get an overview of the landscape of clinical research trials in a certain field. They just have to input a general description of a disease, such as "A disease that affects young patients, generally male Caucasians". We compare the embedding of that statement with the embeddings of disease descriptions and retrieve the diseases whose descriptions are most similar to it. Then, we use a knowledge graph that represents the relationships between diseases to find the diseases most similar to the ones the user is interested in. This lets us augment the initial set of diseases and find more clinical trials related to it. Finally, we use a language model to summarize the clinical trials and extract numerical data from them.

Inspiration

Our team is composed of students from different backgrounds, including computer science, mathematics, and biomedicine. We wanted to create a tool that helps clinicians and researchers easily get an overview of the landscape of clinical research trials in a certain field. We believe that this tool could help them design new experiments and analyze past failures.

What it does

This tool is an integrated platform to get insights on clinical trial trends in a certain domain (for example, diseases that affect young females). The user just has to input a general description of a disease, such as "A disease that affects young patients, generally females, showing symptoms of fatigue and muscle pain".

We retrieve the diseases whose descriptions are most similar to that statement by comparing the embeddings of the descriptions. Then, we use a knowledge graph that represents the relationships between diseases to find the diseases most similar to the ones the user is interested in. This lets us augment the initial set of diseases and find more clinical trials related to it. We then use a language model to summarize the clinical trials and extract numerical data from them.

How we built it

We wrote the whole frontend (using Streamlit) and most of the backend in Python, with some backend parts in MATLAB. Our system first preprocesses the data that is fed into IRIS. The first input is our knowledge graph, which holds information about the relationships between different diseases. For this, we downloaded the MedGen dataset and trained an embedding model. We took the same approach for the clinical trials (source) to represent their relationships.

The heart of our logic comprises the following nine steps:

  1. Embed the textual description that the user entered using the model.
  2. Using a similarity threshold, get the top-k diseases with the highest cosine similarity from the DB.
  3. Get the pairwise similarities between those diseases (cosine similarity of the embeddings of their nodes).
    1. We also show these as a correlation heatmap on our frontend.
  4. Potentially filter out the diseases that are not similar enough (e.g., similarity < 0.8).
  5. Augment the set of diseases: add new diseases that are similar to the ones already in the set until we reach a defined threshold.
    1. We also show the selected diseases in a graph view, whose UI was built in MATLAB.
  6. Query the embeddings of the diseases related to each clinical trial (also in the DB) to get the clinical trials most similar to our set of diseases.
  7. Use an LLM to get a summary of the clinical trials in plain text format.
  8. Use an LLM to extract statistical insights from the clinical trials (e.g., average minimum and maximum age of patients, average timeframe of the trials, most common gender in the trials).
  9. Show the results to the user in the web app we built. Salient features of the web interface include a graph of the chosen diseases, a summary of the clinical trials, statistical insights about the clinical trials, and a list with the details of the clinical trials considered.
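Steps 2, 4, and 5 above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's actual code: the function names, the toy two-dimensional embeddings, and the greedy "closest to any selected disease" rule for augmentation are all hypothetical, and in Klìnic itself the similarity search runs inside IRIS rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3, threshold=0.8):
    """Steps 2 and 4: keep diseases at or above the threshold, best first, at most k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


def augment(selected, embeddings, max_size=4, threshold=0.8):
    """Step 5 (assumed greedy rule): repeatedly add the unselected disease that is
    closest to any selected one, until max_size is reached or nothing passes the threshold."""
    result = list(selected)
    if not result:
        return result
    while len(result) < max_size:
        best_name, best_score = None, threshold
        for name, vec in embeddings.items():
            if name in result:
                continue
            score = max(cosine_similarity(vec, embeddings[s]) for s in result)
            if score >= best_score:
                best_name, best_score = name, score
        if best_name is None:
            break
        result.append(best_name)
    return result


# Toy data: hypothetical disease names and 2-d embeddings.
embeddings = {
    "disease_a": [1.0, 0.0],
    "disease_b": [0.9, 0.1],
    "disease_c": [0.6, 0.8],
    "disease_d": [0.0, 1.0],
}
seed = [name for name, _ in top_k_similar([1.0, 0.0], embeddings, k=2, threshold=0.8)]
expanded = augment(seed, embeddings, max_size=3, threshold=0.65)
```

The two thresholds are deliberately separate parameters here: the first controls how closely a disease must match the user's description, the second how closely a newly added disease must match one already in the set.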

Our setup relies on the demo provided by InterSystems. As the steps above show, we used IRIS vector search at several points to compute the similarities.
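As a rough sketch of what such a vector search query can look like, the helper below builds an IRIS SQL statement that ranks rows by cosine similarity to a query vector, using the `VECTOR_COSINE` and `TO_VECTOR` SQL functions. The table and column names (`Klinic.Diseases`, `embedding`, `name`) and both helper functions are hypothetical and are not taken from the project's repository; the element type passed to `TO_VECTOR` is assumed to match the type of the stored vectors.

```python
def build_similarity_query(table, vector_column, label_column, top_k):
    """Build an IRIS SQL statement returning the top_k rows most similar to a query vector.

    The query vector itself is left as a single `?` placeholder to be bound at execution time.
    """
    for identifier in (table, vector_column, label_column):
        # Identifiers cannot be bound as parameters, so only allow letters, digits, "_" and ".".
        if not identifier.replace("_", "").replace(".", "").isalnum():
            raise ValueError(f"unsafe identifier: {identifier!r}")
    if not isinstance(top_k, int) or top_k <= 0:
        raise ValueError("top_k must be a positive integer")
    return (
        f"SELECT TOP {top_k} {label_column} "
        f"FROM {table} "
        f"ORDER BY VECTOR_COSINE({vector_column}, TO_VECTOR(?, DOUBLE)) DESC"
    )


def vector_to_param(vector):
    """Serialize an embedding as the comma-separated string that TO_VECTOR parses."""
    return ",".join(str(float(x)) for x in vector)


# Hypothetical table and column names, for illustration only.
query = build_similarity_query("Klinic.Diseases", "embedding", "name", 5)
param = vector_to_param([1, 0.5])
```

The statement and the serialized vector would then be passed to whichever database driver the application uses, with `param` bound to the `?` placeholder.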

Key takeaways

  • We used the LangChain "stuff chain summarizer" with GPT-4 Turbo to get the best possible text summarization results. It could be extended with a recursive character text splitter to reduce the token count of very large inputs, which is future work for the project.
  • We used the LangChain "tagging classes" functionality to output specific statistics from our raw JSON data. It gives us statistical insights into all the clinical trials similar to the disease description that was entered.
  • We used InterSystems IRIS vector search, which computes cosine similarity internally, to find the similarity between different diseases from their embeddings stored in the IRIS vector database.
  • We also queried the InterSystems IRIS vector database with SQL to get valuable insights from the embeddings.
  • Due to hardware constraints, we used OpenAI embeddings with a batch size of 64 and an embedding length of 128 to encode the text prompts into vectors. Learning and building on a bigger feature space is future work for this product.
  • We also explored FAISS, a library for vector similarity search that supports semantic search.
  • We offer a MATLAB app designed to visualize the significant connections among diseases. Simply input the disease's code name, and the app generates a node graph illustrating all the direct relationships of the entered disease.
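The direct-relationship lookup behind the last point can be illustrated with a small Python sketch. It is not the MATLAB app itself: the edge list, the placeholder codes `D1` to `D4`, and both function names are assumptions made for illustration, and relationships are treated as undirected.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (code, code) pairs into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def direct_relations(adjacency, code):
    """Return the sorted codes directly connected to `code` (empty list if unknown)."""
    return sorted(adjacency.get(code, set()))


# Hypothetical disease codes, for illustration only.
edges = [("D1", "D2"), ("D1", "D3"), ("D3", "D4")]
adjacency = build_adjacency(edges)
neighbours = direct_relations(adjacency, "D1")
```

Here `neighbours` holds only the one-hop neighbours of `D1`; `D4` is two hops away and is therefore not included.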

This project was built at the HackUPC 2024 hackathon: https://devpost.com/software/x-grmvsx

Repo

https://huggingface.co/spaces/klinic-hackupc/klinic/tree/main

Video Demo

https://www.youtube.com/live/iS6RO9GHTs0?si=SNjv341-GSMpkPd0&t=1293

Setup

  1. Get pre-processed data (large files are stored using Git LFS - you need to have it installed)
    git lfs fetch --all
    git lfs checkout
    
  2. Start the IRIS Docker container:
    docker-compose up -d
    
  3. Open a Jupyter notebook we've created to populate the database and run all the cells: database.ipynb
  4. Create a .env file and add the following variable:
    OPENAI_API_KEY=<your-openai-api-key>
    
  5. Run the app: streamlit run app.py
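Before starting the app, it can help to confirm that the .env file from step 4 actually defines the key. The snippet below is a minimal, stdlib-only sketch of such a check; it is an assumption made for illustration and not how the app itself loads its configuration, and both function names are hypothetical.

```python
def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blanks, comments, and lines without '='."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=("OPENAI_API_KEY",)):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Example with in-memory lines; for a real file, pass open(".env") instead.
example = parse_env_lines(["# local settings", "OPENAI_API_KEY=sk-test", ""])
problems = missing_keys(example)
```

An empty `problems` list means every required variable is present and non-empty.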
Version
1.0.4 (15 May, 2024)
Ideas portal
https://ideas.intersystems.com/ideas/DPI-I-555
Category
Frameworks
Works with
InterSystems Vector Search
First published
10 May, 2024