This repository is a collection of various Data Science, Data Analytics, AI, and LLM-based experiments (RAG, Fine-Tuning) in the Cosmology and Extragalactic Astronomy domain.
The image above is a Hubble Space Telescope image of the Star-Forming Region LH 95 in the Large Magellanic Cloud.
Assemble Cosmology-related abstracts from the arXiv dataset (Kaggle, Cornell)
- Notebook and script (get_cosmo_data_from_arxiv.*) uploaded to arxiv_project/code.
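A minimal sketch of what the filtering step could look like. It assumes the Kaggle arXiv metadata snapshot is a JSON Lines file whose records carry a space-separated `categories` field, and that `astro-ph.CO` is the category of interest. The function name and the sample records are hypothetical; get_cosmo_data_from_arxiv.* may do this differently.

```python
import json
from typing import Iterable, Iterator

COSMO_CATEGORY = "astro-ph.CO"  # assumption: arXiv category used for cosmology


def filter_cosmology(lines: Iterable[str], category: str = COSMO_CATEGORY) -> Iterator[dict]:
    """Yield id/title/abstract for records tagged with `category`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # `categories` is a space-separated string, e.g. "astro-ph.CO gr-qc"
        if category in record.get("categories", "").split():
            yield {
                "id": record.get("id"),
                # collapse the line breaks and repeated spaces found in raw metadata
                "title": " ".join(record.get("title", "").split()),
                "abstract": " ".join(record.get("abstract", "").split()),
            }


# Hypothetical sample records, for illustration only
sample = [
    json.dumps({"id": "0001", "title": "CMB\n lensing", "abstract": "We study ...",
                "categories": "astro-ph.CO gr-qc"}),
    json.dumps({"id": "0002", "title": "Graph theory", "abstract": "We prove ...",
                "categories": "math.CO"}),
]
cosmo = list(filter_cosmology(sample))
print([r["id"] for r in cosmo])
```

For the real file, the same function can be fed an open file handle, since it reads one line at a time and never loads the full snapshot into memory.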
Build a basic chatbot (no memory) with LangChain and Ollama embeddings, running locally on a Mac, using Groq (LPU inference) for the LLM and Gradio for the interface
- Notebook and script (chatbot_cmb_basic.*) uploaded to cmb_rag/code. Relevant CMB review papers are in cmb_rag/cmb_data.
The above is a screenshot of the RAG QA (using Gradio)
Create a vector database from the Cosmology arXiv abstracts (~66k abstracts) and persist it using Chroma
- Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.
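A rough sketch of persisting abstracts in a Chroma collection, assuming the `chromadb` package and its `PersistentClient`. The directory name, collection name, id scheme, and batch size are placeholders, not values taken from create_cosmo_vectordb.*. Adding documents in batches keeps each `add` call small when indexing tens of thousands of abstracts.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence, size: int) -> Iterator[List]:
    """Split `items` into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_collection(abstracts: Sequence[str], persist_dir: str = "cosmo_chroma",
                     name: str = "cosmo_abstracts", batch_size: int = 1000):
    """Persist abstracts in a Chroma collection (requires `pip install chromadb`)."""
    import chromadb  # imported lazily so `batched` works without the dependency

    client = chromadb.PersistentClient(path=persist_dir)  # hypothetical directory
    collection = client.get_or_create_collection(name=name)
    ids = [f"abs-{i}" for i in range(len(abstracts))]  # hypothetical id scheme
    for id_batch, doc_batch in zip(batched(ids, batch_size),
                                   batched(abstracts, batch_size)):
        # Chroma embeds the documents with the collection's embedding function
        collection.add(ids=id_batch, documents=doc_batch)
    return collection


# The batching helper can be checked without Chroma installed
print([len(b) for b in batched(list(range(2500)), 1000)])
```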
Take the assembled dataset and build a RAG chatbot (no memory) using Mixtral-8x7B via NVIDIA's LangChain integration, all-MiniLM-L6-v2, LangChain, and ChromaDB
- Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.
The image above is a screenshot of the Mixtral chatbot (No memory)
Using the same tech stack, build a context-based retrieval search
- Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.
The above is a screenshot of the Semantic search results.
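At its core, the semantic search ranks stored abstracts by how close their embedding is to the query embedding. The sketch below illustrates that ranking with cosine similarity over toy 3-dimensional vectors. The vectors and document ids are invented for illustration; in the actual stack the embeddings would come from all-MiniLM-L6-v2 and the ranking would be done by the vector store.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-d "embeddings", made up for illustration
docs = {
    "dark_energy": [0.9, 0.1, 0.0],
    "cmb": [0.1, 0.9, 0.1],
    "galaxies": [0.0, 0.2, 0.9],
}
results = top_k([0.2, 0.8, 0.0], docs, k=2)
print([doc_id for doc_id, _ in results])
```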
Using Bonito on an A100 GPU in Google Colab, create an instruction-tuning QA dataset from a Dark Matter review paper
- Notebook and script (Instruction_Dataset_Synth_bonito_Dark_Matter_Review.ipynb) uploaded to miscellaneous/code. The dataset is available on HuggingFace Hub: delayedkarma/dark_matter_instruction_qa.
The above is a screenshot of the generated dataset
LangChain RAG from Scratch (https://github.com/langchain-ai/rag-from-scratch/blob/main/README.md), using arXiv Cosmology data
- First notebook (Overview) uploaded to langchain_astro_rag
- Second notebook (Multi-Query, RAG-Fusion, Decomposition, Step-back Prompting, HyDE) uploaded to langchain_astro_rag
- Third notebook (Logical and Semantic Routing, Query Structuring for Metadata filters) uploaded to langchain_astro_rag
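Of the techniques listed for the second notebook, RAG-Fusion is the easiest to show in isolation: it retrieves documents for several rewrites of the same question and merges the ranked lists with reciprocal rank fusion. Below is a minimal sketch of that merge step, assuming 1-based ranks and the commonly used smoothing constant k=60. The paper ids are placeholders, and the notebook's own implementation may differ in detail.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# One ranked list per query rewrite (placeholder ids)
rankings = [
    ["paper_a", "paper_b", "paper_c"],
    ["paper_b", "paper_a", "paper_d"],
    ["paper_b", "paper_c", "paper_a"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Documents that appear near the top of several lists rise above documents that rank first in only one list, which is the point of fusing the rewrites.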
Working RAG-based conversational chatbots (with memory) using LangChain and Streamlit
- Uploaded v1, v2 and v3 of the scripts to chatbots/code
The above is a screenshot of the current working version of the astro_v3.. chatbot (Using Streamlit)
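For readers unfamiliar with what "memory" means here, the idea can be sketched as a sliding window of past turns that is prepended to each new question. This is an illustrative stand-in in plain Python, not the LangChain memory classes the scripts may use. The class name, window size, and prompt layout are all assumptions.

```python
from collections import deque
from typing import Deque, Tuple


class ConversationMemory:
    """Keep only the most recent `max_turns` question/answer pairs."""

    def __init__(self, max_turns: int = 3) -> None:
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_prompt(self, new_question: str) -> str:
        """Render the remembered turns followed by the new question."""
        lines = []
        for question, answer in self.turns:
            lines.append(f"User: {question}")
            lines.append(f"Assistant: {answer}")
        lines.append(f"User: {new_question}")
        return "\n".join(lines)


memory = ConversationMemory(max_turns=2)
memory.add("What is the CMB?", "Relic radiation from the early universe.")
memory.add("When was it emitted?", "At recombination.")
memory.add("Who discovered it?", "Penzias and Wilson.")
prompt = memory.as_prompt("What temperature is it today?")
print(prompt)
```

With `max_turns=2` the oldest exchange is dropped, so the prompt stays bounded no matter how long the chat runs.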
Streamlit Chatbot To-Do:
- Optimize load-time
- Add background image
- Make history display cleaner
Initial RAG Evaluate notebook using RAGAS
- Uploaded synthetic dataset for evaluation to rag_evaluate/ragas_evaluate/data
- Uploaded v1 notebook to rag_evaluate/ragas_evaluate/code
Download relevant papers from arXiv programmatically
- Uploaded initial notebook to miscellaneous/code/notebooks/download_arxiv_papers.ipynb
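One way to approach the download step is through the public arXiv API. The sketch below only builds the request URLs, so it runs without network access. It assumes the `export.arxiv.org/api/query` endpoint with its `search_query`, `start`, and `max_results` parameters, and the function names are hypothetical. The notebook itself may rely on a different client.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category: str = "astro-ph.CO", start: int = 0,
                    max_results: int = 10) -> str:
    """Build an arXiv API URL listing papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def pdf_url(arxiv_id: str) -> str:
    """PDF location for an arXiv identifier (the id is supplied by the caller)."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


url = build_query_url(max_results=5)
print(url)
```

The response to such a query is an Atom feed. Paging through results means increasing `start` in steps of `max_results`, with a pause between requests to stay polite to the service.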
Next Steps and Ideas
- Use Bonito to build an instruction-tuning dataset for evaluating the RAG application.
- Evaluate the RAG application using RAGAS.
- Explore alternative ways to evaluate the RAG application.
- Visualize the RAG application.
- Explore fine-tuning an LLM on the instruction-tuning dataset.
- Evaluate the fine-tuned LLM against the pre-trained one.
- Explore Advanced RAG (reranking, etc.) using both LangChain and LlamaIndex.
- Explore context evaluation using TruLens.
- Explore different fine-tuning methods, perhaps DPO if we can build a Cosmology preference dataset.
- Try DSPy for RAG.
- Create a proper chatbot with memory.
- Get the paper text and build datasets with that.
- Build full applications (RAG, Fine-tuning) based on full paper texts.
- Build Knowledge Graph RAGs.
- Auto-detect formulae from papers, convert them to LaTeX, and verify the correctness.
- Agents.
- Use AssemblyAI (or some other tool) to summarize lectures, specifically Cosmology lectures (Leonard Susskind, etc.): https://www.youtube.com/watch?v=P-medYaqVak&list=PLvh0vlLitZ7c8Avsn6gUaWX05uD5cedO-&ab_channel=Stanford
- Evaluate RAG across several different methods: query decomposition, step-back prompting, RAG-Fusion, etc.
The image above is the Hubble Interacting Galaxy IRAS 18090