This repository is intended to be a collection of various Data Science, Data Analytics, AI and LLM-based experiments (RAG, Fine-Tuning) in the Cosmology and Extragalactic Astronomy domain

panchambanerjee/CosmologyAI

CosmologyAI 🌠

This repository is a collection of various Data Science, Data Analytics, AI, and LLM-based experiments (RAG, Fine-Tuning) in the Cosmology and Extragalactic Astronomy domain.

The image above is a Hubble Space Telescope image of the Star-Forming Region LH 95 in the Large Magellanic Cloud.

So far

Assemble cosmology-related abstracts from the arXiv dataset (Kaggle, Cornell)
  • Notebook and script (get_cosmo_data_from_arxiv.*) uploaded to arxiv_project/code.
Build a basic chatbot (no memory) with LangChain and Ollama embeddings, running locally on a Mac, with Groq (LPU) for inference and Gradio for the interface
  • Notebook and script (chatbot_cmb_basic.*) uploaded to cmb_rag/code. Relevant CMB review papers are in cmb_rag/cmb_data.

The above is a screenshot of the RAG QA (using Gradio)
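
The first step above, assembling cosmology abstracts from the arXiv metadata, can be sketched roughly as follows. This is a minimal illustration and not the repository's script: it assumes the Kaggle snapshot is a JSON Lines file whose records carry `id`, `title`, `abstract`, and a space-separated `categories` string, and it assumes `astro-ph.CO` is the category of interest. The helper name `iter_cosmology_records` and the sample records are made up for the example.

```python
import json
from typing import Iterable, Iterator

# Assumption: astro-ph.CO is the category of interest; the repository's own
# script may use a different or broader filter.
COSMOLOGY_CATEGORY = "astro-ph.CO"


def iter_cosmology_records(
    lines: Iterable[str], category: str = COSMOLOGY_CATEGORY
) -> Iterator[dict]:
    """Yield id, title and abstract for records tagged with `category`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "categories" is one space-separated string, e.g. "astro-ph.CO gr-qc".
        if category in record.get("categories", "").split():
            yield {
                "id": record["id"],
                # Collapse line breaks and repeated spaces inside the text.
                "title": " ".join(record["title"].split()),
                "abstract": " ".join(record["abstract"].split()),
            }


# Two made-up records standing in for lines of the real snapshot file.
sample_lines = [
    json.dumps({"id": "0000.00001", "title": "A toy\n CMB paper",
                "abstract": "We study the CMB.",
                "categories": "astro-ph.CO gr-qc"}),
    json.dumps({"id": "0000.00002", "title": "A toy stellar paper",
                "abstract": "We study stars.",
                "categories": "astro-ph.SR"}),
]
cosmology_records = list(iter_cosmology_records(sample_lines))
print(cosmology_records)
```

Because the function accepts any iterable of lines, an open file handle on the downloaded snapshot can be passed in directly, which avoids loading the whole file into memory.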

Create a vector database from the cosmology arXiv abstracts (~66k abstracts) and persist it with Chroma
  • Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.
Take the assembled dataset and build a RAG chatbot (no memory) using Mixtral-8x7B served through NVIDIA's LangChain integration, all-MiniLM-L6-v2 embeddings, LangChain, and ChromaDB
  • Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.

The image above is a screenshot of the Mixtral chatbot (No memory)
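
A chatbot without memory treats every question independently: it retrieves chunks for the current question only and places them in a fresh prompt. The sketch below shows just that prompt-assembly step in plain Python. The template wording, the function name `build_rag_prompt`, and the `max_chunks` parameter are illustrative assumptions, not the prompts used in the notebooks; in the actual stack the retrieved chunks would come from the Chroma store and the prompt would be sent to the LLM.

```python
from typing import List

# Hypothetical template; the notebooks may phrase their prompt differently.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_rag_prompt(question: str, retrieved_chunks: List[str],
                     max_chunks: int = 4) -> str:
    """Number the top chunks and place them in the prompt; no history is kept."""
    numbered = [
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(retrieved_chunks[:max_chunks], start=1)
    ]
    return PROMPT_TEMPLATE.format(
        context="\n".join(numbered), question=question.strip()
    )


# Made-up chunks standing in for retriever output.
chunks = [
    "The CMB has a mean temperature of about 2.725 K.",
    "Recombination occurred at a redshift of roughly 1100.",
]
prompt = build_rag_prompt("What is the temperature of the CMB?", chunks)
print(prompt)
```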

Using the same tech stack, build a context-based retrieval search
  • Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.

The above is a screenshot of the Semantic search results.
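
The retrieval step behind this kind of search ranks documents by the cosine similarity between a query vector and each document vector. The sketch below shows that ranking in plain Python. Note the large simplification: `embed` here is a toy word-count vector, so it only matches shared words, whereas a sentence-embedding model such as all-MiniLM-L6-v2 also matches related meanings. All function names and the sample titles are made up for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: L2-normalised word counts (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def search(query: str, documents: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up titles standing in for arXiv abstracts.
abstracts = [
    "Constraints on dark energy from baryon acoustic oscillations",
    "Weak lensing measurements of galaxy cluster masses",
    "Dark energy equation of state from supernova data",
]
results = search("dark energy constraints", abstracts, k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

A vector store such as Chroma performs the same ranking, but over precomputed dense embeddings and with an index, so the documents are not re-embedded on every query.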

Using Bonito on an A100 GPU in Google Colab, create an instruction-tuning QA dataset from a Dark Matter review paper
  • Notebook and script (Instruction_Dataset_Synth_bonito_Dark_Matter_Review.ipynb) uploaded to miscellaneous/code. The dataset is available on the HuggingFace Hub: delayedkarma/dark_matter_instruction_qa.

The above is a screenshot of the generated dataset

LangChain RAG from Scratch (https://github.com/langchain-ai/rag-from-scratch/blob/main/README.md), using ArXiv Cosmology data
  • First notebook (Overview) uploaded to langchain_astro_rag
  • Second notebook (Multi-Query, RAG-Fusion, Decomposition, Step-back Prompting, HyDE) uploaded to langchain_astro_rag
  • Third notebook (Logical and Semantic Routing, Query Structuring for Metadata filters) uploaded to langchain_astro_rag
Working RAG-based conversational chatbots (with memory) using LangChain and Streamlit
  • Uploaded v1, v2 and v3 of the scripts to chatbots/code

The above is a screenshot of the current working version of the astro_v3.. chatbot (Using Streamlit)
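
What separates these chatbots from the earlier no-memory versions is that previous turns are carried into each new prompt. The sketch below illustrates one simple way to do that, a fixed-size buffer of recent turns, in plain Python. It is an assumption-laden illustration and not the code in chatbots/code: the class `ConversationMemory`, the function `build_prompt`, and the prompt layout are all hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class ConversationMemory:
    """Keep only the most recent question/answer turns."""

    def __init__(self, max_turns: int = 3) -> None:
        # A deque with maxlen silently drops the oldest turn when full.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_text(self) -> str:
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)


def build_prompt(memory: ConversationMemory, context: str, question: str) -> str:
    """Combine chat history, retrieved context and the new question."""
    history = memory.as_text() or "(no previous turns)"
    return (
        f"Chat history:\n{history}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


memory = ConversationMemory(max_turns=3)
for i in range(1, 5):
    memory.add_turn(f"question {i}", f"answer {i}")

prompt = build_prompt(memory, "retrieved text goes here", "question 5")
print(prompt)
```

In a Streamlit app, an object like this would typically be kept in `st.session_state` so that it survives the script reruns triggered by each user interaction.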

Streamlit Chatbot To-Do:

  • Optimize load-time
  • Add background image
  • Make history display cleaner
Initial RAG evaluation notebook using RAGAS
  • Uploaded synthetic dataset for evaluation to rag_evaluate/ragas_evaluate/data
  • Uploaded v1 notebook to rag_evaluate/ragas_evaluate/code
Download relevant papers from arXiv programmatically
  • Uploaded initial notebook to miscellaneous/code/notebooks/download_arxiv_papers.ipynb
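
As a rough illustration of the programmatic download step, the sketch below builds the two URLs such a script needs: an arXiv API query for a category, and the PDF link for a given paper id. It only constructs the URLs and makes no network requests. The function names, the default category `astro-ph.CO`, and the sample id are assumptions for the example; the notebook may instead rely on a client library.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category: str = "astro-ph.CO", start: int = 0,
                    max_results: int = 10) -> str:
    """Build an arXiv API query URL for the newest papers in a category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def pdf_url(arxiv_id: str) -> str:
    """Build the PDF link for an arXiv identifier."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


query_url = build_query_url(max_results=5)
print(query_url)
print(pdf_url("0000.00001"))  # made-up id, for illustration only
```

The query URL returns an Atom feed that can be fetched with `urllib.request.urlopen` and parsed for paper ids. When downloading many papers, it is worth pausing between requests to respect arXiv's rate-limit guidance.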

Next Steps and Ideas

  • Use Bonito to create an instruction-tuning dataset for evaluating the RAG application.
  • Evaluate RAG application using RAGAS.
  • Explore alternative ways to evaluate RAG application.
  • Visualize RAG application.
  • Explore fine-tuning an LLM using instruction-tuned dataset.
  • Evaluate fine-tuned LLM vs pre-trained.
  • Explore advanced RAG (reranking, etc.) using both LangChain and LlamaIndex.
  • Explore context evaluation using TruLens.
  • Explore different fine-tuning methods, perhaps DPO if we can build a Cosmology preference dataset.
  • Try DSPy for RAG.
  • Create a proper chatbot with memory.
  • Get the paper text and build datasets with that.
  • Build full applications (RAG, Fine-tuning) based on full paper texts.
  • Build Knowledge Graph RAGs.
  • Auto-detect formulae from papers, convert them to LaTeX, and verify the correctness.
  • Agents.
  • Use AssemblyAI (or a similar tool) to summarize lectures, specifically cosmology lectures (Leonard Susskind, etc.): https://www.youtube.com/watch?v=P-medYaqVak&list=PLvh0vlLitZ7c8Avsn6gUaWX05uD5cedO-&ab_channel=Stanford
  • Evaluate RAG across several methods: query decomposition, step-back prompting, RAG-Fusion, etc.
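
RAG-Fusion, one of the methods listed above, retrieves documents for several rewritten versions of a question and merges the ranked lists with reciprocal rank fusion (RRF). The sketch below shows the fusion step alone. It assumes 1-based ranks and the commonly used constant k = 60; the function name and document ids are made up, and the notebooks may index ranks differently.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# One ranked list per rewritten query (made-up document ids).
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear near the top of several lists accumulate the largest scores, so `doc_b` (ranked first twice) ends up ahead of `doc_a`, while `doc_d`, which appears only once, comes last.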

The image above is the Hubble Interacting Galaxy IRAS 18090
