CosmologyAI 🌠

This repository collects Data Science, Data Analytics, AI, and LLM-based experiments (RAG, fine-tuning) in the Cosmology and Extragalactic Astronomy domain.

The image above is a Hubble Space Telescope image of the Star-Forming Region LH 95 in the Large Magellanic Cloud.

So far

Assemble Cosmology-related abstracts from the arXiv dataset (Kaggle, Cornell)

Notebook and script (get_cosmo_data_from_arxiv.*) uploaded to arxiv_project/code.
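A minimal sketch of the filtering step, assuming the Kaggle snapshot is a JSON Lines file whose records carry `id`, `title`, `abstract`, and a space-separated `categories` string, and that "Cosmology-related" means the `astro-ph.CO` category. The function name and the category choice are illustrative and are not taken from `get_cosmo_data_from_arxiv.*`.

```python
import json
from typing import Iterable, Iterator

# Assumption: cosmology papers are those tagged astro-ph.CO.
COSMO_CATEGORIES = {"astro-ph.CO"}


def iter_cosmo_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield id/title/abstract for records tagged with a cosmology category."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        categories = set(record.get("categories", "").split())
        if categories & COSMO_CATEGORIES:
            yield {
                "id": record["id"],
                # Collapse the line breaks and double spaces common in the metadata.
                "title": " ".join(record["title"].split()),
                "abstract": " ".join(record["abstract"].split()),
            }
```

Because the function takes any iterable of lines, an open file handle on the downloaded snapshot can be passed directly, which streams the file instead of loading it into memory.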

Build a basic chatbot (no memory) with LangChain and Ollama embeddings, running locally on a Mac, with Groq (LPU) for inference and Gradio for the interface

Notebook and script (chatbot_cmb_basic.*) uploaded to cmb_rag/code. Relevant CMB review papers are in cmb_rag/cmb_data.

The above is a screenshot of the RAG QA (using Gradio)
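The "no memory" flow can be sketched without any framework: retrieve passages for the question, stuff them into a prompt, and generate an answer, with nothing carried over between calls. The prompt wording and the `retrieve` and `generate` callables below are hypothetical stand-ins, not the code in `chatbot_cmb_basic.*`.

```python
from typing import Callable, List

# Illustrative prompt wording; the actual template is an assumption.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question: str, passages: List[str]) -> str:
    """Stuff the retrieved passages into a single prompt string."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Answer one question; no state is kept between calls (no memory)."""
    passages = retrieve(question)
    return generate(build_prompt(question, passages))
```

In a setup like the one described, `retrieve` would presumably wrap a vector-store lookup over Ollama embeddings, `generate` would call a Groq-hosted model, and the Gradio interface would call `answer` once per user message.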

Create a vector database from the Cosmology arXiv abstracts (~66k abstracts) and persist it with Chroma

Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.
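A rough sketch of persisting the abstracts with the `chromadb` client, adding them in batches rather than in a single call. The collection name, directory, batch size, and record keys are assumptions, and the repository may use LangChain's Chroma wrapper instead of the client directly.

```python
from typing import Dict, Iterator, List, Sequence


def batched(records: Sequence[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Split records into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def build_vectordb(
    records: Sequence[Dict[str, str]],
    persist_dir: str = "cosmo_chroma",  # hypothetical directory name
    batch_size: int = 1000,
):
    """Add abstracts to a persistent Chroma collection, batch by batch."""
    import chromadb  # imported lazily so `batched` has no third-party dependency

    client = chromadb.PersistentClient(path=persist_dir)
    # Hypothetical collection name; no embedding function is passed,
    # so Chroma falls back to its default one.
    collection = client.get_or_create_collection(name="cosmology_abstracts")
    for batch in batched(records, batch_size):
        collection.add(
            ids=[r["id"] for r in batch],
            documents=[r["abstract"] for r in batch],
            metadatas=[{"title": r["title"]} for r in batch],
        )
    return collection
```

Reopening `PersistentClient` with the same path in a later session should load the stored collection without re-embedding the abstracts.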

Code to take the assembled dataset and build a RAG chatbot (no memory) using Mixtral-8x7B served by NVIDIA (LangChain integration), all-MiniLM-L6-v2 embeddings, LangChain, and ChromaDB

Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.

The image above is a screenshot of the Mixtral chatbot (No memory)

Using the same tech stack, build a context-based retrieval search

Notebook and script (create_cosmo_vectordb.*) uploaded to arxiv_project/code.

The above is a screenshot of the Semantic search results.
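Semantic search reduces to ranking stored embedding vectors by their similarity to the query vector. The sketch below shows that ranking with cosine similarity in plain Python. It assumes the vectors have already been produced by an embedding model, and a vector store such as Chroma performs an equivalent ranking internally, possibly with a different distance metric.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k closest documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```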

Using Bonito, an A100 GPU on Google Colab, and a Dark Matter review paper, create an instruction-tuning QA dataset

Notebook and script (Instruction_Dataset_Synth_bonito_Dark_Matter_Review.ipynb) uploaded to miscellaneous/code. The dataset is available on the Hugging Face Hub: delayedkarma/dark_matter_instruction_qa.

The above is a screenshot of the generated dataset

LangChain RAG from Scratch (https://github.com/langchain-ai/rag-from-scratch/blob/main/README.md), using arXiv Cosmology data

First notebook (Overview) uploaded to langchain_astro_rag
Second notebook (Multi-Query, RAG-Fusion, Decomposition, Step-back Prompting, HyDE) uploaded to langchain_astro_rag
Third notebook (Logical and Semantic Routing, Query Structuring for Metadata filters) uploaded to langchain_astro_rag
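Of the techniques in the second notebook, RAG-Fusion has a compact core: the result lists retrieved for several rephrasings of a query are merged with reciprocal rank fusion (RRF). The sketch below is a generic RRF implementation, not the notebook's code. The smoothing constant `k = 60` is the value commonly used for RRF and is an assumption here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranked list.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1; documents are returned by descending total score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Documents that appear near the top of many lists rise in the fused ranking, while a document found by only one rephrasing is kept but ranked lower.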

Working RAG-based conversational chatbots (with memory) using LangChain and Streamlit

Uploaded v1, v2 and v3 of the scripts to chatbots/code

The above is a screenshot of the current working version of the astro_v3.. chatbot (Using Streamlit)
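What "with memory" adds to a RAG chatbot can be sketched as a bounded buffer of recent turns that is prepended to each prompt. The class, its parameters, and the prompt layout below are hypothetical, and the scripts in chatbots/code may rely on LangChain's own memory components instead.

```python
from collections import deque
from typing import Deque, List, Tuple


class ConversationMemory:
    """Keep the most recent (user, assistant) turns for prompt construction."""

    def __init__(self, max_turns: int = 5) -> None:
        # deque(maxlen=...) silently drops the oldest turn when full.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add_turn(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def as_text(self) -> str:
        lines: List[str] = []
        for user, assistant in self._turns:
            lines.append(f"User: {user}")
            lines.append(f"Assistant: {assistant}")
        return "\n".join(lines)


def build_chat_prompt(memory: ConversationMemory, context: str, question: str) -> str:
    """Combine chat history, retrieved context, and the new question."""
    return (
        f"Chat history:\n{memory.as_text()}\n\n"
        f"Context:\n{context}\n\n"
        f"User: {question}\nAssistant:"
    )
```

In a Streamlit app the memory object would need to live in `st.session_state`, since the script reruns from the top on every interaction.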

Streamlit Chatbot To-Do:

Optimize load-time
Add background image
Make history display cleaner

Initial RAG evaluation notebook using RAGAS

Uploaded synthetic dataset for evaluation to rag_evaluate/ragas_evaluate/data
Uploaded v1 notebook to rag_evaluate/ragas_evaluate/code

Download relevant papers from arXiv programmatically

Uploaded initial notebook to miscellaneous/code/notebooks/download_arxiv_papers.ipynb
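One possible approach, sketched with the standard library only: query the public arXiv API for a category and read the PDF links out of the Atom feed it returns. The category, result count, and function names are assumptions, and the notebook may take a different route (for example, a dedicated client package).

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import Dict, List

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category: str = "astro-ph.CO", max_results: int = 10) -> str:
    """Build an arXiv API URL listing papers in one category."""
    params = {"search_query": f"cat:{category}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_pdf_links(atom_xml: str) -> List[Dict[str, str]]:
    """Extract each entry's title and PDF link from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        pdf_url = ""
        for link in entry.findall("atom:link", ATOM_NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href", "")
        papers.append({"title": " ".join(raw_title.split()), "pdf_url": pdf_url})
    return papers
```

The feed itself can be fetched with `urllib.request.urlopen(build_query_url()).read().decode("utf-8")`. When downloading the PDFs in a loop, pause between requests so as not to overload arXiv.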

Next Steps and Ideas

Use Bonito to build an instruction-tuning dataset for evaluating the RAG application.
Evaluate RAG application using RAGAS.
Explore alternative ways to evaluate RAG application.
Visualize RAG application.
Explore fine-tuning an LLM using instruction-tuned dataset.
Evaluate fine-tuned LLM vs pre-trained.
Explore Advanced RAG (reranking, etc.) using both LangChain and LlamaIndex.
Explore context evaluation using TruLens.
Explore different fine-tuning methods, perhaps DPO if we can build a Cosmology preference dataset.
Try DSPy for RAG.
Create a proper chatbot with memory.
Get the paper text and build datasets with that.
Build full applications (RAG, Fine-tuning) based on full paper texts.
Build Knowledge Graph RAGs.
Auto-detect formulae from papers, convert them to LaTeX, and verify the correctness.
Agents.
Use AssemblyAI (or some other tool) to summarize lectures, specifically Cosmology lectures (Leonard Susskind, etc.): https://www.youtube.com/watch?v=P-medYaqVak&list=PLvh0vlLitZ7c8Avsn6gUaWX05uD5cedO-&ab_channel=Stanford
Evaluate RAG across several methods: query decomposition, step-back prompting, RAG-Fusion, etc.

The image above is the Hubble Interacting Galaxy IRAS 18090

panchambanerjee/CosmologyAI
