Add `InstructMol` dataset #9975

xnuohz · 2025-01-23T13:31:15Z

Issue

Detail

Comparison between InstructMol and MoleculeGPT:

data: same data structure but different data sources (molecular graph + SMILES sequence + question + answer)
model: almost the same model paradigm (multimodal + QA)
So in this PR I only implemented the InstructMol dataset and added it to the MoleculeGPT model example.
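
To make the shared data structure concrete, here is a minimal sketch of one sample as a plain Python record. The class name `MoleculeQASample` and its field names are hypothetical and chosen only for illustration; they are not the attribute names used by the actual dataset classes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MoleculeQASample:
    """Hypothetical container; field names are illustrative only."""
    atom_types: List[int]         # one atomic number per graph node
    bonds: List[Tuple[int, int]]  # undirected edges as (src, dst) node indices
    smiles: str                   # SMILES sequence of the same molecule
    question: str                 # instruction / question text
    answer: str                   # ground-truth answer text

    def num_nodes(self) -> int:
        return len(self.atom_types)


# Ethanol (SMILES "CCO"), heavy atoms only: C-C-O
sample = MoleculeQASample(
    atom_types=[6, 6, 8],
    bonds=[(0, 1), (1, 2)],
    smiles="CCO",
    question="Describe this molecule.",
    answer="Ethanol, a simple primary alcohol.",
)
```

Since both datasets provide these four pieces per sample, the same example script can consume either one.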

xnuohz · 2025-01-23T13:31:48Z

python examples/llm/molecule_gpt.py --epochs 2 --batch_size 64
Setting up 'TinyLlama/TinyLlama-1.1B-Chat-v0.1' with configuration: {'revision': 'main', 'max_memory': {0: '23GiB'}, 'low_cpu_mem_usage': True, 'device_map': 'auto', 'torch_dtype': torch.bfloat16}
2025-01-23 19:51:11.429499: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-23 19:51:11.448023: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-23 19:51:11.448047: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-23 19:51:11.448085: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-23 19:51:11.451998: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-23 19:51:11.840210: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ubuntu/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/accelerate/utils/imports.py:313: UserWarning: Intel Extension for PyTorch 2.1 needs to work with PyTorch 2.1.*, but PyTorch 2.5.1 is found. Please switch to the matching version and run again.
  warnings.warn(
Some weights of RobertaModel were not initialized from the model checkpoint at DeepChem/ChemBERTa-77M-MTR and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Total Preparation Time: 6.509785s
Training beginning...
Epoch: 1|2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4083/4083 [42:57<00:00,  1.58it/s]
Epoch: 1|2, Train loss: 1.378120, Val loss: 1.275135
Epoch: 2|2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4083/4083 [42:50<00:00,  1.59it/s]
Epoch: 2|2, Train loss: 1.248772, Val loss: 1.246007
Total Training Time: 5346.377838s
Test loss: 1.250804
Total Time: 5451.700063s

puririshi98 · 2025-01-23T16:08:27Z

This LGTM and I'm fine to merge it ASAP; just curious what the reasoning is for making InstructMol the default in the argparser.
I just think it's a bit confusing since the example says MoleculeGPT. But if there is a strong reason I haven't thought of, let me know. I'll merge after I understand.

xnuohz · 2025-01-24T14:07:31Z

Oops, changed it back. The default setting was for testing purposes.
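
A rough sketch of the idea, keeping MoleculeGPT as the default and making InstructMol opt-in. Only `--epochs` and `--batch_size` appear in the command above; the `--dataset_name` flag, its choices, and all default values here are assumptions for illustration, not the script's actual arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Flags seen in the run command above; the defaults are assumed.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=2)
    # Hypothetical flag: the default matches the example's name,
    # and InstructMol has to be requested explicitly.
    parser.add_argument(
        "--dataset_name",
        choices=["MoleculeGPT", "InstructMol"],
        default="MoleculeGPT",
    )
    return parser


# Same invocation as in the log above, with no dataset flag given.
args = build_parser().parse_args(["--epochs", "2", "--batch_size", "64"])
```

With this layout, running the example without extra flags still trains on MoleculeGPT.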

puririshi98

now LGTM

update

d7ab93a

xnuohz requested a review from wsad1 as a code owner January 23, 2025 13:31

xnuohz added 2 commits January 23, 2025 21:33

add changelog

e49861d

fix lint

8ab6ca1

puririshi98 and others added 2 commits January 23, 2025 08:11

Update README.md

0aa2ced

update

d879934

puririshi98 approved these changes Jan 24, 2025

puririshi98 merged commit ed89c94 into pyg-team:master Jan 24, 2025
15 of 16 checks passed
