Fix glem example #9903

Merged: 3 commits merged into pyg-team:master from fix/glem-example on Jan 7, 2025
Conversation

@xnuohz (Contributor) commented on Dec 29, 2024

Closes #9899.

@xnuohz (Contributor, Author) commented on Dec 29, 2024

Namespace(gpu=0, num_runs=10, num_em_iters=1, dataset='arxiv', pl_ratio=0.5, hf_model='prajjwal1/bert-tiny', gnn_model='SAGE', gnn_hidden_channels=256, gnn_num_layers=3, gat_heads=4, lm_batch_size=256, gnn_batch_size=1024, external_pred_path=None, alpha=0.5, beta=0.5, lm_epochs=10, gnn_epochs=50, gnn_lr=0.002, lm_lr=0.001, patience=3, verbose=False, em_order='lm', lm_use_lora=False, token_on_disk=True, out_dir='output/', train_without_ext_pred=True)
Running on: NVIDIA GeForce RTX 3090
/home/ubuntu/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/ogb/nodeproppred/dataset_pyg.py:69: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  self.data, self.slices = torch.load(self.processed_paths[0])
/home/ubuntu/Projects/pytorch_geometric/torch_geometric/data/in_memory_dataset.py:300: UserWarning: It is not recommended to directly access the internal storage format `data` of an 'InMemoryDataset'. If you are absolutely certain what you are doing, access the internal storage via `InMemoryDataset._data` instead to suppress this warning. Alternatively, you can access stacked individual attributes of every graph via `dataset.{attr_name}`.
  warnings.warn(msg)
Processing...
Done!
Found tokenized file, loading may take several minutes...
40 ['node-feat.csv.gz', 'node-label.csv.gz', 'ogbn-arxiv.csv', 'num-edge-list.csv.gz', 'num-node-list.csv.gz', 'node-gpt-response.csv.gz', 'edge.csv.gz', 'node_year.csv.gz', 'node-text.csv.gz']
train_idx: 136411, gold_idx: 90941, pseudo labels ratio: 0.5, 0.49999450192982264
Building language model dataloader...-->done
GPU memory usage -- data to gpu: 0.10 GB
build GNN dataloader(GraphSAGE NeighborLoader)--># GNN Params: 217640
2024-12-29 20:10:39.539031: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-12-29 20:10:39.556333: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-29 20:10:39.556357: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-29 20:10:39.556369: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-29 20:10:39.559845: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-29 20:10:39.939306: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# LM Params: 4391080
pretraining gnn to generate pseudo labels
Epoch: 01 Loss: 2.1609 Approx. Train: 0.4124
Epoch: 02 Loss: 1.5087 Approx. Train: 0.5615
Epoch: 03 Loss: 1.3920 Approx. Train: 0.5874
Epoch: 04 Loss: 1.3226 Approx. Train: 0.6054
Epoch: 05 Loss: 1.2771 Approx. Train: 0.6165
Train: 0.6071, Val: 0.5839
Epoch: 06 Loss: 1.2425 Approx. Train: 0.6260
Train: 0.6161, Val: 0.5913
Epoch: 07 Loss: 1.2123 Approx. Train: 0.6328
Train: 0.6206, Val: 0.5975
Epoch: 08 Loss: 1.1876 Approx. Train: 0.6383
Train: 0.6222, Val: 0.5898
Epoch: 09 Loss: 1.1627 Approx. Train: 0.6455
Train: 0.6310, Val: 0.6027
Epoch: 10 Loss: 1.1414 Approx. Train: 0.6518
Train: 0.6303, Val: 0.6014
Epoch: 11 Loss: 1.1195 Approx. Train: 0.6568
Train: 0.6427, Val: 0.5998
Epoch: 12 Loss: 1.0970 Approx. Train: 0.6620
Train: 0.6413, Val: 0.6035
Epoch: 13 Loss: 1.0788 Approx. Train: 0.6693
Train: 0.6499, Val: 0.6049
Epoch: 14 Loss: 1.0633 Approx. Train: 0.6713
Train: 0.6530, Val: 0.6076
Epoch: 15 Loss: 1.0447 Approx. Train: 0.6768
Train: 0.6590, Val: 0.6068
Pretrain Early stopped by Epoch: 15
Pretrain gnn time: 12.69s
Saved predictions to output/preds/arxiv/gnn_pretrain.pt
Pretraining acc: 0.6590, Val: 0.6068, Test: 0.5470
EM iteration: 1, EM phase: lm
Move lm model from cpu memory
Epoch 01 Loss: 1.5445 Approx. Train: 0.5880
Epoch 02 Loss: 1.0698 Approx. Train: 0.6829
Epoch 03 Loss: 0.8852 Approx. Train: 0.7028
Epoch 04 Loss: 0.7309 Approx. Train: 0.7183
Epoch 05 Loss: 0.6036 Approx. Train: 0.7326
Train: 0.8627, Val: 0.6461,
Epoch 06 Loss: 0.4984 Approx. Train: 0.7450
Train: 0.8854, Val: 0.6458,
Epoch 07 Loss: 0.4191 Approx. Train: 0.7582
Train: 0.9061, Val: 0.6488,
Epoch 08 Loss: 0.3559 Approx. Train: 0.7688
Train: 0.9240, Val: 0.6469,
Epoch 09 Loss: 0.3080 Approx. Train: 0.7767
Train: 0.9327, Val: 0.6338,
Epoch 10 Loss: 0.2675 Approx. Train: 0.7852
Train: 0.9408, Val: 0.6358,
Early stopped by Epoch: 10,                             Best acc: 0.6488472767542535
EM iteration: 2, EM phase: gnn
Move gnn model from cpu memory
Epoch: 01 Loss: 0.8052 Approx. Train: 0.6347
Epoch: 02 Loss: 0.7703 Approx. Train: 0.6367
Epoch: 03 Loss: 0.7489 Approx. Train: 0.6395
Epoch: 04 Loss: 0.7351 Approx. Train: 0.6409
Epoch: 05 Loss: 0.7227 Approx. Train: 0.6441
Train: 0.6630, Val: 0.6099,
Epoch: 06 Loss: 0.7121 Approx. Train: 0.6454
Train: 0.6626, Val: 0.6030,
Epoch: 07 Loss: 0.7043 Approx. Train: 0.6474
Train: 0.6655, Val: 0.6085,
Epoch: 08 Loss: 0.6893 Approx. Train: 0.6496
Train: 0.6652, Val: 0.6050,
Epoch: 09 Loss: 0.6791 Approx. Train: 0.6517
Train: 0.6744, Val: 0.6033,
Early stopped by Epoch: 9,                             Best acc: 0.6098526796201215
Best GNN validation acc: 0.6098526796201215,LM validation acc: 0.6488472767542535
============================
Best test acc: 0.5541633232516512, model: lm
Total running time: 0.08 hours
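
Note that the `torch.load` FutureWarning near the top of the log is raised by OGB's dataset loader, not by this example. As a minimal sketch, the pattern the warning itself recommends looks like the following; the file path is hypothetical, and `add_safe_globals` is only needed if the saved file contains non-tensor Python objects:

```python
import torch

# Current behavior flagged by the warning: full pickle unpickling is allowed.
obj = torch.load("processed/data.pt", weights_only=False)  # hypothetical path

# Direction recommended by the warning: restrict unpickling to tensors and
# explicitly allowlisted types. Any custom classes stored in the file must be
# registered first (SomeCustomClass is a placeholder, not a real PyG/OGB class).
# torch.serialization.add_safe_globals([SomeCustomClass])
obj = torch.load("processed/data.pt", weights_only=True)
```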

@xnuohz (Contributor, Author) commented on Dec 29, 2024

cc @puririshi98 @akihironitta

@puririshi98 (Contributor) left a comment

LGTM, thanks for catching this

@puririshi98 merged commit cb424a6 into pyg-team:master on Jan 7, 2025
16 of 17 checks passed
@xnuohz deleted the fix/glem-example branch on January 8, 2025 at 01:20
Development

Successfully merging this pull request may close these issues.

Run failed when train_without_ext_pred=True in glem example