Launch core integration part1 (#192)
* GitHub alpha release

* Checkpoint update (#1)

* CI Updates and Tools (#2)

- Adding GH workflow for CI
- Adding makefile and tools for CI pipeline

* use semantic versioning (#7)

Co-authored-by: Kaustubh Tangsali <[email protected]>

* [fea] Add GraphCast recipe (#6)

* adding graphcast recipe

* partially addressing comments

* addressed more comments

* update logging, checkpointing

* sync with modulus changes

* formatting

* logger wrapper

* resolve conflict

* reverting checkpoint formatting

* formatting

* add license to validation.py

* [bug] graphcast recipe (#9)

* graphcast recipe bug fixes

* adding icospheres.json

* change constants

* fea-ext-vortex_shedding_ddp (#39)

* add ddp, refactor recipe

* black formatting

* update configs

* update wandb util

* black formatting

* update inference

* fix: Checkpoint update (#51)

- Fixes checkpoint bugs and updates to the new static capture

* Version update 0.2.0 (#72)

* Update Launch version to 0.2.0

* Update Modulus Core version in pyproject.toml

* Update bug_report.yml

* Update CHANGELOG.md

* Version main to pre-release 0.3.0a0 (#81)

* fixed checkpoints (#42)

* fixed checkpoints

* fixed nested darcy as well

* fixed checkpoint

---------

Co-authored-by: oliver <[email protected]>

* Version update 0.3.0 (#105)

* Version update

* 0.4.0a0 update

* Fixes an issue when no scaler is given, but it is present in the checkpoint (#111)

* Fixes an issue when no scaler is given, but it is present in the checkpoint

* Updated the CHANGELOG.md

---------

Co-authored-by: LimitingFactor <[email protected]>

* Adding per rank mlflow tracking location to fix race condition and updating fcn_afno config for easier benchmarking (#121)

Signed-off-by: Akshay Subramaniam <[email protected]>

* combine dockerfile changes from launch

* remove separate init file for launch

* add launch dependency group

* add missing package handling, make ruff (pep8) compliant

* Update CHANGELOG.md

* add self-referential

---------

Signed-off-by: Akshay Subramaniam <[email protected]>
Co-authored-by: Nick Geneva <[email protected]>
Co-authored-by: Nicholas Geneva <[email protected]>
Co-authored-by: Mohammad Amin Nabian <[email protected]>
Co-authored-by: Oliver Hennigh <[email protected]>
Co-authored-by: oliver <[email protected]>
Co-authored-by: cfd1 <[email protected]>
Co-authored-by: LimitingFactor <[email protected]>
Co-authored-by: Akshay Subramaniam <[email protected]>
9 people authored Oct 13, 2023
1 parent 6823436 commit 35cd9b8
Showing 15 changed files with 1,399 additions and 9 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- An experimental version of SFNO to be used in unified training recipe for weather models
- Added distributed FFT utility.
- Added ruff as a linting tool.
- Ported utilities from Modulus Launch to main package.

### Changed

2 changes: 2 additions & 0 deletions Dockerfile
@@ -29,6 +29,8 @@ ENV _CUDA_COMPAT_TIMEOUT=90

# Install other dependencies
RUN pip install "h5py>=3.7.0" "mpi4py>=3.1.4" "netcdf4>=1.6.3" "ruamel.yaml>=0.17.22" "scikit-learn>=1.0.2"
RUN pip install "hydra-core>=1.2.0" "termcolor>=2.1.1" "wandb>=0.13.7" "mlflow>=2.1.1" "pydantic>=1.10.2" "imageio>=2.28.1" "moviepy>=1.0.3" "tqdm>=4.60.0"

# TODO remove benchy dependency
RUN pip install git+https://github.com/romerojosh/benchy.git
# TODO use torch-harmonics pip package after the upgrade
2 changes: 1 addition & 1 deletion Makefile
@@ -44,7 +44,7 @@ pytest-internal:

coverage:
	coverage combine && \
-	coverage report --show-missing --omit=*test* --omit=*internal* --fail-under=80 && \
+	coverage report --show-missing --omit=*test* --omit=*internal* --fail-under=75 && \
	coverage html

# For arch naming conventions, refer
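This hunk lowers the coverage gate in the `coverage` target from 80% to 75%. Per coverage.py's documentation, `coverage report --fail-under=MIN` exits with status 2 when the total is below MIN, and because the recipe chains commands with `&&`, that non-zero status stops `coverage html` and fails the target. The sketch below models only that comparison; `coverage_gate` and the 77.5% total are hypothetical, and coverage.py may also round the total to its configured display precision before comparing, so edge cases can differ.

```python
# Hypothetical sketch of the gate applied by `coverage report --fail-under=N`:
# the command fails (exit status 2) when total coverage is below N.

OLD_THRESHOLD = 80.0  # value before this commit
NEW_THRESHOLD = 75.0  # value after this commit


def coverage_gate(total_percent: float, fail_under: float) -> int:
    """Return the exit status a --fail-under check would produce (0 = pass, 2 = fail).

    Simplification: ignores coverage.py's display-precision rounding.
    """
    if not 0.0 <= fail_under <= 100.0:
        raise ValueError("fail_under must be between 0 and 100")
    return 2 if total_percent < fail_under else 0


if __name__ == "__main__":
    # A hypothetical total of 77.5% fails the old gate but passes the new one.
    for threshold in (OLD_THRESHOLD, NEW_THRESHOLD):
        print(threshold, coverage_gate(77.5, threshold))
```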
13 changes: 13 additions & 0 deletions modulus/launch/__init__.py
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
13 changes: 13 additions & 0 deletions modulus/launch/config/__init__.py
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
18 changes: 18 additions & 0 deletions modulus/launch/logging/__init__.py
@@ -0,0 +1,18 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .console import PythonLogger, RankZeroLoggingWrapper
from .launch import LaunchLogger
from .mlflow import initialize_mlflow
from .wandb import initialize_wandb
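`PythonLogger`, re-exported by this `__init__.py` from `console.py`, sets its logger to `DEBUG`, attaches a console handler at `INFO`, and, through `file_logging`, a file handler at `DEBUG`. The standard-library sketch below illustrates how that split behaves: the logger's own level decides which records reach any handler, and each handler then applies its own threshold. Without the logger-level call, a fresh named logger inherits the root logger's effective level (`WARNING` by default), so neither `INFO` nor `DEBUG` records would reach the handlers. `build_logger` is a hypothetical helper that mirrors the setup with in-memory `io.StringIO` streams in place of the console and a log file.

```python
import io
import logging


def build_logger(name: str):
    """Mirror PythonLogger's level setup using in-memory streams (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)  # the logger itself lets DEBUG and above through
    logger.propagate = False  # do not forward records to the root logger

    console_buf, file_buf = io.StringIO(), io.StringIO()
    console = logging.StreamHandler(console_buf)
    console.setLevel(logging.INFO)  # stands in for the console handler: INFO and above
    file_like = logging.StreamHandler(file_buf)
    file_like.setLevel(logging.DEBUG)  # stands in for the file handler: everything
    logger.addHandler(console)
    logger.addHandler(file_like)
    return logger, console_buf, file_buf


if __name__ == "__main__":
    log, console_out, file_out = build_logger("level-demo")
    log.debug("debug detail")  # reaches only the DEBUG-level handler
    log.info("epoch finished")  # reaches both handlers
    print(repr(console_out.getvalue()))
    print(repr(file_out.getvalue()))
```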
94 changes: 94 additions & 0 deletions modulus/launch/logging/console.py
@@ -0,0 +1,94 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import os

from termcolor import colored


class PythonLogger:
    """Simple console logger for DL training.

    This is a WIP.
    """

    def __init__(self, name: str = "launch"):
        self.logger = logging.getLogger(name)
        self.logger.handlers.clear()
        formatter = logging.Formatter(
            "[%(asctime)s - %(name)s - %(levelname)s] %(message)s", datefmt="%H:%M:%S"
        )
        streamhandler = logging.StreamHandler()
        streamhandler.setFormatter(formatter)
        streamhandler.setLevel(logging.INFO)
        self.logger.addHandler(streamhandler)

        # Let all records (DEBUG and above) through at the logger level; each
        # handler then filters on its own level (console: INFO, file: DEBUG)
        self.logger.setLevel(logging.DEBUG)
        self.logger.propagate = False  # Prevent parent logging

    def file_logging(self, file_name: str = "launch.log"):
        """Log to file"""
        if os.path.exists(file_name):
            os.remove(file_name)
        formatter = logging.Formatter(
            "[%(asctime)s - %(name)s - %(levelname)s] %(message)s",
            datefmt="%H:%M:%S",
        )
        filehandler = logging.FileHandler(file_name)
        filehandler.setFormatter(formatter)
        filehandler.setLevel(logging.DEBUG)
        self.logger.addHandler(filehandler)

    def log(self, message: str):
        """Log message"""
        self.logger.info(message)

    def info(self, message: str):
        """Log info"""
        self.logger.info(colored(message, "light_blue"))

    def success(self, message: str):
        """Log success"""
        self.logger.info(colored(message, "light_green"))

    def warning(self, message: str):
        """Log warning"""
        self.logger.warning(colored(message, "light_yellow"))

    def error(self, message: str):
        """Log error"""
        self.logger.error(colored(message, "light_red"))


class RankZeroLoggingWrapper:
    """Wrapper class to only log from rank 0 process in distributed training."""

    def __init__(self, obj, dist):
        self.obj = obj
        self.dist = dist

    def __getattr__(self, name):
        attr = getattr(self.obj, name)
        if callable(attr):

            def wrapper(*args, **kwargs):
                if self.dist.rank == 0:
                    return attr(*args, **kwargs)
                else:
                    return None

            return wrapper
        else:
            return attr
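`RankZeroLoggingWrapper` forwards attribute lookups to the wrapped object and, for callables, returns a wrapper that runs the call only when `dist.rank == 0`; other ranks get `None`, and non-callable attributes pass through unchanged on every rank. The sketch below re-creates that pattern stand-alone so it runs without Modulus installed. `RankZeroOnly`, `RecordingLogger`, and `simulate` are hypothetical names, and `types.SimpleNamespace(rank=...)` stands in for the `dist` object, assuming (as the code above does) that only its `rank` attribute is used.

```python
from types import SimpleNamespace


class RankZeroOnly:
    """Stand-alone re-creation of the rank-zero wrapper pattern (illustrative only)."""

    def __init__(self, obj, dist):
        self.obj = obj
        self.dist = dist

    def __getattr__(self, name):
        # Python calls this only for attributes not found on the wrapper itself.
        attr = getattr(self.obj, name)
        if callable(attr):

            def wrapper(*args, **kwargs):
                # Non-zero ranks skip the call and get None back.
                if self.dist.rank == 0:
                    return attr(*args, **kwargs)
                return None

            return wrapper
        # Plain attributes (e.g. a logger's name) pass through on every rank.
        return attr


class RecordingLogger:
    """Hypothetical logger that stores messages instead of printing them."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def info(self, message):
        self.messages.append(message)
        return message


def simulate(world_size):
    """Give each simulated rank its own logger and wrapper; return what each recorded."""
    recorded = {}
    for rank in range(world_size):
        inner = RecordingLogger(f"rank{rank}")
        log = RankZeroOnly(inner, SimpleNamespace(rank=rank))  # stand-in `dist`
        log.info("epoch 1 done")
        recorded[rank] = inner.messages
    return recorded


if __name__ == "__main__":
    print(simulate(world_size=3))  # only rank 0 records the message
```

Because the gating happens in `__getattr__`, the wrapper's own `obj` and `dist` attributes, set in `__init__`, are found normally and never routed through it.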