
Commit

Merge branch 'dev' into kevin/arm-builds
KevDevSha authored May 20, 2024
2 parents 487b9d2 + 57c7b72 commit 7a9a254
Showing 8 changed files with 91 additions and 32 deletions.
25 changes: 12 additions & 13 deletions README.md
@@ -17,7 +17,7 @@
<a href="https://www.mosaicml.com">[Website]</a>
- <a href="https://docs.mosaicml.com/projects/composer/en/stable/getting_started/installation.html">[Getting Started]</a>
- <a href="https://docs.mosaicml.com/projects/composer/">[Docs]</a>
-- <a href="https://www.mosaicml.com/careers">[We're Hiring!]</a>
+- <a href="https://www.databricks.com/company/careers/open-positions?department=Mosaic%20AI&location=all">[We're Hiring!]</a>
</p></h4>

<p align="center">
@@ -33,7 +33,7 @@
<a href="https://docs.mosaicml.com/projects/composer/en/stable/">
<img alt="Documentation" src="https://readthedocs.org/projects/composer/badge/?version=stable">
</a>
-<a href="https://mosaicml.me/slack">
+<a href="https://dub.sh/mcomm">
<img alt="Chat @ Slack" src="https://img.shields.io/badge/slack-chat-2eb67d.svg?logo=slack">
</a>
<a href="https://github.com/mosaicml/composer/blob/dev/LICENSE">
@@ -201,7 +201,7 @@ Next, check out our [Getting Started Colab](https://colab.research.google.com/gi

Once you’ve completed the Quick Start, you can go through the below tutorials or our [documentation](https://docs.mosaicml.com/projects/composer/en/stable/) to further familiarize yourself with Composer.

-If you have any questions, please feel free to reach out to us on our [Community Slack](https://mosaicml.me/slack)!
+If you have any questions, please feel free to reach out to us on our [Community Slack](https://dub.sh/mcomm)!

Here are some resources actively maintained by the Composer community to help you get started:
<table>
@@ -236,29 +236,28 @@ Here are some resources actively maintained by the Composer community to help yo
</tbody>
</table>

-# 🛠️ For Best Results, Use with the MosaicML Ecosystem
+# 🛠️ For Best Results, Use within the Databricks & MosaicML Ecosystem

Composer can be used on its own, but for the smoothest experience we recommend using it in combination with other components of the MosaicML ecosystem:

-![We recommend that you train models with Composer, MosaicML StreamingDatasets, and the MosaicML platform.](docs/source/_static/images/ecosystem.png)
+![We recommend that you train models with Composer, MosaicML StreamingDatasets, and Mosaic AI training.](docs/source/_static/images/ecosystem.png)

-- [**MosaicML platform**](https://www.mosaicml.com/training) (MCLI)- Our proprietary Command Line Interface (CLI) and Python SDK for orchestrating, scaling, and monitoring the GPU nodes and container images executing training and deployment. Used by our customers for training their own Generative AI models.
-- **To get started, [sign up here](https://www.mosaicml.com/get-started?utm_source=blog&utm_medium=referral&utm_campaign=llama2) to apply for access and check out our [Training](https://www.mosaicml.com/training) and [Inference](https://www.mosaicml.com/inference) product pages**
+- [**Mosaic AI training**](https://www.databricks.com/product/machine-learning/mosaic-ai-training) (MCLI)- Our proprietary Command Line Interface (CLI) and Python SDK for orchestrating, scaling, and monitoring the GPU nodes and container images executing training and deployment. Used by our customers for training their own Generative AI models.
+- **To get started, [reach out here](https://www.databricks.com/company/contact) and check out our [Training](https://www.databricks.com/product/machine-learning/mosaic-ai-training) product pages**
- [**MosaicML LLM Foundry**](https://github.com/mosaicml/llm-foundry) - This open source repository contains code for training, finetuning, evaluating, and preparing LLMs for inference with [Composer](https://github.com/mosaicml/composer). Designed to be easy to use, efficient and flexible, this codebase is designed to enable rapid experimentation with the latest techniques.
- [**MosaicML StreamingDataset**](https://github.com/mosaicml/streaming) - Open-source library for fast, accurate streaming from cloud storage.
- [**MosaicML Diffusion**](https://github.com/mosaicml/diffusion) - Open-source code to train your own Stable Diffusion model on your own data. Learn more via our blogs: ([Results](https://www.mosaicml.com/blog/stable-diffusion-2) , [Speedup Details](https://www.mosaicml.com/blog/diffusion))
- [**MosaicML Examples**](https://github.com/mosaicml/examples) - This repo contains reference examples for using the [MosaicML platform](https://www.notion.so/Composer-README-Draft-5d30690d40f04cdf8528f749e98782bf?pvs=21) to train and deploy machine learning models at scale. It's designed to be easily forked/copied and modified.

# **🏆 Project Showcase**

-Here are some projects and experiments that used Composer. Got something to add? Share in our [Community Slack](https://mosaicml.me/slack)!
+Here are some projects and experiments that used Composer. Got something to add? Share in our [Community Slack](https://dub.sh/mcomm)!

- [**MPT Foundation Series:**](https://www.mosaicml.com/mpt) Commercially usable open source LLMs, optimized for fast training and inference and trained with Composer.
- [MPT-7B Blog](https://www.mosaicml.com/blog/mpt-7b)
- [MPT-7B-8k Blog](https://www.mosaicml.com/blog/long-context-mpt-7b-8k)
- [MPT-30B Blog](https://www.mosaicml.com/blog/mpt-30b)
- [**Mosaic Diffusion Models**](https://www.mosaicml.com/blog/training-stable-diffusion-from-scratch-costs-160k): see how we trained a stable diffusion model from scratch for <$50k
-- [**replit-code-v1-3b**](https://huggingface.co/replit/replit-code-v1-3b): A 2.7B Causal Language Model focused on **Code Completion,** trained by Replit on the MosaicML platform in 10 days.
+- [**replit-code-v1-3b**](https://huggingface.co/replit/replit-code-v1-3b): A 2.7B Causal Language Model focused on **Code Completion,** trained by Replit on Mosaic AI training in 10 days.
- **BabyLLM:** the first LLM to support both Arabic and English. This 7B model was trained by MetaDialog on the world’s largest Arabic/English dataset to improve customer support workflows ([Blog](https://blogs.nvidia.com/blog/2023/08/31/generative-ai-startups-africa-middle-east/))
- [**BioMedLM**](https://www.mosaicml.com/blog/introducing-pubmed-gpt): a domain-specific LLM for Bio Medicine built by MosaicML and [Stanford CRFM](https://crfm.stanford.edu/)

@@ -268,17 +267,17 @@ Composer is part of the broader Machine Learning community, and we welcome any c

To start contributing, see our [Contributing](https://github.com/mosaicml/composer/blob/dev/CONTRIBUTING.md) page.

-P.S.: [We're hiring](https://www.mosaicml.com/careers)!
+P.S.: [We're hiring](https://www.databricks.com/company/careers/open-positions?department=Mosaic%20AI&location=all)!

# ❓FAQ

- **What is the best tech stack you recommend when training large models?**
- We recommend that users combine components of the MosaicML ecosystem for the smoothest experience:
- Composer
- [StreamingDataset](https://github.com/mosaicml/streaming)
-- [MCLI](https://www.mosaicml.com/training) (MosaicML platform)
+- [MCLI](https://www.databricks.com/product/machine-learning/mosaic-ai-training) (Databricks Mosaic AI Training)
- **How can I get community support for using Composer?**
-- You can join our [Community Slack](https://mosaicml.me/slack)!
+- You can join our [Community Slack](https://dub.sh/mcomm)!
- **How does Composer compare to other trainers like NeMo Megatron and PyTorch Lightning?**
- We built Composer to be optimized for both simplicity and efficiency. Community users have shared that they enjoy Composer for its capabilities and ease of use compared to alternative libraries.
- **How do I use Composer to train graph neural networks (GNNs), or Generative Adversarial Networks (GANs), or models for reinforcement learning (RL)?**
4 changes: 2 additions & 2 deletions composer/cli/launcher.py
@@ -549,11 +549,11 @@ def main():
    if os.environ.get(MOSAICML_PLATFORM_ENV_VAR, 'false').lower() == 'true' and str(
        os.environ.get(MOSAICML_LOG_DIR_ENV_VAR, 'false'),
    ).lower() != 'false' and os.environ.get(MOSAICML_GPU_LOG_FILE_PREFIX_ENV_VAR, 'false').lower() != 'false':
-        log.info('Logging all GPU ranks to Mosaic Platform.')
+        log.info('Logging all GPU ranks to Mosaic AI Training.')
        log_file_format = f'{os.environ.get(MOSAICML_LOG_DIR_ENV_VAR)}/{os.environ.get(MOSAICML_GPU_LOG_FILE_PREFIX_ENV_VAR)}{{local_rank}}.txt'
        if args.stderr is not None or args.stdout is not None:
            log.info(
-                'Logging to Mosaic Platform. Ignoring provided stdout and stderr args. To use provided stdout and stderr, set MOSAICML_LOG_DIR=false.',
+                'Logging to Mosaic AI Training. Ignoring provided stdout and stderr args. To use provided stdout and stderr, set MOSAICML_LOG_DIR=false.',
            )
        args.stdout = log_file_format
        args.stderr = None
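As an aside, here is a minimal standalone sketch of the gate above (not the launcher's actual code). The literal environment variable names are assumptions inferred from the constants in the diff and from the `MOSAICML_LOG_DIR=false` hint in the log message:

```python
# Hypothetical sketch, not composer's launcher code. Env var names are assumptions.
import os

def resolve_run_logs(stdout_arg, stderr_arg):
    on_platform = os.environ.get('MOSAICML_PLATFORM', 'false').lower() == 'true'
    log_dir = os.environ.get('MOSAICML_LOG_DIR', 'false')
    gpu_prefix = os.environ.get('MOSAICML_GPU_LOG_FILE_PREFIX', 'false')
    if on_platform and log_dir.lower() != 'false' and gpu_prefix.lower() != 'false':
        # Platform log capture wins: per-rank files are written and any
        # user-supplied stdout/stderr arguments are ignored.
        return f'{log_dir}/{gpu_prefix}{{local_rank}}.txt', None
    # Otherwise the user-provided destinations are used unchanged.
    return stdout_arg, stderr_arg
```

Setting `MOSAICML_LOG_DIR=false` before launching, as the log message suggests, keeps user-provided stdout/stderr destinations in effect.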
8 changes: 4 additions & 4 deletions composer/loggers/mosaicml_logger.py
@@ -1,7 +1,7 @@
# Copyright 2022 MosaicML Composer authors
# SPDX-License-Identifier: Apache-2.0

-"""Log to the MosaicML platform."""
+"""Log to Mosaic AI Training."""

from __future__ import annotations

@@ -42,12 +42,12 @@


class MosaicMLLogger(LoggerDestination):
-    """Log to the MosaicML platform.
+    """Log to Mosaic AI Training.

-    Logs metrics to the MosaicML platform. Logging only happens on rank 0 every ``log_interval``
+    Logs metrics to Mosaic AI Training. Logging only happens on rank 0 every ``log_interval``
    seconds to avoid performance issues.

-    When running on the MosaicML platform, the logger is automatically enabled by Trainer. To disable,
+    When running on Mosaic AI Training, the logger is automatically enabled by Trainer. To disable,
    the environment variable 'MOSAICML_PLATFORM' can be set to False.

    Args:
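For context, a rough usage sketch of controlling this logger outside the platform's auto-enablement. It assumes `MosaicMLLogger` is importable from `composer.loggers` and accepts the `log_interval` argument mentioned in the docstring; treat the exact signature as an assumption:

```python
import os

from composer.loggers import MosaicMLLogger

# Per the docstring, setting MOSAICML_PLATFORM to 'false' disables the logger
# that Trainer would otherwise attach automatically on Mosaic AI Training.
os.environ['MOSAICML_PLATFORM'] = 'false'

# Alternatively, attach it explicitly and throttle rank-0 logging to once a minute.
mosaic_logger = MosaicMLLogger(log_interval=60)  # log_interval in seconds (assumed kwarg)
```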
4 changes: 2 additions & 2 deletions composer/loggers/remote_uploader_downloader.py
@@ -673,8 +673,8 @@ def _upload_worker(

        # defining as a function-in-function to use decorator notation with num_attempts as an argument
        @retry(ObjectStoreTransientError, num_attempts=num_attempts)
-        def upload_file():
-            if not overwrite:
+        def upload_file(retry_index: int = 0):
+            if retry_index == 0 and not overwrite:
                try:
                    remote_backend.get_object_size(remote_file_name)
                except FileNotFoundError:
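A simplified sketch of the behavior this change introduces (not the library's exact code; the `FileExistsError` and helper name are illustrative): the `overwrite=False` existence check now runs only on the first attempt, so a partial object left behind by a failed upload can be overwritten when the decorated function is retried.

```python
# Illustrative sketch only, assuming a backend with get_object_size/upload_object.
def upload_with_overwrite_on_retry(remote_backend, remote_file_name, local_path,
                                   overwrite=False, retry_index=0):
    if retry_index == 0 and not overwrite:
        try:
            remote_backend.get_object_size(remote_file_name)
        except FileNotFoundError:
            pass  # nothing uploaded yet; safe to proceed
        else:
            raise FileExistsError(f'Object {remote_file_name} already exists')
    # On retries (retry_index > 0) the check is skipped, allowing the partial
    # object from the failed attempt to be overwritten.
    remote_backend.upload_object(remote_file_name, local_path)
```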
14 changes: 8 additions & 6 deletions composer/utils/retrying.py
@@ -7,6 +7,7 @@

import collections.abc
import functools
+import inspect
import logging
import random
import time
@@ -46,18 +47,16 @@ def retry( # type: ignore
    Attempts are spaced out with ``initial_backoff + 2**num_attempts + random.random() * max_jitter`` seconds.

+    Optionally, the decorated function can specify `retry_index` as an argument to receive the current attempt number.
+
    Example:
        .. testcode::

            from composer.utils import retry

-            num_tries = 0
-
            @retry(RuntimeError, num_attempts=3, initial_backoff=0.1)
-            def flaky_function():
-                global num_tries
-                if num_tries < 2:
-                    num_tries += 1
+            def flaky_function(retry_index: int):
+                if retry_index < 2:
                    raise RuntimeError("Called too soon!")
                return "Third time's a charm."
@@ -84,9 +83,12 @@ def wrapped_func(func: TCallable) -> TCallable:

        @functools.wraps(func)
        def new_func(*args: Any, **kwargs: Any):
+            retry_index_param = 'retry_index'
            i = 0
            while True:
                try:
+                    if retry_index_param in inspect.signature(func).parameters:
+                        kwargs[retry_index_param] = i
                    return func(*args, **kwargs)
                except exc_class as e:
                    log.debug(f'Attempt {i} failed. Exception type: {type(e)}, message: {str(e)}.')
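The updated docstring example is runnable as written; the trailing `assert` below is added here for illustration. Because `flaky_function` declares `retry_index`, the decorator injects the current attempt number, so no global counter is needed:

```python
from composer.utils import retry

@retry(RuntimeError, num_attempts=3, initial_backoff=0.1)
def flaky_function(retry_index: int):
    # Attempts 0 and 1 fail; attempt 2 (the third and final allowed attempt) succeeds.
    if retry_index < 2:
        raise RuntimeError('Called too soon!')
    return "Third time's a charm."

assert flaky_function() == "Third time's a charm."
```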
6 changes: 3 additions & 3 deletions docs/source/index.rst
@@ -144,6 +144,6 @@ Composer is part of the broader Machine Learning community, and we welcome any c
api_reference/*


-.. _Twitter: https://twitter.com/mosaicml
-.. _Email: mailto:community@mosaicml.com
-.. _Slack: https://mosaicml.me/slack
+.. _Twitter: https://twitter.com/DbrxMosaicAI
+.. _Email: mailto:mcomm@databricks.com
+.. _Slack: https://dub.sh/mcomm
2 changes: 1 addition & 1 deletion examples/checkpoint_autoresume.ipynb
@@ -10,7 +10,7 @@
"\n",
"We've put together this tutorial to demonstrate this feature in action and how you can activate it through the Composer trainer.\n",
"\n",
-"**🐕 Autoresume via Watchdog**: Composer autoresumption works best when coupled with automated node failure detection and retries on the MosaicML platform. \n",
+"**🐕 Autoresume via Watchdog**: Composer autoresumption works best when coupled with automated node failure detection and retries on Mosaic AI training. \n",
"See our [platform docs page](https://docs.mosaicml.com/projects/mcli/en/latest/training/watchdog.html) on enabling this feature for your runs\n",
"\n",
"### Recommended Background\n",
60 changes: 59 additions & 1 deletion tests/loggers/test_remote_uploader_downloader.py
@@ -15,7 +15,7 @@

from composer.core import Event, State
from composer.loggers import Logger, RemoteUploaderDownloader
-from composer.utils.object_store.object_store import ObjectStore
+from composer.utils.object_store.object_store import ObjectStore, ObjectStoreTransientError


class DummyObjectStore(ObjectStore):
@@ -190,6 +190,64 @@ def test_remote_uploader_downloader_no_overwrite(
)


+def test_allow_overwrite_on_retry(tmp_path: pathlib.Path, dummy_state: State):
+    file_path = tmp_path / 'samples' / 'sample'
+    os.makedirs(tmp_path / 'samples')
+    with open(file_path, 'w') as f:
+        f.write('sample')
+
+    # Dummy object store that fails the first two uploads
+    # This tests that the remote uploader downloader allows overwriting a partially uploaded file on a retry.
+    class RetryDummyObjectStore(DummyObjectStore):
+
+        def __init__(
+            self,
+            dir: Optional[pathlib.Path] = None,
+            always_fail: bool = False,
+            **kwargs: Dict[str, Any],
+        ) -> None:
+            self._retry = 0
+            super().__init__(dir, always_fail, **kwargs)
+
+        def upload_object(
+            self,
+            object_name: str,
+            filename: Union[str, pathlib.Path],
+            callback: Optional[Callable[[int, int], None]] = None,
+        ) -> None:
+            if self._retry < 2:
+                self._retry += 1 # Takes two retries to upload the file
+                raise ObjectStoreTransientError('Retry this')
+            self._retry += 1
+            return super().upload_object(object_name, filename, callback)
+
+        def get_object_size(self, object_name: str) -> int:
+            if self._retry > 0:
+                return 1 # The 0th upload resulted in a partial upload
+            return super().get_object_size(object_name)
+
+    fork_context = multiprocessing.get_context('fork')
+    with patch('composer.loggers.remote_uploader_downloader.S3ObjectStore', RetryDummyObjectStore):
+        with patch('composer.loggers.remote_uploader_downloader.multiprocessing.get_context', lambda _: fork_context):
+            remote_uploader_downloader = RemoteUploaderDownloader(
+                bucket_uri=f"s3://{tmp_path}/'object_store_backend",
+                backend_kwargs={
+                    'dir': tmp_path / 'object_store_backend',
+                },
+                num_concurrent_uploads=4,
+                upload_staging_folder=str(tmp_path / 'staging_folder'),
+                use_procs=True,
+                num_attempts=3,
+            )
+            logger = Logger(dummy_state, destinations=[remote_uploader_downloader])
+
+            remote_uploader_downloader.run_event(Event.INIT, dummy_state, logger)
+            remote_file_name = 'remote_file_name'
+            remote_uploader_downloader.upload_file(dummy_state, remote_file_name, file_path, overwrite=False)
+            remote_uploader_downloader.close(dummy_state, logger=logger)
+            remote_uploader_downloader.post_close()


@pytest.mark.parametrize('use_procs', [True, False])
def test_race_with_overwrite(tmp_path: pathlib.Path, use_procs: bool, dummy_state: State):
# Test a race condition with the object store logger where multiple files with the same name are logged in rapid succession
