Announcing Lightning v1.5

Lightning 1.5 introduces Fault-Tolerant Training, LightningLite, Loop Customization, Lightning Tutorials, the RichProgressBar, LightningCLI V2, and many more exciting features.

PyTorch Lightning team
PyTorch


PyTorch Lightning v1.5 marks a significant leap in reliability to support the increasingly complex demands of the leading AI organizations and prestigious research labs that rely on Lightning to develop and deploy AI at scale.

PyTorch Lightning's ambition has never been greater as we aim to become the simplest and most flexible framework for expediting any deep learning research to production.

Following this vision, the Lightning v1.5 release has made progress in several vital directions: Improved Stability, Improved On-boarding, Advanced Flexibility, Quality of Life Improvements, and SOTA Experimental Features.

Find the complete release notes here.

Improved Stability

Batch-Level Fault-Tolerant Training

Traditionally, training frameworks save checkpoints at the end of an epoch or after every N steps to recover in case of an accidental failure.

Lightning 1.5 extends this concept further by introducing a batch-level fault-tolerant training mechanism. When it is enabled and an unexpected failure occurs, Lightning users can resume training from the exact batch that failed.

There is no need for the user to do anything beyond rerunning the script 🤯 ! In the future, this will enable Elastic Training with Lightning.
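
For illustration, here is a minimal sketch of how the mechanism is switched on in v1.5 (it is experimental and gated behind the PL_FAULT_TOLERANT_TRAINING environment variable; see the documentation linked below for details):

    # Fault-tolerant training is opt-in via an environment variable (experimental in v1.5).
    import os
    os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"

    from pytorch_lightning import Trainer

    trainer = Trainer(max_epochs=10)
    # If the job crashes mid-epoch, rerun the same script: Lightning restores the
    # dataloader, loop, and optimizer state and resumes from the batch that failed.
    # trainer.fit(model, datamodule=datamodule)  # model / datamodule are your own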

Image by Phoeby Naren

Learn more in the documentation.

BFloat16 Support

PyTorch 1.10 introduces torch.bfloat16 support on both CPUs and GPUs, enabling more stable training than native Automatic Mixed Precision (AMP) with torch.float16.

To enable this in PyTorch Lightning, simply do the following:
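
A minimal sketch (the precision="bf16" flag requires PyTorch 1.10 or newer):

    from pytorch_lightning import Trainer

    # bf16 mixed precision on CPU or GPU
    trainer = Trainer(precision="bf16")

    # it combines with the usual accelerator flags, e.g.
    # trainer = Trainer(precision="bf16", accelerator="gpu", devices=1)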

Improved On-boarding

Trainer Strategy API

PyTorch Lightning in v1.5 introduces a new strategy flag enabling a cleaner distributed training API that also supports accelerator discovery!

  • accelerator refers to the hardware: cpu, gpu, tpu, ipu, and auto.
  • strategy refers to how the hardware is utilized: dp, ddp, ddp_spawn, deepspeed, etc.
  • devices refers to how many devices of the chosen accelerator to use.

Passing training strategies (e.g. "ddp") to the accelerator argument has been deprecated in v1.5.0 and will be removed in v1.7.0. Please use the strategy argument as explained above. TrainingTypePlugin will be renamed to StrategyPlugin in v1.6.
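
For example, a minimal sketch of the new flags (the values depend on your hardware):

    from pytorch_lightning import Trainer

    # 4 GPUs with DistributedDataParallel
    trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp")

    # let Lightning discover the available hardware
    trainer = Trainer(accelerator="auto", devices=1)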

PyTorch Lightning includes a registry that holds information about strategies and allows for the registration of new custom ones.

Additionally, you can pass your custom registered training type plugins to the strategy argument.
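
As a rough sketch (the registry object name below, TrainingTypePluginsRegistry, is our reading of the v1.5 plugins module; check the registry documentation for the exact import and signature):

    # Assumption: the strategy registry is exposed as TrainingTypePluginsRegistry in v1.5.
    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DDPPlugin, TrainingTypePluginsRegistry

    class MyDDP(DDPPlugin):
        """A DDP variant with project-specific behaviour."""

    TrainingTypePluginsRegistry.register("my_ddp", MyDDP, description="DDP with custom behaviour")

    # the registered name can now be passed to the strategy flag...
    trainer = Trainer(strategy="my_ddp", accelerator="gpu", devices=2)
    # ...or a configured plugin instance can be passed directly
    trainer = Trainer(strategy=DDPPlugin(find_unused_parameters=False), accelerator="gpu", devices=2)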

Lightning Lite | Stepping stone to Lightning

Do you want to keep complete control over your PyTorch code but struggle with accelerating it on CPUs, GPUs, and TPUs, adding multi-node support, or enabling mixed precision? Then Lite is the right choice for you!

Here’s how Lightning Lite makes adding multi-GPU training support easier than ever. The following 30-second animated graphic shows you how to scale your code while maintaining control of your training loop.

Once you use LightningLite, you get automatic accelerator and device discovery and can run the same code on GPUs or TPUs with any plugin, such as DeepSpeed ZeRO Stage 3.
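
As a condensed sketch of the pattern those scripts follow (the toy model and data below are ours, not taken from the examples):

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset
    from pytorch_lightning.lite import LightningLite

    class Lite(LightningLite):
        def run(self, epochs: int = 1):
            model = torch.nn.Linear(32, 2)
            optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
            dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
            dataloader = DataLoader(dataset, batch_size=8)

            model, optimizer = self.setup(model, optimizer)  # moves to device, wraps for DDP/AMP
            dataloader = self.setup_dataloaders(dataloader)  # adds a distributed sampler when needed

            for _ in range(epochs):
                for x, y in dataloader:
                    optimizer.zero_grad()
                    loss = F.cross_entropy(model(x), y)
                    self.backward(loss)  # replaces loss.backward()
                    optimizer.step()

    # scaling up only changes the constructor, e.g. Lite(accelerator="gpu", devices=2, strategy="ddp")
    Lite(accelerator="cpu").run()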

Below are five MNIST examples showing how to gradually convert from pure PyTorch to PyTorch Lightning using LightningLite.

  1. This script shows how to train a simple CNN over MNIST using vanilla PyTorch.
  2. This script shows you how to scale the previous script to enable GPU and multi-GPU training using LightningLite.
  3. This script shows you how to prepare your conversion from LightningLite to the LightningModule.
  4. This script shows you the result of the conversion to the LightningModule and finally all the benefits you get from Lightning.
  5. This script shows you how to extract the data related components into a LightningDataModule.

Find the dedicated blog post about LightningLite here and also check out the documentation.

Lightning Tutorials

Image by Phoeby Naren

The Lightning 1.5 docs are new and improved and include a course from the University of Amsterdam (UvA) that introduces the core concepts of state-of-the-art deep learning and familiarizes you with Lightning’s core features and ecosystem.

Find the associated blog post to learn more.

Soon, PyTorch Lightning will be hosting a Lightning Tutorial Community Sprint to partner with academics from all over the world to enhance their deep learning curricula by integrating Lightning’s new tutorial capabilities. Here is the issue tracking the current sprint and the associated Google Form to apply.

LightningCLI V2, No Boilerplate For Reproducible AI

Image by Phoeby Naren

Running non-trivial experiments often requires configuring many different trainer and model arguments, such as learning rates, batch sizes, number of epochs, data paths, data splits, number of GPUs, etc. These need to be exposed by the training script, as most experiments are launched from the command line.

Implementing command-line tools with libraries such as Python’s standard argparse to manage hundreds of possible trainer, data, and model configuration options is a huge source of boilerplate.
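
For instance, a typical hand-rolled training script accumulates something like this (an illustrative sketch):

    # every new option means another add_argument call plus the code to wire it up
    from argparse import ArgumentParser

    parser = ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=1e-3, help="optimizer learning rate")
    parser.add_argument("--batch_size", type=int, default=32, help="samples per training batch")
    parser.add_argument("--max_epochs", type=int, default=10, help="number of training epochs")
    parser.add_argument("--data_path", type=str, default="./data", help="where the dataset lives")
    parser.add_argument("--gpus", type=int, default=0, help="number of GPUs to train on")
    # ... dozens more for the trainer, the model, and the data ...
    args = parser.parse_args()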

This often leads to basic configurations being hard-coded and inaccessible for experimentation and reuse. Additionally, most of the configuration is duplicated in the signature and argument defaults, as well as docstrings and argument help messages.

Here is all you need to start using the LightningCLI:
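
A minimal sketch (MyModel and MyDataModule stand in for your own LightningModule and LightningDataModule; the import path shown is the v1.5 location):

    from pytorch_lightning.utilities.cli import LightningCLI  # v1.5 import path

    from my_project import MyDataModule, MyModel  # placeholders for your own classes

    def cli_main():
        # exposes every Trainer, model, and datamodule argument on the command line
        LightningCLI(MyModel, MyDataModule)

    if __name__ == "__main__":
        cli_main()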

For Lightning v1.5, we have implemented a new notation to easily instantiate objects directly from the command line. This dramatically improves the command line experience as you can customize almost any aspect of your training by referencing only class names.

This also works with optimizers and learning rate schedulers, as well as with the LightningModule and the DataModule:
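
For example, with a script like the one above (the command lines are illustrative; the exact flags available depend on your model, data, and registered classes):

    # customize callbacks by class name
    python trainer.py --trainer.callbacks=EarlyStopping --trainer.callbacks.monitor=val_loss --trainer.callbacks.patience=5

    # choose the optimizer and LR scheduler by class name
    python trainer.py --optimizer=Adam --optimizer.lr=0.01 --lr_scheduler=ExponentialLR --lr_scheduler.gamma=0.1

    # select a LightningModule / LightningDataModule by class name
    python trainer.py --model=MyModel --model.hidden_dim=128 --data=MyDataModule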

Finally, you can register your own components with the LightningCLI registries as follows:
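
As a rough sketch (the registry names follow the v1.5 pytorch_lightning.utilities.cli module; verify them against the documentation linked below):

    # Assumption: MODEL_REGISTRY / CALLBACK_REGISTRY are the v1.5 registry objects.
    from pytorch_lightning import Callback, LightningModule
    from pytorch_lightning.utilities.cli import CALLBACK_REGISTRY, MODEL_REGISTRY

    @MODEL_REGISTRY
    class MyModel(LightningModule):
        ...

    @CALLBACK_REGISTRY
    class MyPrintingCallback(Callback):
        ...

    # both can now be referenced by name on the command line,
    # e.g. --model=MyModel --trainer.callbacks=MyPrintingCallback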

For more information see the dedicated blog post here and the associated documentation.

Advanced Flexibility

Loop Customization

PyTorch Lightning was created to do the hard work for you. The Lightning Trainer automates all the mechanics of the training, validation, and test routines. To create your model, all you need to do is define the architecture and the training, validation, and test steps and Lightning will make sure to call the right thing at the right time.

Internally, the Lightning Trainer relies on a series of nested loops to properly conduct the gradient descent optimization that applies to 90%+ of machine learning use cases. Even though Lightning provides hundreds of features, behind the scenes, it looks like this:
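
In simplified pseudo-code (the comments name the corresponding Lightning loop classes):

    # simplified pseudo-code; hooks, error handling, and most features omitted
    for epoch in range(max_epochs):          # FitLoop
        for batch in train_dataloader:       # TrainingEpochLoop
            loss = training_step(batch)      # TrainingBatchLoop / OptimizerLoop
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        for batch in val_dataloader:         # EvaluationLoop
            validation_step(batch)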

However, some newer research use cases, such as meta-learning, active learning, cross-validation, and recommendation systems, require a different loop structure.

To address this, the Lightning Team implemented a generalized while-loop as a Python class, the Lightning Loop. Here is its pseudo-code; the full implementation can be found here.
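
Condensed from the actual base class (hooks and error handling omitted):

    class Loop:

        @property
        def done(self) -> bool:
            """Each loop defines its own stopping condition."""

        def reset(self) -> None:
            """Reset internal state before (re)starting, e.g. after a failure."""

        def advance(self, *args, **kwargs) -> None:
            """Run one unit of work: a batch, an epoch, or a whole child loop."""

        def run(self, *args, **kwargs):
            self.reset()
            self.on_run_start(*args, **kwargs)

            while not self.done:
                self.advance(*args, **kwargs)

            return self.on_run_end()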

Using Loops has several advantages:

  • You can replace, subclass, or wrap any loops within Lightning to customize their inner workings to your needs. This makes it possible to express any type of research with Lightning.
  • The Loops are standardized and each loop can be isolated from its parent and children. With a simple loop, you might end up with more code, but when dealing with hundreds of features, this structure is the key to scale while preserving a high level of flexibility.
  • A Loop can track and save its state within the model checkpoint. This is used by fault-tolerant training to enable automatic restarts.

Find the dedicated blog post here, along with its documentation. In the blog post, you will learn how the community created custom loops for active learning, cross-validation, and yielding from the LightningModule training step.

CheckpointIO Plugin

As part of our commitment to extensibility, we have abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their infrastructure. Find the documentation here.
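
A minimal sketch of a custom plugin (the method names follow the v1.5 CheckpointIO interface; check the documentation for the exact signatures):

    import os
    import torch
    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import CheckpointIO

    class SimpleTorchCheckpointIO(CheckpointIO):
        """Writes checkpoints with torch.save; swap in a cloud-storage client here."""

        def save_checkpoint(self, checkpoint, path, storage_options=None):
            torch.save(checkpoint, path)

        def load_checkpoint(self, path, **kwargs):
            return torch.load(path)

        def remove_checkpoint(self, path):
            os.remove(path)

    trainer = Trainer(plugins=[SimpleTorchCheckpointIO()])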

Quality of Life Improvements

Rich Progress Bar

Image by Phoeby Naren

We are excited to announce that Lightning now includes support for RichProgressBar and RichModelSummary to make the command-line training experience more visually appealing.

Rich is a Python library for rich text and beautiful formatting in the terminal.

All you have to do is pass the RichProgressBar callback to the Trainer, and Lightning handles the rest for you!
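
A minimal sketch (Rich must be installed, e.g. pip install rich):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import RichModelSummary, RichProgressBar

    trainer = Trainer(callbacks=[RichProgressBar(), RichModelSummary()])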

Both callbacks are easily extendable, allowing you to customize how the progress bar metrics and the model summary table are displayed and to style them to your preferences. Here is our Green Is Good theme.

SOTA Experimental Features

Init Meta Context

Right now, there is a race to create larger and larger models. However, these models don't fit on a single device. The current approach to scaling toward trillion-parameter models is to shard the model, i.e., chunk its parameters, activations, and optimizer states, as described in the ZeRO-3 paper. However, instantiating a large model is still complicated, as it requires all devices to be available and connected to perform the sharding.

To remedy this problem, PyTorch 1.10 introduced meta tensors. Meta tensors are like normal tensors, but they carry no data, so there is no risk of out-of-memory errors.

Using meta tensors, it is possible to instantiate a meta-model and then materialize it once all devices are connected.

PyTorch Lightning takes this idea further by introducing an init_meta_context utility with which the model can be instantiated on the meta device with no code changes 🤯. The code can be found here or here.
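
A rough sketch (the import path below is the experimental v1.5 location and may change; MyLightningModule is a placeholder for your own model):

    # Assumption: the experimental utility lives at pytorch_lightning.utilities.meta in v1.5.
    from pytorch_lightning.utilities.meta import init_meta_context

    with init_meta_context():
        model = MyLightningModule()  # weights are allocated on the "meta" device: no memory is used

    # a sharded strategy such as DeepSpeed ZeRO Stage 3 can then materialize
    # the real weights directly on the target devices, e.g.
    # trainer = Trainer(strategy="deepspeed_stage_3", precision=16, accelerator="gpu", devices=8)
    # trainer.fit(model, datamodule=datamodule)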

This enables scaling minGPT to 45 Billion parameters with minimal code changes. Learn more here.

Inter Batch Parallelism

Inter-batch parallelism hides the latency of the host-to-device copy of input batches behind computationally intensive operations: while the current batch is being processed, the next one is already being transferred.

The associated speed-up can be significant when training a large recommendation engine with PyTorch Lightning. More information will be shared soon.

Enable this experimental feature as follows:
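
A rough sketch (the environment variable name below is our best reading of the v1.5 release notes; verify it there before relying on it):

    # Assumption: the experimental toggle is an environment variable named PL_INTER_BATCH_PARALLELISM.
    import os
    os.environ["PL_INTER_BATCH_PARALLELISM"] = "1"

    # training then prefetches and copies the next batch to the device
    # while the current batch is still being processed
    # trainer.fit(model, datamodule=datamodule)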

Next Steps

The Lightning Team is committed more than ever to providing the best possible experience to anyone doing optimization with PyTorch. With the PyTorch Lightning API already stable, breaking changes will be minimal.

If you're interested in helping out with these efforts, find us on Slack!

Built by the PyTorch Lightning creators, Grid.ai is our platform for scaling model training without worrying about infrastructure, much as Lightning automates the training itself.

You can get started with Grid.ai for free with just a GitHub or Google Account.

Grid.ai enables you to scale training from your laptop to the cloud without having to modify a single line of code. While Grid supports all the classic machine learning frameworks such as TensorFlow, Keras, and PyTorch, you can use any library you wish. Leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and the CLI makes the traditional MLOps behind model training seem invisible.


We are the core contributors team developing PyTorch Lightning — the deep learning research framework to run complex models without the boilerplate