Today, almost everyone is aware of the power of data and how useful it can be for solving various problems. However, not everyone can use data efficiently to derive beneficial insights, for several reasons:
- Lack of awareness: Not being aware that data on hand can be analyzed more efficiently
- Lack of computing resources
- Lack of know-how: Not having enough people on the team that can process and analyze data
Emerging data science technologies such as PyTorch Lightning can make a data scientist's life easier, helping them focus on research instead of struggling with computational problems.
What is PyTorch?
PyTorch is an open-source machine learning library based on the Torch library. It is mostly used for machine learning tasks such as computer vision and natural language processing, and it was initially developed by Facebook's AI Research (FAIR) team. The most common interface to the library is Python, but it is also available in C++.
Plenty of prominent deep learning software has been built on top of PyTorch, including Uber's Pyro, Tesla's Autopilot, HuggingFace's Transformers, and PyTorch Lightning. The high-level features that PyTorch provides can be listed as:
- Tensor computing (like NumPy) with strong acceleration via GPUs
- Deep neural networks built on an automatic differentiation system
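These two features can be demonstrated in a few lines of plain PyTorch (the tensor values here are purely illustrative):

```python
import torch

# Tensor computing with a NumPy-like API; runs on the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(3, 3, device=device, requires_grad=True)

# Automatic differentiation: y = sum(x**2), so dy/dx = 2*x
y = (x ** 2).sum()
y.backward()

print(x.grad)  # a 3x3 tensor filled with 2.0
```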
What is PyTorch Lightning?
PyTorch Lightning is not a newer version of PyTorch but an open-source, lightweight interface built on top of it. It organizes PyTorch code and adds features that allow users to build and deploy complex models with less engineering effort.
As the complexity and scale of deep learning evolved, some software and hardware started to become inadequate. PyTorch Lightning was developed within the PyTorch ecosystem to keep up with emerging technologies and give users a better experience while building deep learning models. PyTorch was built in an era when AI research was mostly about network architectures, and plenty of complex models for research or production were built with it. However, as models started to interact with each other, as in Generative Adversarial Networks (GANs) or Bidirectional Encoder Representations from Transformers (BERT), adopting new technologies became inevitable.
What are the benefits of using PyTorch Lightning?
PyTorch Lightning encapsulates deep learning systems: collections of models that interact with each other. This means that Lightning is built for today's more complicated research and production cases, where many models interact using complex rules. For example, GAN models may interact with each other to yield more accurate results, and PyTorch Lightning makes this interaction simpler to implement than it used to be. You can check this website for a real-life application of GAN models, which creates a new artificial human face every time you refresh the page.
Overcome hardware limitations
PyTorch Lightning aims to let users focus on science and research instead of worrying about how they will deploy the complex models they are building. Sometimes models are simplified so that they can run on the computers available in the company. However, by using cloud technologies, PyTorch Lightning allows users to debug a model that normally requires 512 GPUs on their laptop's CPUs, without needing to change any part of the code. You can read more about this cloud-based service that grid.ai provides and join the waitlist.
PyTorch claims that Lightning has a growing contributor community of 300+ talented deep learning people around the world. This community includes researchers, academic staff, and others who are aware of the needs that emerging technologies bring. As a result, the community provides pertinent solutions that make PyTorch Lightning a more convenient library for certain machine learning tasks.
What are the key features of PyTorch Lightning?
- Scaling ML/DL models to run on any hardware (CPU, GPUs, TPUs) without changing the model
- Making code more readable by decoupling the research code from the engineering
- Making models easier to reproduce
- Automating most of the training loop
- Removing boilerplates (sections of code that have to be included in many places with little or no alteration)
- Out-of-the-box integration with popular logging/visualizing frameworks such as Tensorboard, MLFlow, Neptune.ai, Comet.ml and Wandb
- Tested with every combination of supported PyTorch and Python versions, operating systems, multiple GPUs, and TPUs
- PyTorch Lightning has minimal running speed overhead (about 300 ms per epoch compared with PyTorch)
- Computing metrics such as accuracy, precision, recall etc. across multiple GPUs
- Automating the optimization process of training models
What’s new in PyTorch Lightning?
Here, we deep dive into some of the new features.
Research & Production
Lightning’s main goal is to allow professional researchers to try the hardest ideas on the largest compute resources without losing any flexibility. With the launch of PyTorch Lightning, data scientists or researchers can now be the people who also put models into production, as there will not be a need for large teams of machine learning engineers. This helps businesses to cut production times without losing any flexibility needed for research.
A metrics API was also created for easy metric development and usage in PyTorch Lightning. The updated API provides a built-in method to compute metrics across multiple GPUs, while at the same time storing statistics that allow users to compute the metric at the end of an epoch, without having to worry about any of the complexities associated with the distributed backend. Common metrics such as accuracy, precision, and recall ship with documented implementations.
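The accumulate-then-compute pattern behind that API can be sketched in plain PyTorch. The class below illustrates the idea and is not Lightning's actual implementation: update() gathers per-batch statistics (which the real API also syncs across GPUs), and compute() turns them into an epoch-level value.

```python
import torch

class RunningAccuracy:
    """Illustrative metric: accumulate counts per batch, compute at epoch end."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, target):
        # Per-batch statistics; in the real metrics API these counts
        # would also be reduced across distributed processes.
        self.correct += (preds.argmax(dim=-1) == target).sum().item()
        self.total += target.numel()

    def compute(self):
        return self.correct / self.total

metric = RunningAccuracy()
for _ in range(3):  # three hypothetical batches
    preds = torch.tensor([[0.9, 0.1], [0.2, 0.8]])
    target = torch.tensor([0, 1])
    metric.update(preds, target)

epoch_accuracy = metric.compute()
print(epoch_accuracy)  # 1.0 -- every prediction above is correct
```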
Manual vs automatic optimization
Users no longer need to worry about enabling/disabling grads, doing backward passes, or updating optimizers, as long as they return a loss with an attached graph from the training_step, like:
def training_step(self, batch, batch_idx):
    loss = self.encoder(batch)
    return loss
Lightning automates the optimization. However, some research, such as GANs or reinforcement learning, where multiple optimizers or an inner loop are present, may require turning off automatic optimization. In that case, users can turn it off and fully control the training loop themselves by simply passing automatic_optimization=False as a parameter while defining the Trainer:
trainer = Trainer(automatic_optimization=False)
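To make clear what is being automated (or handed back to the user), here is the equivalent loop in plain PyTorch; the linear model and random data are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# The steps Lightning's automatic optimization runs for you:
for _ in range(5):
    optimizer.zero_grad()                         # clear old gradients
    loss = nn.functional.mse_loss(model(x), y)    # forward pass
    loss.backward()                               # backward pass
    optimizer.step()                              # update parameters
```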
By calling the log() method anywhere in a LightningModule, users can send the logged quantity to the logger of their choice. Depending on where log() is called from, Lightning auto-determines when the logging should take place (on every step or every epoch), but users can override the default behavior manually by using the on_step and on_epoch parameters:
def training_step(self, batch, batch_idx):
    loss = self.encoder(batch)
    self.log('my_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
    return loss
Setting on_epoch=True accumulates logged values over the full training epoch.
PyTorch Lightning automatically saves a checkpoint for the user in the current working directory, with the state of the last training epoch. This ensures that the user can resume training in case it is interrupted.
Users can customize the checkpointing behavior to monitor any quantity from the training or validation steps. For example, to update checkpoints based on validation loss, the user can follow these steps:
- Calculate the desired metric or other quantity to be monitored (e.g. validation loss)
- Log the quantity using the log() method, with a key such as val_loss.
- Initialize the ModelCheckpoint callback, and set monitor to be the key of the quantity.
- Pass the callback to the checkpoint_callback Trainer flag.
How to turn your PyTorch code into PyTorch Lightning?
Since the library introduces new abstractions, some modifications to existing code are necessary if you want to port a project built with PyTorch to PyTorch Lightning. You can get an idea of how to convert your code by watching the following video:
If you want to learn more about how to turn your PyTorch code into PyTorch Lightning, feel free to watch the following in-depth tutorial:
Which parts of ML/DL research can be automated with PyTorch Lightning?
- GPU training (multi/single)
- Distributed GPU (cluster) training
- TPU training
- Experiment management
If you have further questions please do not hesitate to contact us:
How can we do better?
Your feedback is valuable. We will do our best to improve our work based on it.