RTX 2080Ti Vs GTX 1080Ti: FastAI Mixed Precision training & comparisons on CIFAR-100

A quick peek at neutron:

What is FP16 and why should you care?

Deep Learning is a bunch of matrix op(s) being handled by your GPU. These generally happen using something called FP32, or 32-bit Floating point matrices.

With recent GPU architectures and CUDA releases, FP16 (16-bit floating point) computation has become easy. Because each tensor takes half the space, you can either crunch through more examples by increasing your batch_size, or use less GPU RAM than FP32 training (also known as Full Precision training) would need.

In plain English, you can replace (batch_size) with roughly (batch_size)*2 in your code.
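As a rough illustration of the halving, here is a minimal sketch that counts the bytes needed to hold one input batch. The batch sizes and the 32x32 RGB image shape are assumptions picked for the example, not the settings used in the benchmarks, and the helper `batch_bytes` is hypothetical.

```python
import struct

# Bytes per element for IEEE 754 half ("e") and single ("f") precision.
FP16_BYTES = struct.calcsize("e")
FP32_BYTES = struct.calcsize("f")


def batch_bytes(batch_size, channels, height, width, bytes_per_element):
    """Memory needed to hold one batch of images as a dense tensor."""
    return batch_size * channels * height * width * bytes_per_element


# Assumed example: 64 images in FP32 versus 128 images in FP16.
fp32_batch = batch_bytes(64, 3, 32, 32, FP32_BYTES)
fp16_batch = batch_bytes(128, 3, 32, 32, FP16_BYTES)

print(f"FP32, batch of 64:  {fp32_batch} bytes")
print(f"FP16, batch of 128: {fp16_batch} bytes")
```

Both batches occupy the same number of bytes. In a real training run the gain is usually a little below 2x, since not everything held on the GPU is stored in FP16.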

Tensor cores are much faster at FP16 computation, which means that you get a speed boost and use less GPU RAM as well!

Wait, it isn’t that easy though

Issues involved with Half Precision (so named because 16-bit floating point variables use half as many bits as 32-bit floating point variables):

  • Weight update is imprecise.
  • Gradients can underflow.
  • Activations or loss can overflow.

All of these follow from the reduced precision and narrower range of FP16.
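The three issues can be reproduced without a GPU. The sketch below uses Python's struct module, whose "e" format is IEEE 754 half precision, to round values to FP16. The helper `to_fp16` and the sample numbers are assumptions chosen for illustration.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range (max 65504);
        # real FP16 hardware would produce infinity instead.
        return math.copysign(math.inf, x)


# 1. Imprecise weight update: near 1.0 the gap between FP16 values is about
#    0.001, so adding 0.0001 changes nothing.
updated_weight = to_fp16(to_fp16(1.0) + to_fp16(1e-4))

# 2. Gradient underflow: 1e-8 is below the smallest positive FP16 value.
tiny_gradient = to_fp16(1e-8)

# 3. Activation or loss overflow: 70000 is above the largest finite FP16 value.
large_activation = to_fp16(70000.0)

print(updated_weight)    # 1.0
print(tiny_gradient)     # 0.0
print(large_activation)  # inf
```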

Enter, Mixed Precision

Mixed Precision

To avoid the above-mentioned issues, we do operations in FP16 and switch to FP32 wherever we suspect a loss in precision. Hence, Mixed Precision.

Step 1: Use FP16 wherever possible, for faster compute:

The input tensors are converted to fp16 tensors to allow for faster processing

Step 2: Use FP32 to compute loss (To avoid under/overflow):

The tensors are converted back to FP32 to compute loss values in order to avoid under/overflow.

Step 3: Update the weights in FP32:

The FP32 tensors are used to update the weights and then converted back to FP16 to allow forward and backward passes.
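To see why the update is done on an FP32 copy, here is a minimal sketch that applies the same small update many times to an FP16 weight and to an FP32 weight. The update size, step count and helper names are assumptions for illustration; struct's "e" and "f" formats stand in for FP16 and FP32 storage.

```python
import struct


def to_fp16(x):
    """Round to IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round to IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


UPDATE = 1e-4  # assumed value of learning_rate * gradient
STEPS = 1000

fp16_weight = to_fp16(1.0)    # weight stored and updated in FP16 only
master_weight = to_fp32(1.0)  # FP32 copy of the same weight

for _ in range(STEPS):
    fp16_weight = to_fp16(fp16_weight - UPDATE)
    master_weight = to_fp32(master_weight - UPDATE)

# FP16 copy handed back to the model for the next forward/backward pass.
model_weight = to_fp16(master_weight)

print(f"FP16-only weight:    {fp16_weight}")
print(f"FP32 master weight:  {master_weight:.4f}")
print(f"FP16 copy of master: {model_weight:.4f}")
```

The FP16-only weight never moves from 1.0 because every individual update is rounded away, while the FP32 copy accumulates all of them and ends up close to 0.9.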

Step 4: Scale the loss so that small gradients do not underflow:

The loss is multiplied by a loss scaling factor before the backward pass, and the resulting gradients are divided by the same factor before the weights are updated.
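The effect of loss scaling on a single small gradient can be sketched as follows. The scale factor of 1024 and the gradient value are assumptions picked for the example; multiplying the gradient directly stands in for scaling the loss, since the backward pass is linear in the loss.

```python
import struct


def to_fp16(x):
    """Round to IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 1024.0  # assumed factor; a power of two keeps the scaling exact
true_gradient = 1e-8

# Without scaling, the gradient is flushed to zero when stored in FP16.
unscaled = to_fp16(true_gradient)

# With scaling, the gradient lands inside the FP16 range.
scaled = to_fp16(true_gradient * LOSS_SCALE)

# Before the weight update the gradient is divided by the same factor.
# This division happens in higher precision (a Python float here).
recovered = scaled / LOSS_SCALE

print(f"unscaled in FP16: {unscaled}")
print(f"recovered value:  {recovered:.3e}")
```

The recovered gradient is within about one percent of the true value, whereas the unscaled one is lost entirely.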

To Summarize:

Mixed Precision in fast.ai

As one may expect from fast.ai, doing mixed precision training in the library is as easy as changing:

learn = Learner(data, model, metrics=[accuracy])

to

learn = Learner(data, model, metrics=[accuracy]).to_fp16()

You can read the exact details of what happens when you do that here.

The module runs the forward and backward passes of training in fp16, which gives a speedup.

Internally, the callback ensures that all model parameters (except batchnorm layers, which require fp32) are converted to fp16, and an fp32 copy is also saved. The fp32 copy (the master parameters) is what is used for actually updating with the optimizer; the fp16 parameters are used for calculating gradients. This helps avoid underflow with small learning rates.

RTX 2080Ti Vs GTX 1080Ti Mixed Precision Training

Setup

The Benchmark Notebooks can be found here

Software Setup:

  • CUDA 10 + corresponding latest cuDNN
  • PyTorch + fastai Library (Compiled from source)
  • Latest Nvidia drivers (at time of writing)

Hardware config:

Our hardware configurations differ slightly, so do take the values with a grain of salt.

Tuatini’s setup

  • i7–7700K
  • 32GB RAM
  • GTX 1080Ti (EVGA)

My Setup:

  • i7–8700K
  • 64GB RAM
  • RTX 2080Ti (MSI Gaming Trio X)

Since the process is neither very RAM-intensive nor very CPU-intensive, we chose to share our results here.

Quick Walkthrough:

  • Feed in CIFAR-100 data
  • Resize the images, enable data augmentation
  • Run on all Resnets supported by fastai

Expected output:

  • Better performance across all tests for Mixed Precision training.

Individual Graphs

Below are graphs of training times for the respective ResNets.

Note: Lower is better (the X-axis represents time in seconds or scaled time, depending on the graph)

Resnet 18

The smallest Resnet of all.

  • Time in seconds:

  • Time-scaled:

Resnet 34

  • Time in seconds:

  • Time scaled:

Resnet 50

  • Time in seconds:

  • Time scaled:

Resnet 101

  • Time in seconds:

  • Time-scaled:

Resnet 152

  • Time in seconds:

  • Time-scaled:

Word Level Language Modelling using Nvidia Apex

To allow experimentation with Mixed Precision and FP16 training, Nvidia has released Nvidia Apex, which is a set of NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.

Check out the repo here

It also features a few examples that we can run directly without much tweaking, so this seemed to be another good test for a quick spin.

Language Modelling comparison:

The example in the GitHub repo trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task. By default, the training script uses the provided Wikitext-2 dataset. The trained model can then be used by the generate script to generate new text.

We weren’t concerned with the generation of text. Our comparisons are based on training the example for 30 epochs with Mixed Precision and with Full Precision, using the same batch sizes on the different setups.

Enabling fp16 is as easy as passing a “--fp16” argument when running the code. Apex works on top of the PyTorch environment that we had already set up, so this seemed to be a perfect choice.
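For readers who have not seen this pattern, here is a minimal sketch of how a training script can expose such a switch with argparse. It is a hypothetical stand-in and not the Apex example script: the parser, the `--epochs` option and its default are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Command-line options for a hypothetical training script."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--fp16", action="store_true",
                        help="train in half precision instead of FP32")
    parser.add_argument("--epochs", type=int, default=30,
                        help="number of training epochs")
    return parser


# Simulate running: python train.py --fp16
args = build_parser().parse_args(["--fp16"])

precision = "FP16" if args.fp16 else "FP32"
print(f"Training for {args.epochs} epochs in {precision}")
```

Because the flag uses `store_true`, leaving it out keeps the script in full precision, so the same command line can be used for both sides of a comparison.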

Below are the results from the same:

  • Time (seconds)

  • Time (Scaled):

Conclusion

Although performance-wise the RTX cards are much more powerful than the 1080Ti, for smaller networks especially, the difference in train time isn’t as pronounced as I had expected.

If you decide to try Mixed Precision training, a few bonus points are:

  • Bigger batch sizes:
    In the test notebooks, we could consistently use an almost 1.8x larger batch_size across all of the Resnet examples that we tried.
  • Faster than Full precision training:
    If you look at the example of Resnet 101, where the difference is the highest, Full Precision training takes 1.18x the time on a 2080Ti and 1.13x the time on a 1080Ti for our CIFAR-100 example. A slight speedup is always visible during training, even for the “smaller” Resnet34 and Resnet50.
  • Similar accuracy values:
    We did not notice any drop in the accuracy values when using Mixed Precision training.
  • Ensure you’re using CUDA>9 and are on the latest version of Nvidia-drivers.

During the testing, we were not able to run the code until we had updated our environments.

  • Check out fastai and Nvidia APEX
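To put the Resnet 101 ratios quoted above in more familiar terms, this small sketch converts a "Full Precision takes N times as long" ratio into the share of training time saved by Mixed Precision. The helper `time_saved` is hypothetical, and the only inputs are the two ratios already given.

```python
def time_saved(full_over_mixed):
    """Fraction of full-precision training time saved by mixed precision,
    given the ratio t_full / t_mixed."""
    return 1.0 - 1.0 / full_over_mixed


for ratio in (1.18, 1.13):
    print(f"{ratio:.2f}x -> {time_saved(ratio):.1%} less training time")
```

A ratio of 1.18x works out to roughly 15% less training time, and 1.13x to roughly 11.5%.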

If you have any questions, please leave a note or comment below.
