Why AI progress is faster than Moore’s Law — the age of the algorithm

Moore’s original 1965 paper, “Cramming more components onto integrated circuits,” contained a number of incredible insights. One of them has been shorthanded into “computers get 2x faster every 2 years.”

Moore’s vision has come to pass. You can buy a “handy home computer” at Walmart and pick up some deodorant at the same time.

Figure from Moore’s paper

Computer algorithms for non-machine-learning problems have not improved nearly as much. Quicksort, for example, a commonly used sorting algorithm, turns 60 years old next year.

Animation of Quicksort, from Wikipedia
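
For reference, the core idea of Quicksort fits in a few lines. Here’s a minimal Python sketch (a simple, not-in-place version for illustration, not the tuned variants shipped in standard libraries):

```python
def quicksort(items):
    """Sort a list by recursively partitioning it around a pivot."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)


print(quicksort([3, 6, 1, 8, 2, 9, 4]))  # [1, 2, 3, 4, 6, 8, 9]
```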

Deep learning algorithm improvement

Machine learning models have been progressing at a much faster rate. Here’s an example: a comparison of a subsystem used in many object detection systems, called “region proposal networks”:

“mAP” (mean average precision) is a measure of how good the network is. Being able to achieve a similar score in less time is the goal here. Faster R-CNN gets a 250x speedup compared to the original approach, and is 10x faster than Fast R-CNN.
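
If you’d like to poke at one of these detectors yourself, here’s a minimal sketch assuming the torchvision library, which ships a Faster R-CNN pretrained on COCO (purely illustrative, not the benchmark setup from the papers):

```python
import torch
import torchvision

# Load a Faster R-CNN detector pretrained on COCO (downloads weights on first use).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy RGB image: 3 channels, 300x400 pixels, values in [0, 1].
image = torch.rand(3, 300, 400)

with torch.no_grad():
    predictions = model([image])  # the model takes a list of images

# Each prediction holds bounding boxes, class labels, and confidence scores.
print(predictions[0]["boxes"].shape, predictions[0]["scores"].shape)
```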

So how long did it take for Faster to get a 10x improvement? A few years?

No.

Both Fast and Faster were published in the same year, 2015. Yes, that’s right: we saw a 10x speedup within a single year.

We see the error, or how many mistakes the AI makes, decrease over time too. For example, in the ImageNet Large Scale Visual Recognition Challenge the error has gone down from 28% to about 2% with Squeeze-and-Excitation networks. The big jump in 2012 was the switch to a deep learning based approach, with AlexNet gathering an incredible 27,571 citations as of the time of writing.

See http://www.image-net.org/challenges/LSVRC/

Algorithms that benefit from more data

Deep learning models benefit from more data. They make use of more data at train time to produce a better-quality model. When the model gets used (at test time) it runs at a similar speed, regardless of the amount of data the original network was trained on³.
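
A rough way to see this with a toy model: in the NumPy sketch below (my own illustration, not from the article), the training set size varies by 100x, but the fitted model always has the same shape, so a single prediction costs the same either way.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

for n_train in (1_000, 100_000):
    # "Train": fit a least-squares linear model on n_train examples.
    X = rng.normal(size=(n_train, 64))
    y = X @ rng.normal(size=64) + rng.normal(size=n_train)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)

    # "Test": predicting one new example is a fixed-size dot product,
    # so its cost does not depend on how much data we trained on.
    x_new = rng.normal(size=64)
    start = time.perf_counter()
    prediction = x_new @ weights
    elapsed = time.perf_counter() - start
    print(f"trained on {n_train:>7} examples -> one prediction in {elapsed * 1e6:.1f} µs")
```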

This is in direct contrast to “traditional” programming, which typically tries to reduce the amount of data being processed. In fact there’s a whole notation, Big O, dedicated to describing how an algorithm’s cost grows with the size of its input.
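
As a quick illustration of the kind of thing Big O describes (a toy example added here, not from the article): searching a million sorted numbers can take a million steps with a linear scan, but only about twenty with binary search.

```python
def linear_search(items, target):
    """O(n): in the worst case, looks at every element."""
    for index, value in enumerate(items):
        if value == target:
            return index
    return -1


def binary_search(sorted_items, target):
    """O(log n): halves the search space each step, but needs sorted input."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1


data = list(range(1_000_000))
print(linear_search(data, 999_999), binary_search(data, 999_999))
```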

The availability of data is increasing, for example from cameras. One way this increased availability of data helps machine learning models is transfer learning. More data may mean more powerful pre-trained models and easier access to new data for fine-tuning. (A sketch of this follows the figure below.)

Left: North America camera module market by application, 2012–2022, (USD Million)
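
Here’s what transfer learning looks like in practice, as a sketch assuming torchvision’s ResNet-18 pretrained on ImageNet (the class count and the dummy batch are placeholders for whatever data you fine-tune on):

```python
import torch
import torchvision

# Start from a ResNet-18 pretrained on ImageNet.
model = torchvision.models.resnet18(pretrained=True)

# Freeze the pretrained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer for a new task with, say, 5 classes.
num_classes = 5  # placeholder: depends on your fine-tuning dataset
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Only the new layer's parameters get updated during fine-tuning.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```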

Specialized hardware

Specialized hardware, such as GPUs, is also helping. This has been covered in a lot of depth by others, so I’ll just briefly show this example:

https://www.rtinsights.com/gpus-the-key-to-cognitive-computing/

Here we see a 10x performance improvement over 6 years, whereas doubling performance every 2 years would only equate to an 8x improvement over the same period (three doublings: 2 × 2 × 2 = 8). Further improvements are likely with even more specialized circuit design.

Progress

In contrast to traditional programming, AI algorithms have been improving at a faster pace. Hardware designed for these new algorithms drives further progress.

Algorithms + specialized hardware is driving this

But why is the hardware part faster than Moore’s Law? The difference is hardware designed for a specific purpose, versus generally adding more power. For example, a graphics unit is not always better than a central processor.

If your cellphone ran exclusively on a graphics unit, your battery might go dead halfway through the day and it would likely feel sluggish for common tasks.

However, specialized hardware does see a big return on tasks it is good at, such as deep learning. This specialized hardware can be constructed in a way that bypasses CPU limitations.
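
To get a feel for the gap, here’s a small timing sketch assuming PyTorch is installed; it compares a large matrix multiplication (the core operation in deep learning) on the CPU and, if one is available, on a CUDA GPU:

```python
import time
import torch


def time_matmul(device, size=4096, repeats=10):
    """Average time for a size x size matrix multiplication on the given device."""
    a = torch.rand(size, size, device=device)
    b = torch.rand(size, size, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish before stopping the clock
    return (time.perf_counter() - start) / repeats


print(f"CPU: {time_matmul(torch.device('cpu')):.3f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul(torch.device('cuda')):.3f} s per matmul")
else:
    print("No CUDA GPU available on this machine.")
```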

“And these cores can be added with a linear increase in computational ability, bypassing today’s Moore’s Law limits…” — Bruce Pile¹

OK, but what about algorithms? How are they really different?

In instruction-driven programming we have provably optimal ways of doing certain things. If the assumptions underlying the proof hold, the proposed method is the best we think we will ever have for that problem. For example, no comparison-based sort can beat O(n log n) comparisons in the worst case.

We haven’t gotten to “provably optimal” yet with deep learning based approaches!

There is a huge effort in the community to progress the state of the art, to discover entirely new approaches and optimize existing ones. A specific example of this is capsule networks.

Capsule networks operate on a list of values, e.g. [1, 4, 65, 1], at their “lowest” level, in comparison to normal networks whose neurons operate on a single value, e.g. “1”. As you can imagine, going from a single number to a whole list of numbers opens up a new world of possibilities.
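
To make the scalar-versus-vector distinction concrete, here is a small NumPy sketch of the “squash” non-linearity from the capsule networks paper, which keeps a capsule’s output pointing in the same direction while squashing its length to between 0 and 1 (a sketch of the idea, not a full capsule layer):

```python
import numpy as np


def squash(vector, eps=1e-9):
    """Capsule non-linearity: preserve direction, squash length into [0, 1)."""
    norm = np.linalg.norm(vector)
    scale = norm**2 / (1.0 + norm**2)
    return scale * vector / (norm + eps)


# A conventional neuron emits a single activation.
scalar_activation = 1.0

# A capsule emits a vector: its length reads as "how likely is this entity present",
# and its direction can encode properties such as pose.
capsule_output = squash(np.array([1.0, 4.0, 65.0, 1.0]))
print(scalar_activation)
print(capsule_output, np.linalg.norm(capsule_output))
```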
