Numenta + Intel achieve 123x inference performance improvement for BERT Transformers

Read our press release for more details on our collaboration with Intel in accelerating inference performance on the new 4th Generation Intel Xeon Scalable Processor (formerly codenamed Sapphire Rapids) and the Intel Xeon CPU Max Series (formerly codenamed Sapphire Rapids + HBM).

CHALLENGE

Meeting the high-throughput, low-latency demands of real-time NLP applications

For real-time Natural Language Processing (NLP) applications, high-throughput, low-latency technologies are a requirement. While Transformers have become fundamental within NLP, their size and complexity have made it nearly impossible to meet these rigorous performance demands cost-effectively. An example application requiring both low latency and high throughput is Conversational AI, one of the fastest-growing AI and NLP markets, expected to reach $40 billion by 2030.

SOLUTION

Numenta’s optimized models plus Intel’s latest Xeon Scalable Processor unlock performance gains

In collaboration with Intel, we were able to combine our proprietary, neuroscience-based technology for dramatically accelerating Transformer networks with Intel’s new Advanced Matrix Extensions (Intel AMX) available in the 4th Gen Intel Xeon Scalable processors (formerly codenamed Sapphire Rapids). This synergistic combination of algorithm advances and hardware advances led to unparalleled performance gains for BERT inference on short text sequences.

“Numenta and Intel are collaborating to deliver substantial performance gains to Numenta’s AI solutions through the Intel Xeon CPU Max Series and 4th Gen Intel Xeon Scalable processors. We’re excited to work together to unlock significant throughput performance accelerations for previously bandwidth-bound or latency-bound AI applications such as Conversational AI and large document processing.”

Scott Clark

Vice president and general manager of AI and HPC Application Level Engineering, Intel

RESULTS

123x throughput performance improvement and sub-9ms latencies for BERT-Large Transformers

We integrated our technology into Intel’s OpenVINO toolkit and compared the inference performance of Numenta’s BERT-Large model on the 4th generation 56-core Intel Xeon Scalable Processor against traditional ONNX BERT-Large SQuAD models on a variety of processors. The chart illustrates the throughput improvements achieved when comparing Numenta’s BERT model on the new Intel processor with an equivalent traditional BERT-Large model running on 48-core AMD Milan, 32-core Intel 3rd generation, and 56-core Intel 4th generation Xeon processors, all with a batch size of 1.

In this example, we optimized for latency, imposing the 10ms restriction that’s often used for real-time applications. Numenta’s BERT-Large model on Intel’s 4th generation Xeon processor was the only combination able to achieve the sub-10ms latency threshold. These results illustrate a highly scalable, cost-effective option for running the large deep learning models necessary for Conversational AI and other real-time AI applications.

Numenta’s neuroscience-based AI solutions achieve two orders of magnitude performance improvement for BERT Transformers on the new Intel Xeon Processor.
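The comparison above rests on a standard batch-size-1 measurement: time each inference call, then report latency and derive throughput. The following is a minimal, generic sketch of such a harness, not Numenta’s or Intel’s actual benchmark code; `infer` is a placeholder for any model callable (e.g. a compiled OpenVINO inference request), and the warmup/iteration counts are illustrative.

```python
import time
import statistics

def benchmark(infer, n_warmup=10, n_iters=100):
    """Measure batch-size-1 latency (ms) and throughput (inferences/sec).

    `infer` is any zero-argument callable that runs one forward pass.
    Warmup iterations are discarded so caches and JIT paths settle first.
    """
    for _ in range(n_warmup):
        infer()

    latencies = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        infer()
        latencies.append((time.perf_counter() - t0) * 1000.0)  # ms

    return {
        "mean_ms": statistics.mean(latencies),
        # p99 latency: the value 99% of requests fall under
        "p99_ms": sorted(latencies)[int(0.99 * len(latencies)) - 1],
        # batch size 1, so throughput is simply inferences per second
        "throughput": 1000.0 * n_iters / sum(latencies),
    }
```

A real-time SLA check then reduces to asserting that the measured p99 latency stays under the 10ms budget.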


BENEFITS

Breakthrough performance gains open new possibilities for real-time NLP and AI

The technology synergies between Numenta and Intel have turned Transformers from costly and complex into a highly performant, cost-effective solution for real-time NLP industries like Conversational AI.

  • Industry-leading performance breakthroughs for Transformer networks, with 123x throughput speed-up while maintaining sub-9ms latencies
  • Cost-effective options for running the large deep learning models necessary for AI and NLP applications
  • New possibilities for many time-sensitive AI applications that can finally deploy Transformer models in production

Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
