Numenta + Intel achieve 123x inference performance improvement for BERT Transformers


Intel recently released the new 4th Generation Intel Xeon Scalable Processor (formerly codenamed Sapphire Rapids). Read our press release for more details on our collaboration with Intel in accelerating inference performance for large language models.

CHALLENGE

Meeting the high-throughput, low-latency demands of real-time NLP applications

Real-time Natural Language Processing (NLP) applications require high-throughput, low-latency technologies. Transformers have become fundamental to NLP, but their size and complexity have made it nearly impossible to meet these performance demands cost-effectively. Conversational AI, one of the fastest-growing AI and NLP markets and expected to reach $40 billion by 2030, is one example of an application that requires both low latency and high throughput.

SOLUTION

Numenta’s optimized models plus Intel’s latest Xeon Scalable Processor unlock performance gains

In collaboration with Intel, we combined our proprietary, neuroscience-based technology for accelerating Transformer networks with Intel’s new Advanced Matrix Extensions (Intel AMX), available in 4th Gen Intel Xeon Scalable processors (formerly codenamed Sapphire Rapids). This combination of algorithmic and hardware advances led to unparalleled performance gains for BERT inference on short text sequences.

“Numenta and Intel are collaborating to deliver substantial performance gains to Numenta’s AI solutions through the Intel Xeon CPU Max Series and 4th Gen Intel Xeon Scalable processors. We’re excited to work together to unlock significant throughput performance accelerations for previously bandwidth-bound or latency-bound AI applications such as Conversational AI and large document processing.”

Scott Clark

Vice president and general manager of AI and HPC Application Level Engineering, Intel

RESULTS

123x throughput performance improvement and sub-9ms latencies for BERT-Large Transformers

We integrated our technology into Intel’s OpenVINO toolkit and compared the inference performance of Numenta’s BERT-Large model on the 56-core 4th Gen Intel Xeon Scalable processor with the performance of traditional ONNX BERT-Large SQuAD models on a variety of processors. The chart illustrates the throughput improvements achieved when comparing Numenta’s BERT model on the new Intel processor with an equivalent traditional BERT-Large model running on 48-core AMD Milan, 32-core 3rd Gen Intel Xeon, and 56-core 4th Gen Intel Xeon processors, with a batch size of 1.

In this example, we optimized for latency, imposing the 10ms limit that is often used for real-time applications. Numenta’s BERT-Large model on Intel’s 4th Gen Xeon processor was the only combination able to meet the sub-10ms latency threshold. These results illustrate a highly scalable, cost-effective option for running the large deep learning models necessary for Conversational AI and other real-time AI applications.
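The comparison above rests on two measurements: per-inference latency at batch size 1 and the resulting throughput. The following is a minimal sketch of how such a measurement could be set up, not the harness used for the published results. The names `benchmark`, `meets_budget`, and `speedup` are hypothetical, `infer` stands in for any callable that runs one inference, and throughput is estimated for a single stream only, whereas a multi-core benchmark would typically run several streams in parallel.

```python
import statistics
import time


def benchmark(infer, n_warmup=10, n_runs=100):
    """Time a zero-argument callable that runs one batch-size-1 inference.

    `infer` is a hypothetical stand-in for a real model call.
    """
    # Warm-up runs are discarded so one-time setup costs do not skew timings.
    for _ in range(n_warmup):
        infer()

    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.mean(latencies_ms)
    # Single-stream estimate: inferences per second if calls run back to back.
    throughput = 1000.0 / mean_ms if mean_ms > 0 else float("inf")
    return {
        "mean_ms": mean_ms,
        "max_ms": max(latencies_ms),
        "throughput_per_s": throughput,
    }


def meets_budget(result, budget_ms=10.0):
    """Return True if the slowest observed inference stayed under the budget."""
    return result["max_ms"] < budget_ms


def speedup(candidate, baseline):
    """Return the throughput ratio of a candidate setup over a baseline setup."""
    return candidate["throughput_per_s"] / baseline["throughput_per_s"]
```

Running `benchmark` once per model and processor combination, then passing each result to `meets_budget` and `speedup`, gives the same kind of comparison the chart summarizes: which setups stay under the latency limit, and how much more throughput one delivers than another.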

BENEFITS

Breakthrough performance gains open new possibilities for real-time NLP and AI

By combining their technologies, Numenta and Intel have turned Transformers from a costly, complex technology into a high-performance, cost-effective solution for real-time NLP applications such as Conversational AI.

- Industry-leading performance breakthroughs for Transformer networks, with a 123x throughput speed-up while maintaining sub-9ms latencies
- Cost-effective options for running the large deep learning models necessary for AI and NLP applications
- New possibilities for many time-sensitive AI applications that can finally deploy Transformer models in production

ADDITIONAL RESOURCES

- Press Release: Numenta Achieves 123x Inference Performance Improvement for BERT Transformers on Intel Xeon Processor Family
- Blog: A New Performance Standard for BERT Transformers with Numenta + Intel
- Intel Developer page: Intel AI Platform Overview

Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.

