
Turbo-charging CPU inference throughput with optimized transformer networks
Numenta Transformer models significantly accelerate CPU inference while maintaining competitive accuracy.
Read our press release for more details on our collaboration with Intel to accelerate inference performance on the new 4th Generation Intel Xeon Scalable Processor (formerly codenamed Sapphire Rapids) and the Intel Xeon CPU Max Series (formerly codenamed Sapphire Rapids HBM).
For real-time Natural Language Processing (NLP) applications, high-throughput, low-latency technologies are a requirement. While Transformers have become fundamental to NLP, their size and complexity have made it nearly impossible to meet these rigorous performance demands cost-effectively. One example application requiring both low latency and high throughput is Conversational AI, one of the fastest growing AI and NLP markets, expected to reach $40 billion by 2030.
In collaboration with Intel, we were able to combine our proprietary, neuroscience-based technology for dramatically accelerating Transformer networks with Intel’s new Advanced Matrix Extensions (Intel AMX), available in the 4th Gen Intel Xeon Scalable processors (formerly codenamed Sapphire Rapids). This combination of algorithmic and hardware advances led to unparalleled performance gains for BERT inference on short text sequences.
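As a quick sanity check before running anything that depends on AMX, you can confirm that the host CPU actually advertises the feature. Below is a minimal sketch assuming a Linux host, where recent kernels report the amx_tile, amx_bf16, and amx_int8 flags in /proc/cpuinfo; it is illustrative only and not part of Numenta’s or Intel’s tooling.

```python
# Minimal sketch: check whether the CPU advertises Intel AMX support on Linux.
# Assumes /proc/cpuinfo lists feature flags (amx_tile, amx_bf16, amx_int8);
# not part of Numenta's or Intel's official tooling.
from pathlib import Path

AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}

def detect_amx(cpuinfo_path: str = "/proc/cpuinfo") -> set:
    """Return the AMX-related flags reported by the kernel, if any."""
    for line in Path(cpuinfo_path).read_text().splitlines():
        if line.startswith("flags"):
            reported = set(line.split(":", 1)[1].split())
            return AMX_FLAGS & reported
    return set()

if __name__ == "__main__":
    found = detect_amx()
    if found:
        print(f"AMX available: {sorted(found)}")
    else:
        print("No AMX flags reported; AMX acceleration will not be used.")
```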
“Numenta and Intel are collaborating to deliver substantial performance gains to Numenta’s AI solutions through the Intel Xeon CPU Max Series and 4th Gen Intel Xeon Scalable processors. We’re excited to work together to unlock significant throughput performance accelerations for previously bandwidth-bound or latency-bound AI applications such as Conversational AI and large document processing.”
Scott Clark
Vice president and general manager of AI and HPC Application Level Engineering, Intel
We integrated our technology into Intel’s OpenVINO toolkit and compared the inference performance of Numenta’s BERT-Large model on the 4th generation 56-core Intel Xeon Scalable processor with the performance of traditional ONNX BERT-Large SQuAD models on a variety of processors. The chart illustrates the throughput improvements achieved when comparing Numenta’s BERT model on the new Intel processor with an equivalent traditional BERT-Large model running on a 48-core AMD Milan, a 32-core 3rd generation Intel Xeon, and a 56-core 4th generation Intel Xeon processor, all with a batch size of 1.
In this example, we optimized for latency, imposing a 10 ms restriction that is often used for real-time applications. Numenta’s BERT-Large model on Intel’s 4th generation Xeon processor was the only configuration able to meet the sub-10 ms latency threshold. These results illustrate a highly scalable, cost-effective option for running the large deep learning models necessary for Conversational AI and other real-time AI applications.
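For readers who want to approximate this kind of measurement, the sketch below shows how a batch-size-1, latency-oriented BERT benchmark can be run with OpenVINO’s Python API. The model path, sequence length, and input names are placeholders; this is an illustrative harness, not Numenta’s optimized model or exact benchmark code.

```python
# Illustrative latency benchmark for a BERT-style model with OpenVINO (batch size 1).
# The model path and input names are placeholders; Numenta's optimized BERT-Large
# and exact benchmark harness are not shown here.
import time
import numpy as np
from openvino.runtime import Core

MODEL_XML = "bert_large_squad.xml"   # placeholder: an OpenVINO IR of a BERT-Large model
SEQ_LEN = 64                         # short text sequence, as in the results above
N_ITERS = 1000

core = Core()
model = core.read_model(MODEL_XML)
# Ask the runtime to tune for low latency rather than bulk throughput.
compiled = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "LATENCY"})
request = compiled.create_infer_request()

# Dummy batch-size-1 inputs; real input names depend on how the model was exported.
inputs = {
    "input_ids": np.zeros((1, SEQ_LEN), dtype=np.int64),
    "attention_mask": np.ones((1, SEQ_LEN), dtype=np.int64),
    "token_type_ids": np.zeros((1, SEQ_LEN), dtype=np.int64),
}

latencies = []
for _ in range(N_ITERS):
    start = time.perf_counter()
    request.infer(inputs)
    latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds

latencies.sort()
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99) - 1]
mean = sum(latencies) / len(latencies)
print(f"p50 latency: {p50:.2f} ms, p99 latency: {p99:.2f} ms")
print(f"throughput at batch 1: {1000.0 / mean:.1f} inferences/s")
```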
The technology synergies between Numenta and Intel have turned Transformers from costly and complex into a highly performant, cost-effective solution ideal for real-time NLP applications such as Conversational AI.
Press Release: Numenta Achieves 123x Inference Performance Improvement for BERT Transformers on Intel Xeon Processor Family
Blog: A New Performance Standard for BERT Transformers with Numenta + Intel
Intel Developer page: Intel AI Platform Overview
Xeon Series Product Brief: Intel® Xeon® CPU Max Series Product Brief
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.