Intel recently released the new 4th Generation Intel Xeon Scalable Processor (formerly codenamed Sapphire Rapids). Read our press release for more details on our collaboration with Intel in accelerating inference performance for large language models.
Meeting the high-throughput, low-latency demands of real-time NLP applications
For real-time Natural Language Processing (NLP) applications, high-throughput, low-latency technologies are a requirement. While Transformers have become fundamental to NLP, their size and complexity have made it nearly impossible to meet these rigorous performance demands cost-effectively. One application that requires both low latency and high throughput is Conversational AI, one of the fastest-growing AI and NLP markets, expected to reach $40 billion by 2030.
In collaboration with Intel, we combined our proprietary, neuroscience-based technology for dramatically accelerating Transformer networks with Intel’s new Advanced Matrix Extensions (Intel AMX), available in the 4th Gen Intel Xeon Scalable processors (formerly codenamed Sapphire Rapids). This combination of algorithmic and hardware advances led to unparalleled performance gains for BERT inference on short text sequences.
“Numenta and Intel are collaborating to deliver substantial performance gains to Numenta’s AI solutions through the Intel Xeon CPU Max Series and 4th Gen Intel Xeon Scalable processors. We’re excited to work together to unlock significant throughput performance accelerations for previously bandwidth-bound or latency-bound AI applications such as Conversational AI and large document processing.”
Vice president and general manager of AI and HPC Application Level Engineering, Intel
123x throughput performance improvement and sub-9ms latencies for BERT-Large Transformers
We integrated our technology into Intel’s OpenVINO toolkit and compared the inference performance of Numenta’s BERT-Large model on the 56-core 4th Gen Intel Xeon Scalable processor with the performance of traditional ONNX BERT-Large SQuAD models on a variety of processors. The chart illustrates the throughput improvements achieved by Numenta’s BERT model on the new Intel processor over an equivalent traditional BERT-Large model running on 48-core AMD Milan, 32-core 3rd Gen Intel Xeon, and 56-core 4th Gen Intel Xeon processors, all with a batch size of 1.
In this example, we optimized for latency, imposing the 10 ms limit that is often used for real-time applications. Numenta’s BERT-Large model on Intel’s 4th Gen Xeon processor was the only combination able to meet the sub-10 ms latency threshold. These results point to a highly scalable, cost-effective option for running the large deep learning models necessary for Conversational AI and other real-time AI applications.
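To make the latency-constrained setup more concrete, here is a minimal sketch of how one might time a model at a batch size of 1 and check it against a 10 ms budget. It is not the benchmark harness behind the results above. The function name `benchmark_batch1`, its parameters, and the stand-in workload are all hypothetical, and a real measurement would call an actual inference runtime in place of the placeholder.

```python
import statistics
import time
from typing import Callable


def benchmark_batch1(
    infer: Callable[[], object],
    n_requests: int = 200,
    warmup: int = 20,
    budget_ms: float = 10.0,
) -> dict:
    """Run `infer` one request at a time and summarize latency and throughput.

    `infer` is a hypothetical zero-argument callable that performs a single
    batch-size-1 inference; `budget_ms` is the latency limit to check against.
    """
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        infer()

    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        infer()
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed_s = time.perf_counter() - start

    # Tail latency matters for real-time use, so compare the 99th percentile
    # (not the mean) to the budget.
    latencies_ms.sort()
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    p99_ms = latencies_ms[p99_index]

    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p99_ms": p99_ms,
        "throughput_per_s": n_requests / elapsed_s,
        "meets_budget": p99_ms <= budget_ms,
    }


if __name__ == "__main__":
    # Placeholder workload standing in for a real model call.
    print(benchmark_batch1(lambda: sum(range(1000))))
```

Checking the 99th-percentile latency rather than the mean is a design choice of this sketch: a real-time service usually cares about its slowest requests, not just typical ones.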
Breakthrough performance gains open new possibilities for real-time NLP and AI
The technology synergies between Numenta and Intel have turned Transformers from a costly, complex option into a high-performance, cost-effective solution for real-time NLP applications such as Conversational AI.
Industry-leading performance breakthroughs for Transformer networks, with 123x throughput speed-up while maintaining sub-9ms latencies
Cost-effective options for running the large deep learning models necessary for AI and NLP applications
New possibilities for many time-sensitive AI applications that can finally deploy Transformer models in production
With our neuroscience-based optimization techniques, we shift the model accuracy scaling laws such that at a fixed cost, or a given performance level, our models achieve higher accuracies than their standard counterparts.