With our neuroscience-based optimization techniques, we shift the model accuracy scaling laws such that at a fixed cost, or a given performance level, our models achieve higher accuracies than their standard counterparts.
Intel recently released the Xeon Max Series CPU, a version of the 4th Gen Intel Xeon Scalable processors. It is the first x86 CPU with high bandwidth memory (HBM), delivering up to 3x the memory bandwidth of processors without this technology.
Processing Long Text Sequences Often Faces Bandwidth Limitations
Large language models (LLMs), such as GPT and BERT, are becoming increasingly sophisticated, with the ability to generate human-like responses and analyze large amounts of unstructured data. However, running these models efficiently in production for long sequence length tasks, such as analysis of long articles or documents, has become a challenge, as the models often face memory bandwidth limitations. As a result, the processor’s computational power is frequently underutilized, forcing many to run inference on GPUs instead of CPUs, which is far less cost-efficient and much more time-intensive to maintain.
Running Numenta’s AI Inference Server on Intel Xeon CPU Max Series processors
To demonstrate the advantages of Numenta running on Intel’s latest processor, we chose to show BERT-Large inference throughput improvements at long sequence lengths of 512 tokens. Leveraging Intel’s AMX instruction set, Numenta applied our neuroscience-based techniques to a custom-trained version of the BERT-Large model on the Intel Xeon CPU Max Series processor and compared it with a standard BERT-Large model running on a current AMD Milan system.
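A throughput comparison like the one above boils down to timing repeated forward passes at a fixed sequence length. The following is a minimal sketch of that measurement loop in PyTorch; the model here is a small generic transformer encoder standing in for BERT-Large, and the batch size, layer count, and hidden dimension are illustrative assumptions, not Numenta’s or Intel’s actual benchmark configuration.

```python
# Hedged sketch: measuring transformer inference throughput at a long
# (512-token) sequence length. The model and all sizes below are
# illustrative stand-ins, not the actual BERT-Large benchmark setup.
import time
import torch
import torch.nn as nn

SEQ_LEN = 512   # the "long" sequence length from the benchmark
BATCH = 8       # illustrative batch size
HIDDEN = 256    # small stand-in for BERT-Large's 1024 hidden dim

encoder_layer = nn.TransformerEncoderLayer(
    d_model=HIDDEN, nhead=4, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=2).eval()

# Random embeddings standing in for tokenized input.
tokens = torch.randn(BATCH, SEQ_LEN, HIDDEN)

with torch.inference_mode():
    model(tokens)                       # warm-up pass
    n_iters = 5
    start = time.perf_counter()
    for _ in range(n_iters):
        model(tokens)
    elapsed = time.perf_counter() - start

throughput = BATCH * n_iters / elapsed  # sequences per second
print(f"{throughput:.1f} sequences/sec at seq_len={SEQ_LEN}")
```

Running the same loop on two machines (or two builds of the model) and comparing the resulting sequences/sec figures is the essence of the throughput comparison described above; a production benchmark would add more warm-up iterations, pinned threads, and many more timed runs.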
“Numenta and Intel are collaborating to deliver substantial performance gains to Numenta’s AI solutions through the Intel Xeon CPU Max Series and 4th Gen Intel Xeon Scalable processors. We’re excited to work together to unlock significant throughput performance accelerations for previously bandwidth-bound or latency-bound AI applications such as Conversational AI and large document processing.”
Vice President and General Manager of AI and HPC Application Level Engineering, Intel
20x inference throughput for long sequence lengths
This combination of Numenta and Intel technology led to a 20x gain in inference throughput for LLMs with extended sequence lengths compared to AMD Milan processors. Furthermore, Numenta’s optimized model still achieves an order of magnitude speed-up on the 4th Gen Xeon CPU without the HBM capability, which is more adept at handling shorter sequence lengths.
Slash Costs, Boost Efficiency for Large Language Models
Running Numenta models on Intel’s high bandwidth CPUs enables unparalleled performance speedups for longer sequence length tasks, dramatically reducing the overall cost of running language models in production. This speedup allows customers to:
- Process large documents with high sequence lengths at impressive speeds without sacrificing accuracy
- Realize significant cost savings
- Eliminate the need for resource-intensive GPUs
- Unlock new NLP capabilities for diverse applications without breaking the bank
Press Release: Numenta Achieves 123x Inference Performance Improvement for BERT Transformers on Intel Xeon Processor Family
Technical Blog: Numenta and Intel Accelerate Inference
Xeon Series Product Brief: Intel® Xeon® CPU Max Series Product Brief
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Interested in working with us?
Related Case Studies
Numenta technologies combined with the new Advanced Matrix Extensions (Intel AMX) in the 4th Gen Intel Xeon Scalable processors yield breakthrough results.
Numenta Transformer models significantly accelerate CPU inference while maintaining competitive accuracy.