20x inference acceleration for long sequence length tasks on Intel Xeon Max Series CPUs

Intel recently released the Xeon Max Series CPU, a version of the 4th Gen Intel Xeon Scalable processors. It is the first x86 CPU with high bandwidth memory (HBM), offering 3x the memory bandwidth of processors without this technology.

CHALLENGE

Processing Long Text Sequences Is Often Limited by Memory Bandwidth

Large language models (LLMs), such as GPT and BERT, are becoming increasingly sophisticated, with the ability to generate human-like responses and analyze large amounts of unstructured data. However, running these models efficiently in production for long sequence length tasks, such as analyzing long articles or documents, has become a challenge, as the models often run into memory bandwidth limitations. As a result, the processor’s computational power is frequently underutilized, forcing many to run inference on GPUs instead of CPUs, an approach that is far less cost-efficient and much more time-intensive to maintain.

SOLUTION

Running Numenta’s AI Inference Server on Intel Xeon CPU Max Series processors

To demonstrate the advantages of Numenta running on Intel’s latest processor, we chose to show BERT-Large inference throughput improvements at a long sequence length of 512. Leveraging Intel’s AMX instruction set, we applied our neuroscience-based techniques to a custom-trained version of the BERT-Large model on the Intel Xeon CPU Max Series processor and compared it with a standard BERT-Large model running on a current AMD Milan system.
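For readers who want a feel for the kind of measurement involved, the sketch below shows one common way to benchmark BERT-Large throughput at a 512-token sequence length on a CPU using PyTorch’s bfloat16 autocast path, which oneDNN can dispatch to AMX on 4th Gen Intel Xeon processors. It is an illustrative sketch only, not Numenta’s inference server or optimized model; the model name, batch size, and iteration counts are assumptions.

# Minimal benchmarking sketch (illustrative only; not Numenta's inference
# server or optimized model). Assumes PyTorch and Hugging Face transformers
# are installed; batch size and iteration counts are arbitrary choices.
import time
import torch
from transformers import AutoModel, AutoTokenizer

SEQ_LEN = 512          # long sequence length highlighted in this case study
BATCH = 8              # illustrative batch size

model = AutoModel.from_pretrained("bert-large-uncased").eval()
tok = AutoTokenizer.from_pretrained("bert-large-uncased")

# Build a batch padded to a fixed 512-token length.
texts = ["example text"] * BATCH
inputs = tok(texts, padding="max_length", truncation=True,
             max_length=SEQ_LEN, return_tensors="pt")

# bfloat16 autocast on CPU lets oneDNN use AMX tiles where available.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    for _ in range(3):                 # warm-up passes
        model(**inputs)
    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        model(**inputs)
    elapsed = time.perf_counter() - start

print(f"Throughput: {BATCH * iters / elapsed:.1f} sequences/sec")

Running the same script on each system under comparison yields a like-for-like sequences-per-second figure at the 512-token sequence length.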

“Numenta and Intel are collaborating to deliver substantial performance gains to Numenta’s AI solutions through the Intel Xeon CPU Max Series and 4th Gen Intel Xeon Scalable processors. We’re excited to work together to unlock significant throughput performance accelerations for previously bandwidth-bound or latency-bound AI applications such as Conversational AI and large document processing.”

Scott Clark

Vice President and General Manager of AI and HPC Application Level Engineering, Intel

RESULTS

20x inference throughput for long sequence lengths

This synergistic combination of Numenta and Intel technology led to a 20x gain in inference throughput for LLMs with extended sequence lengths compared to AMD Milan processors. Furthermore, Numenta’s optimized model still achieves an order of magnitude speed-up on the 4th Gen Xeon CPU without the HBM capability, which is better suited to handling shorter sequence lengths.

For a long sequence length of 512, this combination of technologies achieved 20x the throughput of current-generation AMD Milan processors.

BENEFITS

Slash Costs, Boost Efficiency for Large Language Models

Running Numenta models on Intel’s high-bandwidth CPUs enables unparalleled performance speedups for longer sequence length tasks, dramatically reducing the overall cost of running language models in production. This speedup allows customers to:

  • Process large documents with high sequence lengths at impressive speeds without sacrificing accuracy
  • Realize significant cost savings
  • Eliminate the need for resource-intensive GPUs
  • Unlock new NLP capabilities for diverse applications without breaking the bank

ADDITIONAL RESOURCES

Press Release: Numenta Achieves 123x Inference Performance Improvement for BERT Transformers on Intel Xeon Processor Family
Technical Blog: Numenta and Intel Accelerate Inference
Xeon Series Product Brief: Intel® Xeon® CPU Max Series Product Brief


Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
