20x inference acceleration for long sequence length tasks on Intel Xeon Max Series CPUs

Intel recently released the Xeon Max Series CPU, a version of the 4th Gen Intel Xeon Scalable processors. It is the first x86 CPU with high bandwidth memory (HBM), offering 3x the memory bandwidth of processors without this technology.

CHALLENGE

Processing Long Text Sequences Is Often Limited by Memory Bandwidth

Large language models (LLMs), such as GPT and BERT, are becoming increasingly sophisticated, with the ability to generate human-like responses and analyze large amounts of unstructured data. However, running these models efficiently in production for long sequence length tasks, such as analyzing long articles or documents, has become a challenge, as the models often run into memory bandwidth limitations. As a result, the processor’s computational power is frequently underutilized, forcing many to run inference on GPUs instead of CPUs, an approach that is far less cost-efficient and much more time-intensive to maintain.

SOLUTION

Running Numenta’s AI Inference Server on Intel Xeon CPU Max Series processors

To demonstrate the advantages of Numenta running on Intel’s latest processor, we chose to show BERT-Large inference throughput improvements at a long sequence length of 512. Leveraging Intel’s AMX instruction set, we applied our neuroscience-based techniques to a custom-trained version of the BERT-Large model on the Intel Xeon CPU Max Series processor and compared it with a standard BERT-Large model running on a current AMD Milan system.
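For readers who want a feel for the kind of measurement involved, the sketch below shows one common way to benchmark BERT-Large throughput at a 512-token sequence length on a CPU using PyTorch’s bfloat16 autocast path, which oneDNN can dispatch to AMX on 4th Gen Intel Xeon processors. It is an illustrative sketch only, not Numenta’s inference server or optimized model; the model name, batch size, and iteration counts are assumptions.

# Minimal benchmarking sketch (illustrative only; not Numenta's inference
# server or optimized model). Assumes PyTorch and Hugging Face transformers
# are installed; batch size and iteration counts are arbitrary choices.
import time
import torch
from transformers import AutoModel, AutoTokenizer

SEQ_LEN = 512          # long sequence length highlighted in this case study
BATCH = 8              # illustrative batch size

model = AutoModel.from_pretrained("bert-large-uncased").eval()
tok = AutoTokenizer.from_pretrained("bert-large-uncased")

# Build a batch padded to a fixed 512-token length.
texts = ["example text"] * BATCH
inputs = tok(texts, padding="max_length", truncation=True,
             max_length=SEQ_LEN, return_tensors="pt")

# bfloat16 autocast on CPU lets oneDNN use AMX tiles where available.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    for _ in range(3):                 # warm-up passes
        model(**inputs)
    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        model(**inputs)
    elapsed = time.perf_counter() - start

print(f"Throughput: {BATCH * iters / elapsed:.1f} sequences/sec")

Running the same script on each system under comparison yields a like-for-like sequences-per-second figure at the 512-token sequence length.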

“Numenta and Intel are collaborating to deliver substantial performance gains to Numenta’s AI solutions through the Intel Xeon CPU Max Series and 4th Gen Intel Xeon Scalable processors. We’re excited to work together to unlock significant throughput performance accelerations for previously bandwidth-bound or latency-bound AI applications such as Conversational AI and large document processing.”

Scott Clark

Vice President and General Manager of AI and HPC Application Level Engineering, Intel

RESULTS

20x inference throughput for long sequence lengths

This synergistic combination of Numenta and Intel technology led to a 20x gain in inference throughput for LLMs with extended sequence lengths compared to AMD Milan processors. Furthermore, Numenta’s optimized model still achieves an order of magnitude speed-up on the 4th Gen Xeon CPU without the HBM capability, which is better suited to handling shorter sequence lengths.

For a long sequence length of 512, this combination of technologies achieved 20x the throughput of current-generation AMD Milan processors.

BENEFITS

Slash Costs, Boost Efficiency for Large Language Models

Running Numenta models on Intel’s high-bandwidth CPUs enables unparalleled performance speedups for longer sequence length tasks, dramatically reducing the overall cost of running language models in production. This speedup allows customers to:

  • Process large documents with high sequence lengths at impressive speeds without sacrificing accuracy
  • Realize significant cost savings
  • Eliminate the need for resource-intensive GPUs
  • Unlock new NLP capabilities for diverse applications without breaking the bank

ADDITIONAL RESOURCES

Press Release: Numenta Achieves 123x Inference Performance Improvement for BERT Transformers on Intel Xeon Processor Family
Technical Blog: Numenta and Intel Accelerate Inference
Xeon Series Product Brief: Intel® Xeon® CPU Max Series Product Brief


Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
