Breaking through the latency barrier in Conversational AI


Meeting industry latency requirements while reducing costs

Conversational AI, the set of technologies that let consumers interact with computers in a human-like way, is projected to be a $40 billion industry by 2030. As the industry grows, so does the demand for real-time AI. While large language models built on Transformers have become fundamental to many natural language processing applications, their complexity and size create considerable latency bottlenecks. As a result, deploying these models for real-time applications is challenging and costly.


Numenta Optimized Inference

With unique acceleration techniques built on neuroscience insights, our optimizations deliver high inference throughput at ultra-low latencies on off-the-shelf CPUs. Customers can integrate and deploy our technology on conventional hardware.


Achieving <3ms latency on CPUs

Time-sensitive applications, like virtual assistants and online chats, require ultra-low latencies, typically under 10ms. When we optimize for latency, our BERT-Base model is well within the target, achieving 2.5ms latency. This is more than 5x faster than the standard BERT-Base model running on Microsoft's hand-optimized ONNX Runtime.

Benchmark configuration: BERT-Base on an Intel Xeon server (AWS m6i.4xlarge), sequence length 64, batch size 1, 1 socket, 4 cores.

In this example, although we are optimizing for latency, we still get a 5.88x throughput speed-up, as shown in the chart below.

Benchmark configuration: BERT-Base on an Intel Xeon server (AWS m6i.4xlarge), sequence length 64, batch size 1, 1 socket, 4 cores.
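
For readers who want to check latency and throughput figures like these on their own hardware, the following is a minimal timing sketch, not the benchmark harness used for the results above. It assumes requests are served one at a time, so throughput is simply the batch size divided by the per-batch latency. The names `measure_latency_ms`, `throughput_per_second`, and `dummy_infer` are hypothetical; `dummy_infer` is a stand-in that should be replaced with a real model call.

```python
import statistics
import time


def measure_latency_ms(infer, warmup=10, runs=100):
    """Return per-call latencies in milliseconds for a zero-argument callable."""
    # Warm-up calls are discarded so one-time setup costs do not skew the results.
    for _ in range(warmup):
        infer()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def throughput_per_second(latency_ms, batch_size=1):
    """Throughput implied by a per-batch latency, assuming sequential requests."""
    return batch_size * 1000.0 / latency_ms


def dummy_infer():
    # Hypothetical placeholder workload; swap in a real inference call.
    sum(i * i for i in range(1000))


latencies = measure_latency_ms(dummy_infer)
median_ms = statistics.median(latencies)
print(f"median latency: {median_ms:.3f} ms")
print(f"implied throughput: {throughput_per_second(median_ms):.0f} inferences/s")

# Under the sequential assumption, a 2.5 ms latency at batch size 1
# corresponds to 400 inferences per second.
print(throughput_per_second(2.5))  # 400.0
```

Reporting the median rather than the mean keeps occasional scheduling stalls from dominating the result. Latency-sensitive deployments often track tail percentiles as well.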


Turning Transformers into an attractive real-time AI solution

Our sub-3ms latency results open new possibilities for companies with time-sensitive AI applications, which can finally deploy Transformer models in production. These results:

  • Reduce the cost and complexity of Transformers
  • Make Transformers a high-performance, cost-effective solution
  • Enable new real-time AI and NLP applications

