Cerebras Systems throws down gauntlet to Nvidia with launch of ‘world’s fastest’ AI inference service


Ambitious artificial intelligence computing startup Cerebras Systems Inc. is raising the stakes in its battle against Nvidia Corp., launching what it says is the world's fastest AI inference service, available now in the cloud.

AI inference refers to the process of running live data through a trained AI model to make a prediction or solve a task. Inference services are the workhorse of the AI industry, and according to Cerebras, they are the fastest-growing segment too, accounting for about 40% of all AI workloads in the cloud today.

However, existing AI inference services don't appear to satisfy the needs of every customer. "We're seeing all sorts of interest in how to get inference done faster and for less money," Chief Executive Andrew Feldman told a gathering of reporters in San Francisco Monday.

The company intends to deliver on this with its new "high-speed inference" service. It believes the launch is a watershed moment for the AI industry, saying the 1,000-tokens-per-second speeds it can deliver are comparable to the introduction of broadband internet, enabling game-changing new opportunities for AI applications.

Raw power

Cerebras is well-equipped to offer such a service. The company is a maker of specialized and powerful computer chips for AI and high-performance computing, or HPC, workloads. It has made a number of headlines over the past year, claiming that its chips are not only more powerful than Nvidia's graphics processing units but also more cost-effective. "This is GPU-impossible performance," declared co-founder and Chief Technology Officer Sean Lie.

Its flagship product is the new WSE-3 processor (pictured), which was announced in March and builds on its earlier WSE-2 chipset that debuted in 2021. It's built on an advanced five-nanometer process and features 1.4 trillion more transistors than its predecessor chip, with more than 900,000 compute cores and 44 gigabytes of onboard static random-access memory. According to the startup, the WSE-3 has 52 times more cores than a single Nvidia H100 graphics processing unit.

The chip is available as part of a data center appliance called the CS-3, which is about the same size as a small refrigerator. The chip itself is about the size of a pizza, and comes with integrated cooling and power delivery modules. In terms of performance, the Cerebras WSE-3 is said to be twice as powerful as the WSE-2, capable of hitting a peak speed of 125 petaflops, with 1 petaflop equal to 1,000 trillion computations per second.

The Cerebras CS-3 system is the engine that powers the new Cerebras Inference service, and it notably features 7,000 times more memory bandwidth than the Nvidia H100 GPU, addressing one of generative AI's fundamental technical challenges: the need for more memory bandwidth.

Impressive speeds at lower cost

It solves that challenge in style. The Cerebras Inference service is said to be lightning-fast, up to 20 times faster than comparable cloud-based inference services that use Nvidia's most powerful GPUs. According to Cerebras, it delivers 1,800 tokens per second for the open-source Llama 3.1 8B model and 450 tokens per second for Llama 3.1 70B.

It's competitively priced too, with the startup saying the service starts at just 10 cents per million tokens, which it says equates to 100 times better price-performance for AI inference workloads.

The company adds that the Cerebras Inference service is especially well-suited for "agentic AI" workloads, or AI agents that can perform tasks on behalf of users, since such applications need the ability to repeatedly prompt their underlying models.
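The following is a minimal sketch, not Cerebras code, of why per-call generation speed compounds in agentic workloads: each sequential step waits on the previous one, so total latency scales with the number of calls. The throughput figures are the company's published Llama 3.1 70B claims; the step count and token budget are hypothetical.

```python
# Hypothetical agentic workflow: STEPS sequential LLM calls, each producing
# TOKENS_PER_STEP output tokens before the next step can begin.
STEPS = 10                # assumed number of chained agent steps
TOKENS_PER_STEP = 500     # assumed output tokens per call

def total_generation_seconds(tokens_per_second: float) -> float:
    """Time spent purely on token generation across all sequential steps."""
    return STEPS * TOKENS_PER_STEP / tokens_per_second

for label, tps in [("Cerebras Inference, Llama 3.1 70B (claimed)", 450.0),
                   ("Service ~20x slower (per Cerebras' comparison)", 22.5)]:
    print(f"{label}: {total_generation_seconds(tps):.1f} s of generation time")
```

Under these assumptions, the same ten-step workflow spends about 11 seconds generating tokens on the claimed Cerebras speeds versus roughly 220 seconds on a service 20 times slower, which is why the company pitches the service at agent builders.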

Micah Hill-Smith, co-founder and chief executive of the independent AI model evaluation company Artificial Analysis Inc., said his team has verified that Llama 3.1 8B and 70B running on Cerebras Inference achieve "quality evaluation results" consistent with native 16-bit precision per Meta's official versions.

“With speeds that push the performance frontier and competitive pricing, Cerebras Inference is especially compelling for developers of AI applications with real-time or high-volume requirements,” he said.

Tiered access

Customers can choose to access the Cerebras Inference service via three available tiers, including a free offering that provides application programming interface-based access and generous usage limits for anyone who wants to experiment with the platform.

The Developer Tier is for flexible, serverless deployments. It's accessed via an API endpoint that the company says costs a fraction of the price of other services available today. For instance, Llama 3.1 8B is priced at just 10 cents per million tokens, while Llama 3.1 70B costs 60 cents. Support for additional models is on the way, the company said.
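To put those per-million-token prices in context, here is a back-of-the-envelope cost sketch. The workload size (1,000 requests at 2,000 tokens each) and the model keys are illustrative assumptions, not Cerebras-defined identifiers; only the prices come from the announcement.

```python
# Announced Developer Tier prices in USD per million tokens.
PRICE_PER_MILLION = {"Llama 3.1 8B": 0.10, "Llama 3.1 70B": 0.60}

# Hypothetical workload: 1,000 requests averaging 2,000 tokens each.
requests, tokens_per_request = 1_000, 2_000
total_tokens = requests * tokens_per_request  # 2 million tokens

for model, price in PRICE_PER_MILLION.items():
    cost = total_tokens / 1_000_000 * price
    print(f"{model}: {total_tokens:,} tokens -> ${cost:.2f}")
```

Two million tokens would cost about 20 cents on the 8B model and $1.20 on the 70B model at the announced rates.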

There's also an Enterprise Tier, which offers fine-tuned models and custom service-level agreements with dedicated support. It's aimed at sustained workloads, and it can be accessed via a Cerebras-managed private cloud or implemented on-premises. Cerebras isn't revealing the cost of this tier, but says pricing is available on request.

Cerebras claims an impressive list of early-access customers, including organizations such as GlaxoSmithKline Plc, the AI search engine startup Perplexity AI Inc. and the networking analytics software provider Meter Inc.

Dr. Andrew Ng, founder of DeepLearning AI Inc., another early adopter, explained that his company has developed multiple agentic AI workflows that require prompting an LLM repeatedly to obtain a result. "Cerebras has built an impressively fast inference capability that will be very helpful for such workloads," he said.

Cerebras' ambitions don't end there. Feldman said the company is "engaged with multiple hyperscalers" about offering its capabilities on their cloud services. "We aspire to have them as customers," he said, as well as AI specialty providers such as CoreWeave Inc. and Lambda Inc.

Besides the inference service, Cerebras also announced a number of strategic partnerships to provide its customers with access to all of the specialized tools required to accelerate AI development. Its partners include the likes of LangChain, LlamaIndex, Docker Inc., Weights & Biases Inc. and AgentOps Inc.

Cerebras said its Inference API is fully compatible with OpenAI's Chat Completions API, which means existing applications can be migrated to its platform with just a few lines of code.
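In practice, compatibility with the Chat Completions API usually means pointing an existing OpenAI-style client at a different endpoint. Below is a minimal migration sketch using the standard OpenAI Python SDK; the base URL, environment variable name and model identifier are placeholder assumptions, not confirmed Cerebras values.

```python
import os
from openai import OpenAI  # standard OpenAI Python SDK

# Swap the base URL and key; the rest of the application code stays the same.
client = OpenAI(
    base_url="https://api.cerebras.example/v1",  # hypothetical endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],      # assumed environment variable
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # placeholder model identifier
    messages=[{"role": "user",
               "content": "Summarize wafer-scale inference in one sentence."}],
)
print(response.choices[0].message.content)
```

The key design point is that only the client construction changes; prompts, message formats and response handling written against the Chat Completions API carry over unchanged.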

With reporting by Robert Hof

Photo: Cerebras Systems
