Chang She, previously the VP of engineering at Tubi and a Cloudera veteran, has years of experience building data tooling and infrastructure. But when She began working in the AI space, he quickly ran into problems with traditional data infrastructure, problems that prevented him from bringing AI models into production.
“Machine learning engineers and AI researchers are sometimes stuck with a subpar development experience,” She told TechCrunch in an interview. “Data infra firms don’t really understand the issue for machine learning data at a fundamental level.”
So Chang, who's one of the co-creators of Pandas, the wildly popular Python data science library, teamed up with software engineer Lei Xu to co-launch LanceDB.
LanceDB is building the eponymous open source database software, LanceDB, which is designed to support multimodal AI models, meaning models that train on and generate images, videos and more in addition to text. Backed by Y Combinator, LanceDB this month raised $8 million in a seed funding round led by CRV, Essence VC and Swift Ventures, bringing its total raised to $11 million.
“If multimodal AI is critical to the long-term success of your organization, you want your very expensive AI team to focus on the model and bridging the AI with business value,” Chang said. “Unfortunately, today, AI teams are spending most of their time dealing with low-level data infrastructure details. LanceDB provides the foundation AI teams need so that they can be free to focus on what really matters for enterprise value and bring AI products to market much faster than otherwise possible.”
LanceDB is essentially a vector database: a database containing series of numbers (“vectors”) that encode the meaning of unstructured data (e.g. images, text and so on).
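To make the idea concrete, here is a minimal sketch of the kind of similarity lookup a vector database performs. It is not LanceDB's API or implementation: the toy three-number vectors, the labels and the helper names (`cosine_similarity`, `nearest`) are all invented for illustration, and real embeddings are produced by a model and typically have hundreds or thousands of dimensions.

```python
import math

# Toy "embeddings" with made-up values (assumption: purely illustrative).
# In practice a model generates these from images, text and so on.
VECTORS = {
    "photo of a cat": [0.9, 0.1, 0.0],
    "photo of a dog": [0.8, 0.3, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the labels of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda label: cosine_similarity(query, vectors[label]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embedding of a query such as "kitten".
query = [1.0, 0.1, 0.0]
print(nearest(query, VECTORS))
```

This sketch compares the query against every stored vector, which only works at toy scale. Production vector databases generally rely on approximate nearest-neighbor indexes to keep searches fast across very large collections.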
As my colleague Paul Sawers recently wrote, vector databases are having a moment as the AI hype cycle peaks. That’s because they’re useful for all manner of AI applications, from content recommendations in ecommerce and social media platforms to reducing hallucinations.
The vector database competition is fierce: see Qdrant, Vespa, Weaviate, Pinecone and Chroma, to name a few vendors (not counting the Big Tech incumbents). So what makes LanceDB unique? Greater flexibility, performance and scalability, according to Chang.
For one, Chang says, LanceDB, which is built on top of Apache Arrow, is powered by a custom data format, Lance Format, that’s optimized for multimodal AI training and analytics. Lance Format enables LanceDB to handle up to billions of vectors and petabytes of text, images and videos, and to allow engineers to manage various types of metadata associated with that data.
“Until now, there’s never been a system that can unify training, exploration, search and large-scale data processing,” Chang said. “Lance Format allows AI researchers and engineers to have a single source of truth and get lightning-fast performance across their entire AI pipeline. It’s not just about storing vectors.”
LanceDB makes money by selling fully managed versions of its open source software with added features such as hardware acceleration and governance controls, and business appears to be going strong. The company’s customer list includes text-to-image platform Midjourney, chatbot unicorn Character.ai, autonomous car startup WeRide and Airtable.
Chang insisted that LanceDB’s recent VC backing wouldn’t shift its attention away from the open source project, though, which he says is now seeing around 600,000 downloads monthly.
“We wanted to create something that would make it 10x easier for AI teams working with large-scale multimodal data,” he said. “LanceDB offers, and will continue to offer, a very rich set of ecosystem integrations to minimize adoption effort.”