Cleanlab Raises $25 Million To Help Solve AI Models’ Data Mess

Cleanlab founders Curtis Northcutt, Anish Athalye and Jonas Mueller are hoping to solve the data problem of “garbage in, garbage out.”

The startup based on a popular open-source project for fixing data problems in AI models now counts cloud heavyweight Databricks as an investor and partner.

When OpenAI’s ChatGPT adds chocolate strawberry Cheerios to a tofu recipe, or Amazon’s Alexa declares the 2020 election was stolen from Donald Trump, the same thing is happening to two very different kinds of chatbot: under the hood, there’s a flawed data set, rife with duplicate, incorrect or misleading data points.

To an alert user who spots them, such mistakes, known as hallucinations, can look random. But there’s a simple-sounding computer science principle behind them: “garbage in, garbage out.” Feed every picture of a banana on the internet into an AI model, and it won’t innately know if you also included a photo of Curious George; filtering that out is typically the job of labeling software and human contractors. But at a big enough scale, it’s almost inevitable that something will slip through, and the model generates an image of a fruit with a tail.

Enter Cleanlab, a two-year-old startup cofounded by three MIT PhDs that offers software it claims can automatically fix the mess. Throw a raw, unlabeled data set at the product and it will automatically label up to 90% of it on a first pass, CEO and cofounder Curtis Northcutt told Forbes. Labeled or not, Cleanlab also flags the data points and labels it thinks are most likely to be duplicates or errors, helping users scrub the data faster and more cheaply for a more accurate end result.

“The reality is that every single solution that’s data-driven — and the world has never been more data-driven — is going to be affected by the quality of the data,” said Northcutt, who ran into the problem in stints at Amazon, Google, Meta and Microsoft. “It was ridiculous that there was no solution for this, no company filling the gap.”

A free, open-source version of Cleanlab’s software has been available since 2017; teams from the likes of Chase, Google and Tesla count among its users to date. Northcutt and cofounders Jonas Mueller and Anish Athalye only announced their paid, enterprise version, Cleanlab Studio, in July. Now, Cleanlab has raised another $25 million in a red-hot funding round that had at least one VC camped out at coffee shops near Northcutt’s San Francisco house in an unsuccessful last-ditch bid to get in on the deal. Menlo Ventures and TQ Ventures co-led the Series A, which valued Cleanlab at $100 million.

Joining the round — and partnering with fledgling Cleanlab — is Databricks, the $43 billion-valued No. 2 on Forbes’ Cloud 100 list that provides data infrastructure to large corporations like AT&T and Toyota. A Databricks test earlier this year that used Cleanlab to fine-tune an OpenAI Davinci model made available by API found that the process reduced errors by 37% and increased test accuracy from 65% to 78% overall, without any additional resources.
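The two figures from the Databricks test appear to describe the same change from different angles. As a rough check, and assuming "errors" here means the test error rate (one minus accuracy), which the article does not spell out, the arithmetic can be sketched like this:

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original errors eliminated when accuracy rises.

    Assumes error rate = 1 - accuracy; this is our reading of the reported
    numbers, not a formula published by Databricks or Cleanlab.
    """
    err_before = 1.0 - acc_before  # 35% of test examples wrong before
    err_after = 1.0 - acc_after    # 22% of test examples wrong after
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(0.65, 0.78)
print(f"{reduction:.0%}")  # prints "37%"
```

Under that assumption, a 13-point gain in accuracy removes a little over a third of the mistakes the model was making, which matches the 37% figure.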


Cleanlab is a young startup, but its underpinnings date back to 2013, when Northcutt, who comes from three generations of mailmen in rural Kentucky, graduated from Vanderbilt and began a PhD program in computer science at MIT. While there, he built a cheating detection system for validating online course certificates used by the university and Harvard. Working under adviser Isaac Chuang, a leading quantum science researcher, Northcutt won a prestigious thesis award for his research on “confident learning,” his term for a method of removing label errors from machine learning data.
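For readers curious what catching a label error can look like in practice, here is a heavily simplified sketch in the spirit of confident learning. It is not Cleanlab's implementation: the function names, the toy data and the thresholding rule are illustrative assumptions, and a real pipeline would use out-of-sample predictions from a trained model.

```python
from typing import List


def class_thresholds(labels: List[int], pred_probs: List[List[float]]) -> List[float]:
    """Per-class threshold: the model's average confidence in class k
    over the examples that were labeled k."""
    n_classes = len(pred_probs[0])
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # A class with no labeled examples can never be "confident".
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else float("inf"))
    return thresholds


def find_suspect_labels(labels: List[int], pred_probs: List[List[float]]) -> List[int]:
    """Return indices whose given label disagrees with the class the
    model is confident about (hypothetical helper, for illustration)."""
    thresholds = class_thresholds(labels, pred_probs)
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k, t in enumerate(thresholds) if p[k] >= t]
        if not confident:
            continue  # model is not confident about any class; leave it alone
        best = max(confident, key=lambda k: p[k])
        if best != y:
            suspects.append(i)
    return suspects


# Toy data: class 0 = "banana", class 1 = "monkey".
# The last photo is labeled banana, but the model strongly sees a monkey.
labels = [0, 0, 0, 1, 1, 0]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.10, 0.90],
]
suspects = find_suspect_labels(labels, pred_probs)
print(suspects)  # prints [5]
```

The idea is that each class gets its own confidence bar, so a model that is generally less sure about one category does not cause every example in it to be flagged. The flagged photo is the Curious George in the banana pile.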

During a summer gig at Yann LeCun’s Facebook AI research group in 2016, Northcutt grew fed up with what he saw as human data errors compromising Facebook’s massive data sets. He reached out to two other MIT PhDs (Mueller, who helped build Amazon’s AutoML tools, and Athalye, a computer science researcher whose work has been starred 30,000 times on GitHub) to build cleanlab, an open-source tool that automatically catches labeling errors in such data. Northcutt then incorporated the tool into his research.

Northcutt continued testing the cleanlab software during stints at Amazon and Google, where he worked on machine learning projects to improve Alexa’s and Google Home’s abilities to detect and wake up to voice commands (the devices, in part due to imperfect training data, weren’t always detecting their wakeup prompts). After cofounding and briefly working at a sales AI startup as its chief technology officer, Northcutt reunited with Mueller and Athalye in 2021 to work on Cleanlab full-time. Armed with a $5 million seed round led by Bain Capital Ventures, they kept mostly quiet until July 2023, when they announced their enterprise product, Cleanlab Studio, to the world.

While teams at big companies like Chase and Tesla have used the open-source version, cleanlab, for years, Cleanlab’s paying customers are much newer. One tech giant that Northcutt said he couldn’t disclose is already paying $600,000 per year to improve its data for both its core product analytics and its AI models, the CEO claimed. Consulting firm Berkeley Research Group saved a legal client about $30 million in costs by using Cleanlab Studio to automatically improve legal document data and models trained on that data for discovery and marking privileged documents, Northcutt said. Popular AI unicorn Hugging Face, which helps users host, train and deploy models, has signed up for both the paid and open-source versions, he added. (Cleanlab later clarified that it had not closed an enterprise deal with the company yet.)

Cleanlab is far from the only startup to promise data salvation for companies looking to build or make use of AI tools. Scale AI reached a $7.3 billion valuation by offering companies like OpenAI data labeling services that mix automation with low-wage human labor in the developing world. Snorkel AI topped a $1 billion valuation in 2021 for its own automated labeling tools. And Dataiku, which offers its version of data preparation software, raised $200 million at a downsized $3.7 billion valuation last December.

Investors Matt Murphy and Schuster Tanger, who co-led the Cleanlab round and joined its board of directors, argued that the new entrant is “much more than a labeling company,” as Tanger put it. Cleanlab can do much of what a labeler does, they argued, but not the other way around. And tests like Databricks’ demonstrate that Cleanlab can make models more valuable after their release, not just during their training, Murphy added: “People will have more confidence in these models because [Cleanlab] can measure an output, too.”

Of course, Northcutt and the Cleanlab team will need to convince businesses that they can’t get those improvements simply by using the free version of the software, even as they contend with a well-funded field of infrastructure competitors that will likely look to move in on their turf. (Another reason to count Databricks as an ally.)

Northcutt’s playing a longer-term game. He’s already working on ways Cleanlab can make tiny, open-source models hold their own against the larger ones maintained by AI’s big incumbents. And he’s thinking about what models might come after the LLM wave has crested.

“The biggest barrier to innovation right now for self-driving cars, enterprise adoption of generative AI and real-time analytics is the lack of curated and accurate data,” Northcutt said. “No matter what model comes out in the future, it will depend on data, and Cleanlab will be there.”


Source: Alex Konrad/Forbes
