Highly Recommended: Andrew Ng on Data-Centric AI

In building out Incantata’s “human+magic” coaching platform, we’ve run into the same problem that so many companies wishing to leverage AI confront: there just aren’t good tools and practices for managing a machine learning development lifecycle in a way that efficiently produces useful AI-driven features for the business and its customers.

There’s this concept of an SDLC (software development lifecycle), but as of yet there’s no widely accepted or widely used machine learning development lifecycle (MLDLC??). There’s an emerging awareness that we need something called MLOps: collaboration between data scientists and the engineers and operations people who productize their models. But the practice and tools are so nascent that it’s hard to say such a thing even exists right now.

I feel disenchanted with the state of our practice here, people!

It sounds like Ng does as well.

I encourage you to watch the video yourself, but if you don’t have time, here are some highlights (cribbed from Incantata cofounder and DevOps expert John’s notes):

  • Data is food for AI, don’t feed it junk!
  • 80% of the time on an ML/AI project is spent preparing the data, and 20% on modeling. Despite this reality, 99% of AI research focuses on model-centric approaches to improving results.
  • The most important task of MLOps is to ensure consistently high-quality data in all phases of the ML project lifecycle.
  • Especially for small datasets (<10,000 observations), cleaning up labels (making them more consistent) is a more efficient way of improving accuracy than collecting more data.
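To make that last bullet concrete, here’s a minimal sketch of one way to spot inconsistent labels. It is not from Ng’s talk or from John’s notes. It assumes a dataset of (text, label) pairs and simply flags inputs that show up more than once with different labels. The function name `find_inconsistent_labels` and the sample data are hypothetical, made up for illustration.

```python
from collections import Counter, defaultdict


def find_inconsistent_labels(examples):
    """Return inputs that appear with more than one distinct label.

    `examples` is an iterable of (text, label) pairs. Texts are compared
    after lowercasing and collapsing whitespace, so trivially different
    copies of the same input land in the same group.
    """
    labels_by_input = defaultdict(Counter)
    for text, label in examples:
        key = " ".join(text.lower().split())
        labels_by_input[key][label] += 1
    # Keep only the inputs whose copies disagree on the label.
    return {
        key: dict(counts)
        for key, counts in labels_by_input.items()
        if len(counts) > 1
    }


# Hypothetical sample data, invented for this sketch.
examples = [
    ("How do I reset my password?", "account"),
    ("how do I reset my  password?", "billing"),
    ("Where is my invoice?", "billing"),
    ("Where is my invoice?", "billing"),
]

conflicts = find_inconsistent_labels(examples)
for text, counts in conflicts.items():
    print(text, counts)
```

Each flagged input is a candidate for relabeling or for a clearer labeling guideline. Real datasets rarely contain exact duplicates, so in practice you would probably need a fuzzier notion of "same input" than this one.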

Ng’s main theme in this video is refocusing the industry on data improvements rather than model improvements, which I think is important. A secondary point he makes is the need for standard MLOps practices and off-the-shelf tools that support a repeatable and efficient process for AI/ML development.

To make your head spin a little, check out this table (also from John’s notes):

Whaddaya think? Will data scientists go the way of computer scientists? Are ML Engineers all we need? What will an MLDLC look like?