Don’t forget evaluation, not just of your model but of your whole system

Much of the science in data science involves how you evaluate your models. Are they accurate? Do they balance precision and recall properly for your use case? How do they perform on different slices of your dataset?
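
For the model-level piece, a few lines of scikit-learn go a long way. A minimal sketch (the column names, data, and slice definition below are illustrative, not from any real project):

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# One row per example: true label, model prediction, and whatever
# metadata defines your slices (these names and values are made up).
df = pd.DataFrame({
    "y_true":  [1, 0, 1, 1, 0, 1, 0, 1],
    "y_pred":  [1, 0, 0, 1, 1, 1, 0, 1],
    "segment": ["new", "new", "new", "returning",
                "returning", "returning", "new", "returning"],
})

# Overall metrics; the right precision/recall balance depends on your use case.
print("precision:", precision_score(df.y_true, df.y_pred))
print("recall:   ", recall_score(df.y_true, df.y_pred))

# The same metrics per slice often tell a very different story.
for segment, grp in df.groupby("segment"):
    p = precision_score(grp.y_true, grp.y_pred, zero_division=0)
    r = recall_score(grp.y_true, grp.y_pred, zero_division=0)
    print(f"{segment:>10}: precision={p:.2f} recall={r:.2f}")
```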

More important than evaluating models in isolation: how well do the systems that rely on your machine learning or other models (including foundation models like ChatGPT) achieve your objectives? Do they help your internal staff execute with more efficiency? Do they engage your end users better? Do they lead to improved business outcomes?

It’s difficult to evaluate the output of GenAI models like GPT-4. It’s even more difficult to evaluate systems built around them. Lately, I see too little emphasis on evaluation.
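
One way to make system-level evaluation concrete: hold out a small ground-truth set of real tasks and score the whole pipeline end to end, not just the model. Here’s a sketch in Python; `answer_question` is a hypothetical stand-in for your entire system (retrieval, prompting, the model call, post-processing), and the keyword-based grading rule is deliberately crude:

```python
# Hypothetical end-to-end evaluation harness. `answer_question` is a
# placeholder for the whole system under test: retrieval, prompt
# construction, the model call, and any post-processing.
golden_set = [
    {"question": "When does the standard plan renew?",
     "must_mention": ["annually"]},
    {"question": "Can users export their data?",
     "must_mention": ["CSV", "export"]},
]

def answer_question(question: str) -> str:
    """Stand-in for the real pipeline; wire up your actual system here."""
    raise NotImplementedError

def passes(answer: str, must_mention: list[str]) -> bool:
    # Crude keyword grading. In practice you might use exact match,
    # rubric scoring, or an LLM judge, each with its own failure modes.
    return all(term.lower() in answer.lower() for term in must_mention)

def evaluate() -> float:
    """Fraction of golden-set questions the whole system answers acceptably."""
    results = [passes(answer_question(ex["question"]), ex["must_mention"])
               for ex in golden_set]
    return sum(results) / len(results)
```

Even a crude harness like this turns “does the system work?” into a number you can track as you change prompts, retrieval, or models.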

I enjoyed the conversation that Alex Ratner, founder of Snorkel AI, had with Douwe Kiela, a co-author of the original paper on retrieval-augmented generation (RAG). Ratner asked Kiela, “What are some of the challenges in the enterprise that you think the academic or open-source community underestimates?”

Kiela responded:

I would say evaluation. I still think the field is in a complete evaluation crisis.

[What’s] really missing… is systems thinking. That’s by design sometimes, but right now we’re getting to a point where you can evaluate either the language model itself in a very isolated setting, or you can evaluate the whole system that is built around that language model. And those are very different things and you cannot compare them.

https://snorkel.ai/retrieval-augmented-generation-s-rag-a-conversation-with-its-creator/

Further:

[Data] is the real gold here. The architecture is generalized. Everybody uses the same kind of thing with a couple of tweaks. Compute is currently a scarce resource, but that’s going to become less scarce over time. So, it really is all about the data.

Maybe this is starting to change now, but for a long time, both in industry and academia, people didn’t have enough respect for data and how important it is and how much you can gain from thinking about the data. For example, by doing weakly supervised learning, or trying to understand your weaknesses and patching up those weaknesses, you can close the additional 10 percent you need to get from your cute demo to a production deployment.

https://snorkel.ai/retrieval-augmented-generation-s-rag-a-conversation-with-its-creator/
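
To make the weak-supervision point concrete, here’s a toy version of the idea: several noisy, hand-written labeling functions vote on each example, and the majority vote becomes a training label. (Snorkel’s actual approach learns each function’s accuracy rather than counting votes; the rules below are purely illustrative.)

```python
# Toy weak supervision: noisy labeling functions vote on each example,
# and the majority becomes a training label. Snorkel replaces this
# majority vote with a learned model of each function's accuracy.
ABSTAIN, NEG, POS = -1, 0, 1

def lf_contains_refund(text):   # illustrative rule
    return POS if "refund" in text.lower() else ABSTAIN

def lf_contains_thanks(text):   # illustrative rule
    return NEG if "thanks" in text.lower() else ABSTAIN

def lf_exclamation(text):       # illustrative rule
    return POS if "!" in text else ABSTAIN

LFS = [lf_contains_refund, lf_contains_thanks, lf_exclamation]

def weak_label(text):
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # majority vote

print(weak_label("I want a refund!"))   # -> 1 (POS)
print(weak_label("Thanks, all good."))  # -> 0 (NEG)
```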

Getting a data science team to spend time looking at data, building ground-truth datasets, carefully evaluating what they’re building (at both the model and the system level), and going back to collect more data… I’ve found that hard.

The AI/ML space right now reminds me a little of economics, where the practitioners get fixated on fancy math and models rather than on what works and what matters.

I’m not saying the fancy math and models and GPUs don’t matter: they do! But for rank-and-file data science teams in smallish-to-medium-sized businesses they matter far less than getting the data and evaluation right.
