Discussion about this post

Rob:
Great post -- I agree with the eval framework levels: Models, Products, Users, Impacts.

What I'll add is that each successive level of eval is orders of magnitude harder and more expensive to measure. Anyone can (and should!) continuously evaluate model outputs for their use cases. On the other end of the spectrum, end impacts on users necessarily have more noise and are much further downstream. So you need high volumes to get your signal-to-noise ratio right, and you need to collect data on relevant outcomes (blood pressure control, for example) depending on the context.

The other thing to keep in mind is that this field is moving so quickly that orgs should be highly agile in their approaches, constantly refining prompts as well as input and context data. Because those changes are constantly happening, you need to continuously re-evaluate at every level. If you've changed a prompt to solve a downstream problem, you need to re-run evaluations at the model output level to make sure you haven't introduced new performance regressions.
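The re-evaluation loop described above can be sketched as a simple regression gate: score every prompt change against a fixed test set and refuse the change if model-level scores drop. Everything here (the scorer, the prompts, the test cases) is a hypothetical illustration, not any particular eval framework:

```python
# Hypothetical sketch of a regression-style eval gate for prompt changes.
# score_fn, PROMPT_V1/V2, and CASES are made-up stand-ins for illustration.

def run_eval(prompt, cases, score_fn):
    """Average a per-case score for one prompt over a fixed test set."""
    scores = [score_fn(prompt, case) for case in cases]
    return sum(scores) / len(scores)

def score_fn(prompt, case):
    # Toy scorer: in practice this would call the model and grade its
    # output (exact match, rubric, or an LLM judge).
    return 1.0 if case["expected_keyword"] in prompt else 0.0

CASES = [
    {"expected_keyword": "blood pressure"},
    {"expected_keyword": "dosage"},
]

PROMPT_V1 = "Summarize the patient's blood pressure readings."
PROMPT_V2 = "Summarize the patient's blood pressure readings and dosage history."

baseline = run_eval(PROMPT_V1, CASES, score_fn)
candidate = run_eval(PROMPT_V2, CASES, score_fn)

# Gate the change: the new prompt must not regress at the model level.
assert candidate >= baseline, "prompt change regressed model-level evals"
```

The point is not the toy scorer but the loop: every prompt or context change triggers a fresh run over the same cases, so downstream fixes can't silently break model-level performance.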

Continuous eval across multiple levels is necessary, but resource-intensive to do.
