Rethinking AI Evaluation for the Social Sector
Launching the Living Playbook for AI Evaluation in the Social Sector
There’s no doubt that generative AI (GenAI) technology holds significant potential for the social sector. The real challenge, however, is this: how do you build an AI product that is both technically effective and socially impactful?
This may sound daunting, especially for nonprofits at the start of their technology journey. Yet as OpenAI co-founder Greg Brockman puts it, “evals are surprisingly often all you need.”
But in the context of AI for social impact, what exactly does evaluation mean?
As our partners have started building GenAI tools for social impact, we’ve noticed that different stakeholders answer this question in very different ways. For engineers like Brockman, evaluations are rapid, benchmark-driven tests of large language models. For our colleagues in development economics, by contrast, evaluations typically mean painstaking studies like randomized controlled trials (RCTs).
Both approaches are valuable, but it’s increasingly clear that neither is enough on its own. In the social sector, evaluation can’t stop at measuring a model’s accuracy – it has to answer a bigger question: do GenAI products or services lead to positive, measurable change in people’s lives at scale? At the same time, evaluation shouldn’t be a one-off exercise to check whether a program “works.”
With the rise of GenAI, we believe a paradigm shift is needed for evaluation in the social sector. We see the future of evaluation as a rapid and ongoing cycle of deployment, adaptation, and improvement – an approach where every evaluation is both a milestone and a lesson.
To help the social sector adapt to this new paradigm, we’re releasing a Living Playbook for AI Evaluation in the Social Sector. It draws on lessons from our work with nonprofits, funders, policymakers, and AI practitioners – including through our AI for Global Development (AI4GD) accelerator with OpenAI and the Center for Global Development (CGD). We hope it serves as a practical, evolving guide for teams building and deploying GenAI tools in sectors like education, health, agriculture, livelihoods, and more.
A Four-Level Framework for AI Evaluation
The playbook introduces a four-level framework for AI evaluation in the social sector, building on our blog post with CGD and J-PAL. While the four levels are iterative, they follow a natural progression:
Level 1 – Model evaluation: Does the AI model produce the desired responses?
Level 2 – Product evaluation: Does the product facilitate meaningful interactions?
Level 3 – User evaluation: Does the product positively support users’ thoughts, feelings, and actions?
Level 4 – Impact evaluation: Does access to the product improve human development outcomes?
Each level addresses a different aspect of AI evaluation, engaging distinct tasks and stakeholders (see figure). In our work with eight nonprofits through the AI4GD accelerator, we’re already seeing how this approach helps teams evaluate their AI systems – by focusing on what’s both technically possible and socially valuable.
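To make Level 1 concrete, here’s a minimal sketch of what a model evaluation loop might look like in code. Everything in it – the test cases, the stand-in `call_model` client, and the substring grader – is a hypothetical placeholder rather than part of the playbook itself:

```python
# A minimal sketch of a Level 1 (model) evaluation loop.
# The test cases and the stand-in model client are hypothetical;
# in practice you would wire in your own model API and grading rubric.

from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str        # what the user asks
    must_include: str  # a fact a good response should contain

# Tiny illustrative test set; a real suite would hold hundreds of cases.
CASES = [
    EvalCase("When should a child get the first measles vaccine dose?", "9 months"),
    EvalCase("How soon after planting should maize be weeded?", "week"),
]

def call_model(prompt: str) -> str:
    """Stand-in for a real model client; replace with your API call."""
    return "The first measles dose is usually given at 9 months of age."

def run_model_eval(cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose response contains the expected fact."""
    passed = sum(
        case.must_include.lower() in call_model(case.prompt).lower()
        for case in cases
    )
    return passed / len(cases)

print(f"Level 1 pass rate: {run_model_eval(CASES):.0%}")
```

Substring checks are the crudest possible grader – real suites typically use rubric-based or model-assisted grading – but the shape of the loop stays the same as evaluations scale up.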
What You’ll Find in the Playbook
We have aimed to balance top-down guidance with bottom-up perspectives, including practical insights from the organizations in our accelerator. The playbook reflects what we’ve learned so far, including:
Actionable methods for AI evaluation across all four levels (model, product, user, and impact)
Step-by-step guides for building out your digital infrastructure, from user funnels (mapping how users move through your product) to ETL pipelines (systems for extracting, transforming, and loading data; see the sketch after this list)
Practical principles for cross-functional collaboration
Repeatable motions to make this evaluation approach an integral part of product development
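As one small illustration of the kind of infrastructure the playbook walks through, here’s a minimal ETL sketch that turns raw chat logs into funnel rows ready for analysis. The file format, event names, and funnel stages are all invented for this example:

```python
# A minimal, hypothetical ETL sketch: raw interaction logs in,
# analysis-ready funnel rows out. Paths and field names are invented.

import json
import sqlite3

def extract(path: str) -> list[dict]:
    """Read raw interaction events, one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def transform(events: list[dict]) -> list[tuple]:
    """Map raw events onto simple per-user funnel stages."""
    stage_of = {"opened_chat": "reached", "sent_message": "engaged",
                "completed_session": "retained"}
    return [(e["user_id"], stage_of[e["event"]], e["timestamp"])
            for e in events if e["event"] in stage_of]

def load(rows: list[tuple], db_path: str = "funnel.db") -> None:
    """Write funnel rows into a local table for downstream analysis."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS funnel "
                "(user_id TEXT, stage TEXT, ts TEXT)")
    con.executemany("INSERT INTO funnel VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("interactions.jsonl")))
```

Real pipelines add scheduling, validation, and a proper warehouse, but the extract–transform–load shape is the same.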
The playbook is technical but intentionally high-level – and we’re also building practical tools to help social sector organizations implement this approach. We recently released Evidential, an experiment-automation platform co-developed with Rocket Learning and IDinsight. We’re also developing a unified AI evaluation platform with Project Tech4Dev, which will help teams log changes, monitor performance, and make evidence-based decisions about AI product management.
Why It Matters
The social sector has a chance to lead the development of responsible AI. But this requires aligning diverse stakeholders – from engineers, product managers, and behavioral researchers to funders and policymakers – around a shared evaluation language. This living playbook is our first contribution to that effort.
We’re launching it now – not because it’s finished, but because we’re confident that it’s already working. It’s helping our nonprofit partners design better tools, ask sharper questions, and drive measurable impact. We’ll continue developing the playbook with input from our partners in global development, tech, and academia – we look forward to sharing with, and learning from, this growing community.
Resources
📘 Explore the Living Playbook on our website or download the [PDF].
Share your thoughts: We’d love your feedback through this Feedback Form – or nominate yourself as a contributor if you’re interested in collaborating and open-sourcing your knowledge and work on AI evaluation.

Great post -- I agree with the eval framework levels: Models, Products, Users, Impacts.
What I'll add is that each successive level of eval is orders of magnitude harder and more expensive to measure. Anyone can (and should!) continuously evaluate model outputs for their use cases. On the other end of the spectrum, end impacts on users are necessarily noisier and much further downstream. So you need high volumes to get your signal-to-noise ratio right, and you need to collect data on outcomes relevant to your context, like blood pressure control for a health product.
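To put rough numbers on that: a standard back-of-envelope rule for a two-arm comparison at 80% power and 5% significance is roughly 16 / (effect size in SD units)^2 users per arm. The effect sizes below are made up, but the shape of the math is the point:

```python
# Back-of-envelope sample size per arm for a two-arm comparison,
# using the common approximation n ~= 16 / d^2 for effect size d
# in standard-deviation units (80% power, 5% two-sided significance).

for effect_sd in (0.5, 0.2, 0.05):
    n_per_arm = 16 / effect_sd ** 2
    print(f"effect of {effect_sd} SD -> ~{n_per_arm:,.0f} users per arm")
```

Small downstream effects -- the kind you'd realistically expect on an outcome like blood pressure control -- quickly push you into thousands of users.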
The other thing to keep in mind is that this field is moving quickly, so orgs should be highly agile in their approaches here -- constantly refining prompts as well as input and context data. Because those changes are constantly happening, you need to continuously re-evaluate at the various levels. If you've changed a prompt to solve a downstream problem, you need to re-run evaluations at the model output level to make sure you haven't introduced new performance problems.
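Concretely, something like this -- a model-level regression gate on every prompt change (the baseline, tolerance, and function here are all made up):

```python
# Hypothetical sketch: gate every prompt change on a model-level
# regression check before it ships. Numbers are invented.

BASELINE_PASS_RATE = 0.92  # pass rate recorded for the current prompt
TOLERANCE = 0.02           # regression we're willing to tolerate

def safe_to_ship(candidate_pass_rate: float) -> bool:
    """Allow the prompt change only if model-level evals haven't regressed."""
    return candidate_pass_rate >= BASELINE_PASS_RATE - TOLERANCE

# e.g., after re-running the eval suite with the candidate prompt:
if safe_to_ship(0.88):
    print("No model-level regression -- OK to ship.")
else:
    print("Prompt change regressed model-level evals -- fix first.")
```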
Continuous eval across multiple levels is needed, but resource-intensive to do.