As AI features and agents become deeply embedded in products, evaluating AI responses becomes critical to building trustworthy experiences. Evaluations determine whether an AI feature consistently meets human expectations for clarity, usefulness, and tone. Many teams rely on manual human review, but that approach is often slow and difficult to scale.
To address this, teams are increasingly using LLMs to evaluate the outputs of other LLMs. While these approaches are faster and more efficient, are they trustworthy?
To build AI experiences people can trust, human standards must be embedded into evaluation workflows, seamlessly and at scale. But how?
In this talk, Himali will share lessons from hands-on experience designing and evolving AI features in Microsoft 365, shaped by working closely with partner teams in engineering, product design, research, and applied science. Drawing on her work in the Microsoft‑wide Content Design Community, she’ll share practical strategies for designing evaluation workflows that embed human judgment throughout the process using metrics, rubrics, and tests. She’ll also cover ways to address the limitations of human review, such as inefficiency and bias, while preserving what makes human judgment valuable: qualitative insight, nuance, and empathy.
In this session, you’ll learn how to:

Content Designer 2, Microsoft Ireland