Author: Artefact

The Future of Agentic Supervision


Last February, we published “The Future of Work with AI”, our first study on Agentic AI. We found that although AI agents will replace humans on tedious and repetitive tasks, a new type of work will appear: Agentic Supervision. During the Industrial Revolution, machines replaced humans on manual tasks, but new jobs appeared, such as machine purchasing, operational supervision and maintenance. With Agentic AI, cognitive jobs will likewise be replaced by higher-level, more productive cognitive jobs. This study takes a deep dive into the early days of Agentic Supervision and outlines the future of supervision in terms of agent lifecycle management, governance and supervision tooling.

To assess the current state of Agentic Supervision, we interviewed 14 enterprises and 5 Artefact Agentic Product Managers & Engineers. We also contacted key Agentic Supervision providers, including major Data & AI platforms with years of software supervision experience (such as Google and Microsoft) as well as specialized start-ups (WB, Giskard, Robust Intelligence…).

The first insight we found is that while Agentic Supervision extends the principles established in DevOps (software operations), DataOps (data operations), and MLOps (Machine Learning operations), it dramatically increases the demand for robust governance to keep AI Agents aligned and under control. Indeed, with “software that starts to think”, previously unseen risks are emerging, such as hallucination, reasoning errors, inappropriate tone, intellectual property infringement or even prompt hijacking. Mitigating these reliability, behavioral, regulatory and security risks now requires governance that is not only more rigorous but also broader than what has previously been applied to tech products.

This markedly greater need for governance is the challenge that may define the emerging operational paradigm of “AgentOps”. Interestingly, AgentOps will need to build upon each organization’s existing DevOps, DataOps, and MLOps foundations and governance, and companies lagging in these operational domains will have to bridge any gaps while establishing their Agentic governance framework.

The second major challenge identified by our interviewees is the need to strengthen their AI supervision tooling. Many are currently relying on existing RPA and Dev/Data/MLOps tools, or experimenting with custom-built solutions as they search for more sustainable, long-term options. The abundance of early-stage tools, and the need to envision a cohesive, end-to-end supervision system that integrates multiple components, prompted us to explore the technological dimensions of agentic supervision in greater depth. As with any TechOps framework, AgentOps supervision involves three fundamental stages: (1) Observe, (2) Evaluate, and (3) Monitor and manage incidents. While the third stage accounts for most of the supervision effort and time, the first two are essential to effective risk management. With new categories of risks to monitor, and consequently new logs, traces, and evaluation mechanisms to establish, it’s clear why interviewees consistently emphasized the need for the right tools to support scalable and reliable supervision.
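To make these three stages concrete, the sketch below wires them together in a single loop. It is a minimal illustration, not a reference implementation: `AgentTrace`, `evaluate_trace`, `raise_incident` and the 0.7 threshold are all hypothetical names and values.

```python
from dataclasses import dataclass, field
from typing import Callable

# --- (1) Observe: capture a structured trace of each agent run ---
@dataclass
class AgentTrace:
    run_id: str
    user_input: str
    agent_output: str
    steps: list[str] = field(default_factory=list)  # tool calls, reasoning hops

# --- (2) Evaluate: score each trace against a list of checks ---
def evaluate_trace(trace: AgentTrace,
                   checks: list[Callable[[AgentTrace], float]]) -> dict[str, float]:
    """Run every check (0.0 = fail, 1.0 = pass) and collect the scores."""
    return {check.__name__: check(trace) for check in checks}

# --- (3) Monitor and manage incidents: act on failing scores ---
def raise_incident(run_id: str, failing: dict[str, float]) -> None:
    print(f"[INCIDENT] run {run_id}: failing checks {failing}")  # route to a human

def monitor(trace: AgentTrace, scores: dict[str, float], threshold: float = 0.7) -> None:
    failing = {name: score for name, score in scores.items() if score < threshold}
    if failing:
        raise_incident(trace.run_id, failing)

# Example check: the answer should not leak internal tool names
def no_tool_leakage(trace: AgentTrace) -> float:
    return 0.0 if any(step in trace.agent_output for step in trace.steps) else 1.0

trace = AgentTrace("run-001", "Summarise Q3 sales", "Sales grew 4%.", steps=["sql_query"])
monitor(trace, evaluate_trace(trace, [no_tool_leakage]))
```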

“Supervision should not be an afterthought; it must be embedded early in the agent’s design and development.”

Our research into agentic supervision tools revealed three key insights. First, there is currently no all-in-one solution available. Major cloud providers like Google and Microsoft are actively developing and releasing supervision tools and frameworks aimed at covering the full spectrum of supervision needs for teams building agents on platforms such as Vertex AI (Google) and Copilot Studio (Microsoft). Second, agent supervision falls into two categories: proactive and reactive. Proactive supervision is applied during development to test agents against defined scenarios or, in production, to continuously guard against emerging threats, particularly in the area of security, or to collect aggregated performance data. Its goal is to improve agent behavior over time. Reactive supervision, on the other hand, focuses on detecting and handling live incidents. Although both types rely on observability tools and may use similar evaluation mechanisms, they differ significantly in data sources, evaluation granularity, and response strategies. Finally, our third insight is that agentic observability, evaluation, and risk mitigation remain complex and rapidly evolving domains. We anticipate substantial advancements in supervision tooling over the coming years.
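As an illustration of this split, the sketch below runs the same check in both modes: proactively over a curated scenario set before release, and reactively on each live interaction. The check, the scenarios and `escalate_to_supervisor` are all invented for the example.

```python
def check(output_text: str) -> bool:
    """Placeholder policy check, e.g. the agent must never promise a refund."""
    return "guaranteed refund" not in output_text.lower()

def escalate_to_supervisor(user_input: str, output: str) -> None:
    print(f"[LIVE INCIDENT] input={user_input!r} output={output!r}")

# Proactive: before release, replay curated scenarios and aggregate the
# results to improve the agent over time.
def proactive_test(agent, scenarios: list[str]) -> float:
    return sum(check(agent(s)) for s in scenarios) / len(scenarios)  # pass rate

# Reactive: in production, screen each live answer and escalate failures.
def reactive_guard(agent, user_input: str) -> str:
    output = agent(user_input)
    if not check(output):
        escalate_to_supervisor(user_input, output)
    return output

parrot = lambda text: f"Echo: {text}"                 # stand-in agent
print(proactive_test(parrot, ["Hello", "Refund?"]))   # 1.0: both scenarios pass
reactive_guard(parrot, "I want a guaranteed refund")  # triggers an escalation
```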

Each phase of the agentic supervision cycle (observe, evaluate, and supervise) presents its own set of challenges.

Observability first requires anticipating what data to capture, which depends heavily on having a clearly defined evaluation and supervision strategy. Without this foresight, teams risk either collecting too little information or being overwhelmed by vast, unstructured traces that hinder manual root cause analysis. Tools like LangSmith and LangChain are increasingly used to structure and streamline the observation of agent behavior. Another major challenge lies in the opacity of LLM reasoning, which must be countered by deliberately designing agent architectures and workflows to ensure traceability and transparency.
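As a concrete (and simplified) example, LangSmith’s `@traceable` decorator turns nested function calls into a structured trace tree, so a failure can be attributed to a specific step rather than to the run as a whole. The agent and tool below are invented for illustration.

```python
# pip install langsmith -- recording traces also requires LANGSMITH_TRACING=true
# and a LANGSMITH_API_KEY in the environment.
from langsmith import traceable

@traceable(run_type="tool")
def search_inventory(query: str) -> str:
    # Logged as a child run: inputs, outputs and latency of this step alone.
    return f"3 items matching '{query}'"

@traceable(run_type="chain")
def agent_run(user_input: str) -> str:
    # Logged as the parent run; the tool call above nests underneath it.
    hits = search_inventory(user_input)
    return f"I found {hits}."

agent_run("winter jackets")
```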

Evaluation in agentic AI is significantly more complex than in traditional software or data quality assessments. Where deterministic tests based on observability queries are sufficient in classical DevOps and DataOps, agentic systems often require AI to evaluate AI. This has led to the rise of LLM-as-a-judge techniques: a counterintuitive approach where one model assesses the output of another. While this raises concerns (why trust flawed AI to judge flawed AI?), studies show it often produces more consistent and scalable results than human reviewers. Nonetheless, a common pain point among interviewees was the difficulty of building reliable ground truth datasets (expert-curated question-answer pairs) to benchmark agent responses: human evaluators tend to disagree, and their answers are often incomplete.
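In practice, an LLM-as-a-judge evaluation can be as simple as a grading prompt applied to each (question, reference, answer) triple. The sketch below is illustrative: `call_llm` is a placeholder for any chat-completion API, and the 1-5 rubric is one common convention rather than a standard.

```python
JUDGE_PROMPT = """You are grading an AI agent's answer against an expert reference.
Question: {question}
Reference answer: {reference}
Agent answer: {answer}
Reply with a single integer from 1 (wrong) to 5 (correct and complete)."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in your provider's chat-completion call")

def judge(question: str, reference: str, answer: str) -> int:
    """Ask a separate (ideally stronger) model to grade the agent's answer."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return int(verdict.strip())  # real systems validate/retry malformed verdicts

# Ground-truth dataset: the expert-curated question-answer pairs
# interviewees found so hard to build.
dataset = [
    {"question": "What is our refund window?", "reference": "30 days from delivery."},
]

def run_eval(agent, dataset, pass_score: int = 4) -> float:
    scores = [judge(ex["question"], ex["reference"], agent(ex["question"]))
              for ex in dataset]
    return sum(s >= pass_score for s in scores) / len(scores)  # pass rate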

Finally, supervision and mitigation face challenges around prioritization. With a growing number of metrics and alerts, teams can quickly become overwhelmed. Standardized frameworks for alerting and metric management are a must to bring structure and clarity to agentic supervision.
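One way to bring that structure is to encode alert rules once, with explicit severities that drive routing and response times. The metrics, thresholds and severity tiers below are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    P1 = 1  # security or regulatory breach: page on-call immediately
    P2 = 2  # reliability failure on a critical flow: same-day triage
    P3 = 3  # behavioral drift (tone, verbosity): weekly review

@dataclass(frozen=True)
class AlertRule:
    metric: str         # e.g. "hallucination_rate"
    threshold: float    # alert when the metric exceeds this value
    severity: Severity

RULES = [
    AlertRule("prompt_injection_detections", 0.0, Severity.P1),
    AlertRule("hallucination_rate", 0.05, Severity.P2),
    AlertRule("off_tone_rate", 0.10, Severity.P3),
]

def triage(metrics: dict[str, float]) -> list[AlertRule]:
    """Return fired rules ordered by severity so teams handle P1s first."""
    fired = [r for r in RULES if metrics.get(r.metric, 0.0) > r.threshold]
    return sorted(fired, key=lambda r: r.severity.value)

# Example: a hallucination spike plus tone drift -> P2 is handled before P3.
for rule in triage({"hallucination_rate": 0.08, "off_tone_rate": 0.20}):
    print(rule.severity.name, rule.metric)
```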

Only a handful of organizations have successfully established effective governance and standards for agentic AI. Those with mature software and data governance frameworks have had a head start, benefiting from strong foundations and a well-established culture of observability and supervision. We observed that leveraging existing software, RPA, and data supervision practices, processes, and tools can significantly accelerate progress. However, the key challenge lies in adapting these to the dynamic risks and evolving toolsets specific to agentic AI, and in building a dedicated, future-ready governance framework. Relying too long on legacy approaches, including deterministic logic and custom-built tools, can become a constraint, limiting teams to narrow, tightly controlled agentic workflows and preventing the adoption of more autonomous, AI-orchestrated agents.

All interviewees emphasized that the key to effective agentic supervision is anticipation. Supervision should not be an afterthought; it must be embedded early in the agent’s design and development. Setting up observability and evaluation mechanisms only once the agent is in production is too late. Identifying flaws at that stage often means reworking the entire agent, which is far more costly than investing in robust supervision from the start.

The good news is that a variety of tested tool combinations and emerging agentic frameworks are already available. We strongly recommend that enterprise AI governance teams define their own standardized framework and toolset to be applied across all agentic development. This becomes even more critical as agents begin to interconnect, making system-wide control and supervision interoperability essential.

To succeed, AI governance must also align closely with strong IT and Data Governance practices, since agents rely on enterprise data and IT systems to ‘think’ and take ‘action.’ Just as IT and data governance required business involvement in the past, one of the key takeaways from our study is that agentic governance will demand even deeper business engagement.

Unlike traditional software or data supervision, typically handled by IT or data teams (and in the most mature organizations, by a business-led data governance network), agent supervision will need to be business-owned. Given the inherent unpredictability of AI agents, incident responses often require domain expertise. As a result, the business must be actively involved not just in monitoring, but in framing agent behavior from the outset. This represents a significant cultural shift: agentic AI blurs the lines between IT, data, and business, and will require new ways of working based on cross-functional collaboration. Agentic Supervision is the Future of Work with AI!

Contents:

  • Agentic AI risks are shaking up the tech governance & supervision game.
  • The new AgentOps stack: tests, guardrails and feedback loops.
  • Secure and accelerate Agentic AI with standards & global governance.
  • Conclusion
