From Computer Vision Certainty to Agent Adaptability: Engineering for Stability in a Changing AI Landscape
Sep 11, 2025
By Jake Gellatly, Director of Product & Jacob Petrisko, Director of Artificial Intelligence
When we first developed computer vision models for our customers, the workflow was relatively straightforward: fine-tune a model, lock and version its parameters, and guarantee reproducibility. Feed the same image into the model today or six months from now, and the prediction would remain identical. Accuracy was measured through well-established metrics, including precision, recall, and F1 score. Performance was predictable, trackable, and deterministic.
The world of AI agents is fundamentally different.
Agents are not static models. They are adaptive systems capable of integrating reasoning, planning, tool usage, and multi-step decision-making. They are designed to solve complex, domain-specific problems by combining unstructured inputs, including images, documents, and free text, with dynamic external data sources. Even when sampling parameters such as temperature and top-k are set to minimize randomness, outputs can still vary. The same task executed twice may follow distinct reasoning paths and produce different outcomes.
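To make this concrete, here is a minimal sketch, assuming an OpenAI-style chat completions client; the model name and prompt are placeholders, not a specific setup from our stack:

```python
# Minimal sketch of run-to-run variance, assuming an OpenAI-style
# chat completions API. Model name and prompt are placeholders.
from collections import Counter

from openai import OpenAI

client = OpenAI()

def sample_outputs(prompt: str, n_runs: int = 5) -> Counter:
    """Run the same prompt repeatedly with randomness minimized."""
    outputs = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,   # minimize sampling randomness
        )
        outputs.append(response.choices[0].message.content)
    return Counter(outputs)

# For nontrivial prompts, the counter frequently contains more than
# one distinct completion, even at temperature 0.
print(sample_outputs("Summarize the risks of an address mismatch in a property inspection."))
```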
This inherent nondeterminism introduces a key challenge for evaluation and quality assurance. Unlike traditional models, agents have no standardized reproducibility metric. Typical accuracy metrics fall short when outputs include reasoning traces, structured JSON responses, or multimodal interpretations. Without additional engineering, it is difficult to quantify performance over time or detect subtle drift in behavior.
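One lightweight proxy is pairwise exact-match agreement across repeated runs. The sketch below is an illustration, not a standard metric, and assumes structured outputs can be canonicalized as JSON before comparison:

```python
# Pairwise exact-match agreement as a reproducibility proxy. JSON
# canonicalization is an assumption; non-JSON outputs fall back to
# raw-text comparison.
import json
from itertools import combinations

def canonicalize(output: str) -> str:
    """Normalize JSON so key order and whitespace don't count as drift."""
    try:
        return json.dumps(json.loads(output), sort_keys=True)
    except json.JSONDecodeError:
        return output.strip()

def agreement_rate(outputs: list[str]) -> float:
    """Fraction of run pairs whose canonical outputs match exactly."""
    canon = [canonicalize(o) for o in outputs]
    pairs = list(combinations(canon, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

# 1.0 means fully reproducible; lower values quantify how often
# repeated runs diverge.
```

Tracked per test case over time, a number like this makes drift measurable even when no single output is strictly wrong.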
Our Solution: A Test-and-Monitor Framework for Agents
To address these challenges, we developed a test-and-monitor framework tailored for AI agents. Every agent we deploy is built and validated on a curated library of domain-specific test cases. These include:
– Pass/fail checks, such as verifying whether the agent correctly flags an address mismatch between photos and input data.
– Complex multimodal evaluations, such as detecting and classifying dozens of distinct property damages in inspection imagery.
Each test case includes human-generated ground-truth labels, enabling us to measure accuracy both qualitatively and quantitatively. This structured approach allows our team to iterate on agent design, optimize tool usage, and validate performance before deployment in production workflows.
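As a rough illustration of how such a library can be represented, here is a sketch; the schema, the agent interface, and the address-mismatch check are hypothetical, not our production code:

```python
# Hypothetical test-case schema and suite runner. The agent is modeled
# as a callable from structured inputs to a structured output dict.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AgentTestCase:
    name: str
    inputs: dict[str, Any]        # images, documents, free text
    ground_truth: dict[str, Any]  # human-generated labels
    check: Callable[[dict[str, Any], dict[str, Any]], bool]

def run_suite(agent: Callable[[dict[str, Any]], dict[str, Any]],
              suite: list[AgentTestCase]) -> float:
    """Return the agent's pass rate over a curated test suite."""
    passed = sum(tc.check(agent(tc.inputs), tc.ground_truth) for tc in suite)
    return passed / len(suite)

# Example pass/fail check: does the agent flag an address mismatch?
address_case = AgentTestCase(
    name="address_mismatch",
    inputs={"photo": "front_door.jpg", "listed_address": "12 Elm St"},
    ground_truth={"address_mismatch": True},
    check=lambda out, gt: out.get("address_mismatch") == gt["address_mismatch"],
)
```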
Freezing and Monitoring for Drift
Once an agent reaches its peak measured performance, we “freeze” that version for production deployment. But freezing is only the beginning. Even without modifying the agent code or its backbone, outputs can drift over time. Updates to foundational model APIs, system-level changes, or shifts in external data sources can subtly alter behavior.
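One way to picture what freezing pins down: every identifier that influences behavior, fingerprinted so silent changes become detectable. The sketch below is illustrative; the fields and values are assumptions, not our deployment manifest:

```python
# Illustrative "frozen" configuration: pin dated model snapshots, the
# exact prompt (by hash), and tool versions, then fingerprint the whole.
import hashlib
import json

frozen_agent = {
    "agent_version": "inspection-agent/2.3.1",  # hypothetical
    "model": "gpt-4o-2024-08-06",               # dated snapshot, not a floating alias
    "temperature": 0,
    "prompt_sha256": "<hash of the exact system prompt>",
    "tools": {"geocoder": "1.4.2", "damage_classifier": "0.9.0"},
}

# Any change to the frozen configuration changes the fingerprint,
# so configuration drift is caught at deploy time.
fingerprint = hashlib.sha256(
    json.dumps(frozen_agent, sort_keys=True).encode()
).hexdigest()
```

Even a fully pinned configuration cannot freeze the provider's serving stack, which is exactly why monitoring remains necessary.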
To mitigate this risk, we implemented proactive, scheduled monitoring. Our system re-runs curated test sets against production agents at regular intervals, tracking accuracy trends, reproducibility, and output consistency. If performance degradation is detected, validated backup agents automatically assume the workload, ensuring uninterrupted service and consistent quality.
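A simplified version of that loop, reusing run_suite from the earlier sketch; the threshold, metrics call, and alerting hook are hypothetical stand-ins:

```python
# Scheduled drift monitoring with automatic failover. log_metric and
# alert are stand-ins for a real metrics pipeline and paging hook.
import time

ACCURACY_FLOOR = 0.95  # illustrative threshold

def log_metric(name: str, value: float) -> None:
    print(f"{name}={value:.3f}")

def alert(message: str) -> None:
    print(f"ALERT: {message}")

def monitor(primary, backup, suite, interval_hours: float = 24) -> None:
    """Re-run the curated suite on a schedule; fail over if accuracy drops."""
    active = primary
    while True:
        pass_rate = run_suite(active, suite)  # run_suite from the sketch above
        log_metric("agent_pass_rate", pass_rate)
        if pass_rate < ACCURACY_FLOOR and active is primary:
            # Degradation detected: promote the pre-validated backup agent.
            active = backup
            alert("primary agent drifted below floor; backup promoted")
        time.sleep(interval_hours * 3600)
```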
Why This Matters for AI in Real Estate
In real estate, data integrity is critical. Whether assessing property damage, reviewing titles, or validating appraisals, a single inconsistent or incorrect output can have financial, operational, and legal consequences. By engineering for both accuracy and reproducibility, we are building AI systems that do more than perform today. We are ensuring that they remain reliable tomorrow, next month, and next quarter.
The transition from deterministic computer vision models to adaptive AI agents requires new thinking, new tooling, and rigorous monitoring strategies. Our framework provides the confidence to deploy agents in high-stakes workflows while maintaining the consistency and reliability that customers expect.
At our core, we’re still the same team that demanded 100% reproducibility from our computer vision models—we’ve just evolved our tools to meet the complexity of modern AI agents. Stability may be harder to achieve, but with the right engineering, it’s far from impossible.