Agent Benchmarking - Search News

Benchmarking AI Agents on Kubernetes

Brandon Foley published a benchmarking study on the CNCF blog showing that AI coding agents can find and fix isolated bugs.

Tech Times

AI Agent Safety: Benchmark Finds None of 13 Agents Cleared 40% Safe Completion

AI agent safety benchmark BeSafe-Bench tested 13 production-grade agents and found none could complete 40% of tasks while ...

Claude, GPT, Gemini Agents Fail 72% of U.S. Healthcare Workflows, New Benchmark Finds

Pleasanton, CA - May 20, 2026 - PRESSADVANTAGE - AI company actAVA.ai today released CHI-Bench, the world’s first ...

InfoWorld

Researchers reveal flaws in AI agent benchmarking

As agents using artificial intelligence have wormed their way into the mainstream for everything from customer service to fixing software code, it’s increasingly important to determine which are the ...

12d on MSN

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing ...

Your AI agents need a terminal, not just a vector database

DCI lets AI agents search raw files with grep and bash instead of embeddings — boosting accuracy 11 points and cutting ...

Tech Times

Visual State Cards in AI Agent Skills More Than Double Small Model Success Rates on Real Desktop Tasks

Reliable desktop automation has long come with a hidden tax: the more complex the software environment, the larger — and more ...

Yahoo Finance

UiPath Screen Agent Powered by Claude Opus 4.5 Receives Top Ranking on OSWorld-Verified Benchmark for Agentic Automation

6d on MSN · Opinion

What AI coding benchmarks still miss about software quality

AI coding benchmarks miss long-term code quality degradation from repeated iterative changes.
