Conformance Tests - Open Harness

Full Suite Coverage

How each adapter performs against the complete test suite (85 tests). Measures overall capability and completeness.

Adapter	Coverage	Passed	Not Implemented
Claude Code	73%	62 / 85	21
Deep Agents	48%	41 / 85	38
Goose	46%	39 / 85	42
Letta	41%	35 / 85	43

Claimed Feature Reliability

How well each adapter implements the features it claims to support. Only tests for declared capabilities.

Adapter	Reliability	Passed	Failed
Claude Code	97%	62 / 64	2
Letta	92%	35 / 38	3
Goose	91%	39 / 43	4
Deep Agents	87%	41 / 47	6

Test results grouped by capability area.

Category	Tests	Claude Code	Letta	Goose	Deep Agents
Execution	6	6/6	5/6	5/6	4/6
Streaming	6	6/6	6/6	5/6	6/6
Tool Events	5	5/5	5/5	5/5	5/5
Sessions	7	--	--	--	--
Agents	7	--	7/7	--	--
Memory	7	--	7/7	--	--
Subagents	6	6/6	--	--	6/6
MCP	6	6/6	--	6/6	--
Files	8	8/8	--	8/8	8/8
Planning	7	7/7	--	--	7/7
Hooks	6	6/6	--	--	--
Skills	7	7/7	--	7/7	--
Tools API	7	5/7	5/7	3/7	5/7

Full Suite Coverage measures how much of the complete API an adapter implements. An adapter that supports few capabilities will score lower here, regardless of how well those features work.
Claimed Feature Reliability measures how well an adapter implements the features it claims to support. This validates that declared capabilities actually work as expected.
"--" in the category table indicates that the adapter does not declare support for that capability.
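The two metrics defined above can be sketched in a few lines of Python. This is a minimal illustration, assuming coverage is passed tests divided by all tests in the suite and reliability is passed tests divided by tests for declared capabilities (passed plus failed). The `AdapterResult` class and its field names are hypothetical and not part of Open Harness; the numbers are the Claude Code row from the tables.

```python
from dataclasses import dataclass


@dataclass
class AdapterResult:
    """Hypothetical container for one adapter's test counts."""
    name: str
    passed: int
    failed: int
    not_implemented: int

    @property
    def total(self) -> int:
        # Every test in the suite falls into exactly one of the three buckets.
        return self.passed + self.failed + self.not_implemented

    def coverage(self) -> float:
        # Full Suite Coverage: passed tests over the complete suite.
        return self.passed / self.total

    def reliability(self) -> float:
        # Claimed Feature Reliability: passed tests over declared-capability tests.
        declared = self.passed + self.failed
        return self.passed / declared


# Claude Code row from the tables: 62 passed, 2 failed, 21 not implemented.
claude = AdapterResult("Claude Code", passed=62, failed=2, not_implemented=21)
print(f"{claude.name}: coverage {claude.coverage():.0%}, "
      f"reliability {claude.reliability():.0%}")
# prints "Claude Code: coverage 73%, reliability 97%"
```

Because not-implemented tests count against coverage but not against reliability, an adapter with a narrow feature set can score low on the first metric and high on the second.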

Some Goose and Deep Agents failures are due to missing API keys in the test environment.
The Tools API register/unregister tests have a known test issue: an async call is not awaited.
Session tests show no results because no adapter currently exposes session management.
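The "async not awaited" issue noted above is a general failure mode, and a small Python sketch shows how it can make a test report the wrong result. The source does not show the actual test code or the language the suite is written in, so everything here is an assumption for illustration: `ToolRegistry`, `register`, and both test functions are hypothetical names, not the Open Harness API.

```python
import asyncio


class ToolRegistry:
    """Hypothetical in-memory registry, used only for illustration."""

    def __init__(self) -> None:
        self.tools: set[str] = set()

    async def register(self, name: str) -> None:
        await asyncio.sleep(0)  # stand-in for real asynchronous work
        self.tools.add(name)


async def check_register_not_awaited() -> bool:
    registry = ToolRegistry()
    coro = registry.register("search")  # bug: coroutine is created but never awaited
    registered = "search" in registry.tools  # checked before registration has run
    coro.close()  # avoid the "coroutine was never awaited" warning in this sketch
    return registered


async def check_register_awaited() -> bool:
    registry = ToolRegistry()
    await registry.register("search")  # fix: wait for registration to finish
    return "search" in registry.tools


buggy = asyncio.run(check_register_not_awaited())
fixed = asyncio.run(check_register_awaited())
print(buggy, fixed)  # prints "False True"
```

In the unawaited version the assertion runs before the registration does, so the test fails even though the adapter code is correct. A failure of this kind reflects a defect in the test rather than in the adapter.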