τ-bench: A New Benchmark to Evaluate AI Agents’ Performance and Reliability in Real-World Settings with Dynamic User and Tool Interaction
Current benchmarks for language agents fall short in assessing their ability to interact with humans or adhere to complex, domain-specific...