The Agentick benchmark pits RL, LLM, VLM, hybrid, and human agents against one another on 37 tasks. GPT-5 mini leads with an oracle-normalized score (ONS) of 0.309, but no paradigm dominates.

Key facts
- 37 procedurally generated tasks across six capability categories
- 27 agent configurations evaluated over 90,000+ episodes
- GPT-5 mini leads at 0.309 oracle-normalized score
- Reasoning harness improves LLM performance 3-10x
- ASCII observations outperform natural language across all agents

Researchers from Google DeepMind and Université de Montréal released Agentick, a unified benchmark for sequential decision-making agents. The benchmark provides 37 procedurally generated tasks across six capability categories, including planning, multi-agent coordination, and memory, with four difficulty levels and five observation modalities [per the arXiv preprint].…