Experiential Reinforcement Learning (ERL) from Shi et al. introduces a training paradigm that augments standard reinforcement learning with an explicit experience-reflection-and-consolidation loop. This approach enables large language models to self-improve by transforming environmental feedback into structured behavioral revisions and internalizing those lessons for efficient, durable performance gains at inference time, demonstrating up to an 81% reward increase on complex sparse-reward tasks such as Sokoban.
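To make the loop concrete, here is a minimal Python sketch of an experience-reflection-and-consolidation cycle in the spirit of ERL. The interfaces (`agent.act`, `agent.reflect`, the environment API, `ExperienceMemory`) are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Minimal sketch of an ERL-style loop: act, reflect on feedback,
# consolidate the lesson so later episodes reuse it.

class ExperienceMemory:
    """Consolidated lessons distilled from past episodes."""
    def __init__(self):
        self.lessons = []

    def render(self):
        return "\n".join(f"- {lesson}" for lesson in self.lessons)

def run_episode(agent, env, memory):
    """Roll out one episode, conditioning the policy on stored lessons."""
    obs, done, trajectory = env.reset(), False, []
    while not done:
        action = agent.act(obs, context=memory.render())
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
    return trajectory

def train(agent, env, memory, episodes=100):
    for _ in range(episodes):
        traj = run_episode(agent, env, memory)
        # Reflection: turn raw environmental feedback into a structured
        # behavioral revision, e.g. "never push a box into a corner".
        lesson = agent.reflect(traj)
        # Consolidation: internalize the lesson so future episodes (and
        # inference-time runs) benefit without re-learning it.
        memory.lessons.append(lesson)
```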
GLM-5, a foundation model from Zhipu AI and Tsinghua University, facilitates the shift from human-guided "vibe coding" to autonomous "agentic engineering" in AI. It delivers state-of-the-art results on agentic, reasoning, and coding benchmarks, often matching or exceeding leading proprietary models, driven by innovations like DeepSeek Sparse Attention and advanced reinforcement learning.
SkillsBench introduces the first benchmark to systematically evaluate the effectiveness of "Agent Skills," which are structured procedural knowledge packages designed to augment large language model agents. The research finds that human-curated Skills improve agent performance by an average of +16.2 percentage points, with optimal benefits from 2-3 concise Skills, and that self-generated Skills offer little to no gain.
The Sphere Encoder framework, developed by researchers at Meta and the University of Maryland, enables high-fidelity image generation with minimal inference steps by encoding images onto a uniformly distributed spherical latent space. This method bypasses the multi-step processes of diffusion models and resolves the "posterior hole" issue in VAEs, delivering competitive generative fidelity with just a few passes.
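A plausible core operation, sketched below under the assumption that the encoder normalizes its latents onto the unit hypersphere and spreads them with a standard uniformity objective (Wang & Isola, 2020). The paper's exact encoder and regularizer may differ; the function names here are illustrative.

```python
import torch

def project_to_sphere(z: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # A uniformly covered sphere leaves no low-density "posterior holes"
    # between training codes for the decoder to stumble into.
    return z / (z.norm(dim=-1, keepdim=True) + eps)

def uniformity_loss(z_sphere: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    # log E[exp(-t * ||x - y||^2)] over distinct pairs of codes;
    # lower values mean the codes spread more evenly over the sphere.
    return torch.pdist(z_sphere).pow(2).mul(-t).exp().mean().log()
```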
Researchers at the University of Tübingen developed the "Intrinsic Credit Assignment for Long Horizon Interaction" (ΔBelief-RL) framework, which trains agents by using their own internal belief changes as a dense, per-turn reward signal. This approach allowed smaller models to significantly outperform larger general-purpose LLMs and state-of-the-art baselines in active information-seeking tasks, demonstrating enhanced efficiency and strong generalization across various interactive environments.
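One natural reading of "internal belief changes as a dense reward" is sketched below: reward each turn by how much it reduces the entropy of the agent's belief over candidate answers. This is an illustrative instantiation with hypothetical names (`entropy`, `belief_delta_reward`); the paper's actual signal may use a different belief representation or divergence.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def belief_delta_reward(before: np.ndarray, after: np.ndarray) -> float:
    # Dense per-turn reward: how much this turn sharpened the agent's
    # belief distribution over candidate answers.
    return entropy(before) - entropy(after)

# Example: a question that halves the hypothesis space earns log(2) reward.
before = np.full(4, 0.25)                 # uniform over 4 candidates
after = np.array([0.5, 0.5, 0.0, 0.0])    # two candidates ruled out
assert abs(belief_delta_reward(before, after) - np.log(2)) < 1e-9
```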
Categorical Flow Maps (CFMs) introduce a self-distillable flow-matching method that adapts continuous-domain acceleration techniques to generate discrete data, achieving high-quality results in one to a few steps. The approach consistently outperforms prior few-step baselines across molecular graph synthesis, binarized image generation, and text modeling.
Researchers from ByteDance and collaborating institutions developed BitDance, an autoregressive generative model that leverages high-entropy binary tokens and a binary diffusion head to enhance visual expressiveness and sampling efficiency. This approach enables over 30x faster inference for high-resolution image generation and achieves strong performance in class-conditional and text-to-image tasks.
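As a rough illustration of what "binary tokens" could mean mechanically, the sketch below binarizes continuous latents to ±1 with a straight-through gradient estimator, a standard trick for training through hard quantization. This is an assumption for illustration only, not BitDance's actual tokenizer.

```python
import torch

def binarize_st(z: torch.Tensor) -> torch.Tensor:
    """Quantize latents to {-1, +1} with a straight-through gradient.

    Forward pass emits hard binary codes (maximum per-dimension entropy
    when the bits are balanced); backward pass treats the op as identity
    so the upstream encoder still receives gradients.
    """
    b = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
    return z + (b - z).detach()
```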
An investigation into large-scale AI agent societies on the Moltbook platform reveals that while systems can scale extensively, they do not spontaneously develop human-like socialization dynamics. The study found high individual agent inertia, ineffective content adaptation to community feedback, and a failure to establish stable influence hierarchies or shared social memory.
This paper identifies translation symmetry in low-order token co-occurrence statistics as the organizing principle behind the emergence of geometric structures in language model representations. It develops a mathematical theory predicting circles for cyclic concepts and 1D manifolds for continuous sequences, validating these predictions across various shallow and deep language models.
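The circle prediction is easy to verify on a toy example. Below, a circulant (translation-symmetric) co-occurrence matrix over seven "weekday" tokens is eigendecomposed; because the eigenvectors of a circulant matrix are Fourier modes, the leading non-constant pair places all seven tokens on a circle. The exponential kernel is an illustrative choice, not the paper's estimator.

```python
import numpy as np

n = 7
d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
d = np.minimum(d, n - d)                  # cyclic token distance
C = np.exp(-d.astype(float))              # circulant co-occurrence matrix

vals, vecs = np.linalg.eigh(C)            # eigenvalues in ascending order
circle = vecs[:, [-2, -3]]                # leading non-constant Fourier pair

radii = np.linalg.norm(circle, axis=1)
print(np.allclose(radii, radii[0]))       # True: all 7 tokens sit on one circle
```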
Researchers from EPFL and affiliated institutions developed a robust null-calibration framework for representational similarity analysis, correcting for biases related to model dimension and comparison depth. Their findings indicate that while global representation convergence trends are often artifactual, local neighborhood relationships consistently emerge across neural networks and modalities, leading to the proposed Aristotelian Representation Hypothesis.
View blogResearchers from the University of Virginia and Google introduced the Deep-Thinking Ratio (DTR), a metric that measures LLM reasoning effort by tracking how stable internal token predictions become across model layers. This metric enabled Think@n, a test-time scaling strategy that reduces inference costs by approximately 50% while maintaining or improving accuracy compared to standard self-consistency.
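A hedged sketch of what such a layer-stability measurement could look like, using the well-known logit-lens technique on GPT-2: decode every hidden layer through the final layer norm and unembedding, then measure what fraction of layers already agree with the final prediction. The paper's exact DTR definition, models, and aggregation may differ; `layer_stability` is a name chosen here for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # any causal LM that exposes hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()

@torch.no_grad()
def layer_stability(prompt: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    hs = model(ids).hidden_states            # (n_layers + 1) hidden states
    # Logit lens: read each layer's last-position state through the
    # final layer norm and the unembedding matrix.
    preds = [
        model.lm_head(model.transformer.ln_f(h))[0, -1].argmax().item()
        for h in hs
    ]
    final = preds[-1]
    # Fraction of layers already committed to the final token: high means
    # the prediction settled early (shallow), low means late (deep).
    return sum(p == final for p in preds) / len(preds)
```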
ThunderAgent presents a program-aware inference system designed for high-throughput serving and reinforcement learning rollout of complex LLM agent workflows. It achieves 1.48–3.58x throughput gains over vLLM and 1.79–3.92x improvements in distributed RL rollouts by unifying the management of LLM inference and external tool execution.
An empirical evaluation found that LLM-generated repository-level context files generally decrease coding-agent task success rates by 0.5–2% and raise operational costs by over 20%, while human-provided context offers only a marginal performance improvement (a 4% increase) at added expense (up to a 19% cost rise).
The BEACONS framework from Princeton University and PPPL introduces formally verified neural solvers for partial differential equations (PDEs), providing rigorous, provable L-infinity error bounds that hold even in extrapolatory scenarios. This neurosymbolic approach, combining approximation theory with algebraic compositionality, demonstrated superior accuracy, stability, and conservation properties across diverse linear and nonlinear PDE problems compared to conventional neural networks.
Google DeepMind researchers present a comprehensive framework for intelligent AI delegation, integrating insights from human organizational theory with advanced AI protocols and cryptography. This approach establishes a robust system for task distribution, accountability, and trust within hybrid AI-human networks, addressing the complexities of future agentic environments.
Researchers from the UK AI Security Institute and the University of Oxford developed Boundary Point Jailbreaking (BPJ), the first fully automated black-box attack capable of bypassing state-of-the-art safeguards such as Anthropic's Constitutional Classifiers and OpenAI's GPT-5 input classifier. The method raised average rubric scores on unseen harmful queries from 0% to 25.5–68% against Constitutional Classifiers and to 75.6% against GPT-5.
AnchorWeave presents a framework for long-horizon, world-consistent video generation by utilizing multiple local geometric memories and a multi-anchor weaving controller. This method outperforms existing approaches in maintaining spatial consistency and visual quality across extensive camera movements and generalizes to diverse scenarios.
Researchers from Stanford University, Microsoft Spatial AI Lab, and ETH Zurich developed CoPE-VideoLM, a framework that leverages codec primitives like motion vectors and residuals for efficient video language models. This approach reduces the time-to-first-token by 86.2% and enables processing of videos up to 8 hours long within a fixed context, while also improving accuracy on temporal reasoning tasks.
Researchers from Harbin Institute of Technology, Xiaohongshu Inc., and Shanghai Jiao Tong University introduced REDSearcher, a scalable and cost-efficient framework for training long-horizon deep-search agents in both text and multimodal environments. The framework achieves state-of-the-art performance among open-source agents, demonstrating strong capabilities on complex benchmarks and reducing tool calls by 10.4% through efficient training and data synthesis.