Chain-of-Thought (CoT) prompting has revolutionized the reasoning capabilities of Large Language Models (LLMs), yet it often incurs significant computational costs due to the generation of redundant, verbose, or irrelevant reasoning steps. While various Process Reward Models (PRMs) have been proposed to evaluate step-by-step reasoning, our analysis reveals that even state-of-the-art methods (specifically ReasonEval, ThinkPRM, and Qwen-Math-2.5-PRM) struggle to effectively distinguish “valid but redundant” steps, such as excessive decomposition or context-aware irrelevant details. To systematically diagnose this limitation, we introduce RIV-GSM8K, a diagnostic benchmark synthetically injected with five distinct types of inefficiency, including Redundancy, Irrelevance, and Verbosity. Using this benchmark, we demonstrate the blind spots of existing PRMs in detecting these subtle inefficiencies. Addressing these gaps, we propose CAID (Context-Aware Information Density), a novel reference-free metric that quantifies reasoning efficiency by integrating local novelty, normalized information density, and global goal alignment. Furthermore, we present PACE (Pruning And Compression for Efficiency), a lightweight, training-free post-hoc optimization framework that leverages CAID to dynamically compress trivial intermediate steps and prune irrelevant ones. Experiments on GSM8K, StrategyQA, and ARC-Challenge show that PACE reduces token consumption by 31–53% while maintaining or even improving reasoning accuracy. These results demonstrate that information-theoretic optimization offers a more robust solution to the over-reasoning problem than existing PRM-based approaches.