DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Rui Liu (1,2), Dian Yu (1), Zhenwen Liang (1), Yucheng Shi (1), Tong Zheng (2), Runpeng Dai (3), Haitao Mi (1), Pratap Tokekar (2), Leoweiliang (1)
(1) Tencent Hunyuan; (2) University of Maryland, College Park; (3) University of North Carolina, Chapel Hill
Overview of the DeltaRubric planner-verifier framework
DeltaRubric overview. DeltaRubric reformulates multimodal preference evaluation as a self-guided plan-and-execute process: a Disagreement Planner first synthesizes an instance-specific verification checklist, and a Checklist Verifier then executes the checklist against the image and question to produce a grounded judgment. We formulate DeltaRubric as a multi-role RL problem where planning and verification are optimized jointly within a single shared MLLM.

Abstract

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging: they exploit language priors instead of performing fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details, so robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps. First, acting as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Then, acting as a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. We validate DeltaRubric on Qwen3-VL 4B and 8B Instruct models. On VL-RewardBench, it improves base-model overall accuracy by +22.6 (4B) and +18.8 (8B) points and exceeds the corresponding no-rubric baselines by 4.3 and 8.1 points. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.

Method

DeltaRubric trains a single shared MLLM to perform two complementary roles. The Disagreement Planner identifies concrete visual attributes, object counts, spatial relations, and hallucinated claims that distinguish two candidate responses. The Checklist Verifier then executes the generated checklist against the image and prompt before selecting the preferred response. Both roles are jointly optimized with multi-role reinforcement learning, using decoupled advantage estimates so that planning gradients reflect checklist quality while verification gradients reflect execution quality.
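The two roles and the decoupled advantage estimates described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the prompt wording, the `Model` callable, the `Verdict:` output convention, and the choice of group-normalized (mean and standard deviation) advantages are all hypothetical. The source only states that one shared model plays both roles and that each role receives its own advantage estimate.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

# Hypothetical prompt templates; the source does not give the actual wording.
PLANNER_PROMPT = (
    "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
    "List neutral checks that would settle where the two responses disagree."
)
VERIFIER_PROMPT = (
    "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
    "Checklist:\n{checklist}\n"
    "Run each check against the image, then end with 'Verdict: A' or 'Verdict: B'."
)

# Assumed interface: (image, prompt) -> generated text, backed by one shared MLLM.
Model = Callable[[object, str], str]


def plan_and_verify(model: Model, image: object, question: str,
                    a: str, b: str) -> Tuple[str, str]:
    """One rollout: the same model acts as planner, then as verifier."""
    # Role 1 (Disagreement Planner): produce an instance-specific checklist.
    checklist = model(image, PLANNER_PROMPT.format(question=question, a=a, b=b))
    # Role 2 (Checklist Verifier): execute the checklist and pick a response.
    judgment = model(image, VERIFIER_PROMPT.format(
        question=question, a=a, b=b, checklist=checklist))
    if "Verdict:" not in judgment:
        return checklist, ""  # unparseable output; the caller decides how to score it
    verdict = judgment.rsplit("Verdict:", 1)[1].strip()[:1].upper()
    return checklist, verdict if verdict in ("A", "B") else ""


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group (an assumed estimator)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(planner_rewards: Sequence[float],
                         verifier_rewards: Sequence[float]
                         ) -> Tuple[List[float], List[float]]:
    """Compute each role's advantages from that role's rewards only."""
    return group_advantages(planner_rewards), group_advantages(verifier_rewards)
```

In this sketch, planner tokens would be weighted by the first list and verifier tokens by the second, so a rollout with a useful checklist but a poor execution (or the reverse) sends different learning signals to the two roles.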

Training Dynamics

Verifier training accuracy curve
Figure 1. Verifier training accuracy compared with a no-rubric baseline.
Verifier validation accuracy curve
Figure 2. Verifier validation accuracy measured every five steps.
Planner probe accuracy curve
Figure 3. Planner probe accuracy, a proxy for checklist usefulness.

The training curves show that DeltaRubric improves both final response evaluation and intermediate checklist quality. The Planner's steadily rising probe accuracy indicates that the generated checklists become increasingly decision-useful over training, which is consistent with the Verifier's higher training and validation accuracy.

Experimental Results

VL-RewardBench

DeltaRubric improves the overall accuracy of the Qwen3-VL-4B and 8B Instruct base models by +22.6 and +18.8 points, respectively, and outperforms the standard no-rubric baseline at both model sizes (77.5 vs. 73.2 for 4B and 80.1 vs. 72.0 for 8B).

Models                General  Hallucination  Reasoning  Overall  Macro Avg
Open-Source Models
VITA-1.5-7B           18.6     8.9            22.1       16.5     16.5
SliME-7B              7.2      27.1           18.6       19.0     17.6
Molmo-7B              31.1     31.8           56.2       37.5     39.7
MM-RLHF-Reward-7B     45.0     50.5           57.6       50.2     51.0
InternVL2-8B          35.6     41.1           59.0       44.5     45.2
LLaVA-Critic-8B       54.6     38.3           59.1       44.5     45.2
Llama-3.2-11B         33.3     38.4           56.6       42.9     42.8
NVLM-D-72B            38.9     31.6           62.0       40.1     44.1
Llama-3.2-90B         42.6     57.3           61.7       56.2     53.9
DeltaRubric
Qwen3-VL-4B Instruct  46.4     64.9           36.0       54.9     49.1
  + No rubric         51.9     87.1           50.8       73.2     63.3
  + DeltaRubric       55.3     87.7           65.9       77.5     69.6
Qwen3-VL-8B Instruct  47.0     72.4           43.2       61.3     54.2
  + No rubric         55.8     86.1           48.3       72.0     63.4
  + DeltaRubric       59.7     88.3           72.6       80.1     73.5
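As a quick arithmetic check on the VL-RewardBench numbers, the short sketch below recomputes the Macro Avg column and the headline gains for the Qwen3-VL rows. It assumes that Macro Avg is the unweighted mean of the General, Hallucination, and Reasoning accuracies; the source does not state this, but the assumption matches the values reported for these rows. The helper names are hypothetical.

```python
# Accuracies copied from the VL-RewardBench table above:
# (General, Hallucination, Reasoning) and the reported Overall value.
rows = {
    "Qwen3-VL-4B Instruct": ((46.4, 64.9, 36.0), 54.9),
    "Qwen3-VL-4B + DeltaRubric": ((55.3, 87.7, 65.9), 77.5),
    "Qwen3-VL-8B Instruct": ((47.0, 72.4, 43.2), 61.3),
    "Qwen3-VL-8B + DeltaRubric": ((59.7, 88.3, 72.6), 80.1),
}


def macro_avg(categories):
    """Unweighted mean over categories (assumed definition of Macro Avg)."""
    return round(sum(categories) / len(categories), 1)


def overall_gain(base, tuned):
    """Difference in reported Overall accuracy, in points."""
    return round(rows[tuned][1] - rows[base][1], 1)


for name, (categories, overall) in rows.items():
    print(f"{name}: macro avg {macro_avg(categories)}, overall {overall}")

print("4B gain:", overall_gain("Qwen3-VL-4B Instruct", "Qwen3-VL-4B + DeltaRubric"))
print("8B gain:", overall_gain("Qwen3-VL-8B Instruct", "Qwen3-VL-8B + DeltaRubric"))
```

The two gains come out to 22.6 and 18.8 points, matching the improvements quoted in the text. Overall and Macro Avg differ within a row, which suggests that Overall is computed over all samples rather than averaged per category.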

Multimodal RewardBench

DeltaRubric improves the Qwen3-VL-8B Instruct overall accuracy by +5.5 points over the base model and by +4.5 points over the no-rubric baseline.

Model                 Overall  General Correctness  General Preference  Knowledge  Math  Coding  Safety  VQA
Open-Source Models
VITA-1.5-7B           53.6     55.6                 54.3                52.5       51.9  52.8    58.1    50.0
Molmo-7B              52.9     56.8                 59.4                54.6       50.7  53.4    34.8    60.3
MM-RLHF-Reward-7B     67.1     61.7                 67.5                54.3       58.4  57.9    92.9    76.8
SliME-8B              42.0     42.3                 52.2                47.5       43.5  35.3    19.1    53.8
InternVL3-8B          63.6     59.6                 61.6                60.5       65.1  56.6    59.3    82.3
Llama-3.2-11B         51.2     57.8                 65.8                55.5       50.6  51.7    20.9    55.8
Llama-3.2-90B         61.2     60.0                 68.4                61.2       56.3  53.1    52.0    77.1
DeltaRubric
Qwen3-VL-4B Instruct  65.3     66.1                 56.3                52.7       46.7  54.4    80.4    70.8
  + No rubric         66.4     74.5                 60.1                59.4       54.3  55.0    87.6    71.3
  + DeltaRubric       69.1     73.7                 65.4                60.0       58.0  52.0    91.2    80.8
Qwen3-VL-8B Instruct  67.7     68.9                 61.5                56.2       64.6  49.6    82.6    71.4
  + No rubric         68.7     75.0                 62.7                56.7       64.0  51.5    91.5    77.0
  + DeltaRubric       73.2     76.9                 65.9                69.5       68.7  52.6    93.3    84.9

Qualitative Examples

Qualitative comparison of no-rubric and DeltaRubric evaluations
Example 1. The no-rubric baseline can miss visually grounded hallucinations, while DeltaRubric generates a targeted disagreement checklist that explicitly enforces visual verification before selecting the better response.
Qualitative comparison on fine-grained visual shoe color verification
Example 2. The standard no-rubric baseline fails to verify fine-grained visual details and incorrectly validates the hallucinated "white shoes" in Response B. In contrast, DeltaRubric generates a targeted checklist that explicitly isolates the conflicting shoe color. By enforcing active visual verification, DeltaRubric successfully catches the hallucination and correctly selects Response A.
Qualitative comparison on logical consistency and visual evidence
Example 3. The standard no-rubric baseline exhibits severe logical inconsistency: it notes the absence of a tree branch but still chooses Response A. DeltaRubric systematically verifies the visual evidence, such as granular food in an open palm rather than a branch, and correctly selects Response B.
Qualitative comparison on fine-grained visual attribute binding
Example 4. The standard no-rubric baseline struggles with fine-grained visual attribute binding, confusing white tissues with the actual color of the box and incorrectly selecting Response B. DeltaRubric isolates and verifies the "green exterior," preventing this attribute confusion and correctly selecting Response A.

Citation

@article{liu2026deltarubric,
  title={DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification},
  author={Liu, Rui and Yu, Dian and Liang, Zhenwen and Shi, Yucheng and Zheng, Tong and Dai, Runpeng and Mi, Haitao and Tokekar, Pratap and Leoweiliang},
  journal={arXiv preprint arXiv:2605.09269},
  year={2026}
}