DeltaRubric

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Rui Liu^1,2, Dian Yu¹, Zhenwen Liang¹, Yucheng Shi¹, Tong Zheng², Runpeng Dai³, Haitao Mi¹, Pratap Tokekar², Leoweiliang¹

¹Tencent Hunyuan ²University of Maryland, College Park ³University of North Carolina, Chapel Hill

Abstract

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.

Method

DeltaRubric trains a single shared MLLM to perform two complementary roles. The Disagreement Planner identifies concrete visual attributes, object counts, spatial relations, and hallucinated claims that distinguish two candidate responses. The Checklist Verifier then executes the generated checklist against the image and prompt before selecting the preferred response. Both roles are jointly optimized with multi-role reinforcement learning, using decoupled advantage estimates so that planning gradients reflect checklist quality while verification gradients reflect execution quality.

Training Dynamics

Verifier training accuracy curve — **Figure 1.** Verifier training accuracy compared with a no-rubric baseline.

Verifier validation accuracy curve — **Figure 2.** Verifier validation accuracy measured every five steps.

Planner probe accuracy curve — **Figure 3.** Planner probe accuracy, a proxy for checklist usefulness.

The training curves show that DeltaRubric improves both final response evaluation and intermediate checklist quality. The Planner's steadily rising probe accuracy indicates that generated checklists become increasingly decision-useful, aligning with the Verifier's higher training and validation accuracy.

Experimental Results

VL-RewardBench

DeltaRubric improves the overall accuracy of Qwen3-VL-4B and 8B Instruct base models by +22.6 and +18.8 points, respectively, and outperforms the standard no-rubric baseline across both architectures.

Models	General	Hallucination	Reasoning	Overall	Macro Avg
Open-Source Models
VITA-1.5-7B	18.6	8.9	22.1	16.5	16.5
SliME-7B	7.2	27.1	18.6	19.0	17.6
Molmo-7B	31.1	31.8	56.2	37.5	39.7
MM-RLHF-Reward-7B	45.0	50.5	57.6	50.2	51.0
InternVL2-8B	35.6	41.1	59.0	44.5	45.2
LLaVA-Critic-8B	54.6	38.3	59.1	44.5	45.2
Llama-3.2-11B	33.3	38.4	56.6	42.9	42.8
NVLM-D-72B	38.9	31.6	62.0	40.1	44.1
Llama-3.2-90B	42.6	57.3	61.7	56.2	53.9
DeltaRubric
Qwen3-VL-4B Instruct	46.4	64.9	36.0	54.9	49.1
+ No rubric	51.9	87.1	50.8	73.2	63.3
+ DeltaRubric	55.3	87.7	65.9	77.5	69.6
Qwen3-VL-8B Instruct	47.0	72.4	43.2	61.3	54.2
+ No rubric	55.8	86.1	48.3	72.0	63.4
+ DeltaRubric	59.7	88.3	72.6	80.1	73.5

Multimodal RewardBench

DeltaRubric improves the Qwen3-VL-8B Instruct overall accuracy by +5.5 points over the base model and by +4.5 points over the no-rubric baseline.

Model	Overall	General Correctness	General Preference	Knowledge	Math	Coding	Safety	VQA
Open-Source Models
VITA-1.5-7B	53.6	55.6	54.3	52.5	51.9	52.8	58.1	50.0
Molmo-7B	52.9	56.8	59.4	54.6	50.7	53.4	34.8	60.3
MM-RLHF-Reward-7B	67.1	61.7	67.5	54.3	58.4	57.9	92.9	76.8
SliME-8B	42.0	42.3	52.2	47.5	43.5	35.3	19.1	53.8
InternVL3-8B	63.6	59.6	61.6	60.5	65.1	56.6	59.3	82.3
Llama-3.2-11B	51.2	57.8	65.8	55.5	50.6	51.7	20.9	55.8
Llama-3.2-90B	61.2	60.0	68.4	61.2	56.3	53.1	52.0	77.1
DeltaRubric
Qwen3-VL-4B Instruct	65.3	66.1	56.3	52.7	46.7	54.4	80.4	70.8
+ No rubric	66.4	74.5	60.1	59.4	54.3	55.0	87.6	71.3
+ DeltaRubric	69.1	73.7	65.4	60.0	58.0	52.0	91.2	80.8
Qwen3-VL-8B Instruct	67.7	68.9	61.5	56.2	64.6	49.6	82.6	71.4
+ No rubric	68.7	75.0	62.7	56.7	64.0	51.5	91.5	77.0
+ DeltaRubric	73.2	76.9	65.9	69.5	68.7	52.6	93.3	84.9

Qualitative Examples

Qualitative comparison of no-rubric and DeltaRubric evaluations — **Example 1.** The no-rubric baseline can miss visually grounded hallucinations, while DeltaRubric generates a targeted disagreement checklist that explicitly enforces visual verification before selecting the better response.

Qualitative comparison on fine-grained visual shoe color verification — **Example 2.** The standard no-rubric baseline fails to verify fine-grained visual details and incorrectly validates the hallucinated "white shoes" in Response B. In contrast, DeltaRubric generates a targeted checklist that explicitly isolates the conflicting shoe color. By enforcing active visual verification, DeltaRubric successfully catches the hallucination and correctly selects Response A.

Qualitative comparison on logical consistency and visual evidence — **Example 3.** The standard no-rubric baseline exhibits severe logical inconsistency: it notes the absence of a tree branch but still chooses Response A. DeltaRubric systematically verifies the visual evidence, such as granular food in an open palm rather than a branch, and correctly selects Response B.

Qualitative comparison on fine-grained visual attribute binding — **Example 4.** The standard no-rubric baseline struggles with fine-grained visual attribute binding, confusing white tissues with the actual color of the box and incorrectly selecting Response B. DeltaRubric isolates and verifies the "green exterior," preventing this attribute confusion and correctly selecting Response A.

Citation

@article{liu2026deltarubric, title={DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification}, author={Liu, Rui and Yu, Dian and Liang, Zhenwen and Shi, Yucheng and Zheng, Tong and Dai, Runpeng and Mi, Haitao and Tokekar, Pratap and Leoweiliang}, journal={arXiv preprint arXiv:2605.09269}, year={2026} }