Scaling Reward Modeling without Human Supervision

1Harvard University    2Cornell University    3Microsoft Research    4Kempner Institute

Motivation

Reinforcement learning from human feedback (RLHF) [1] has been central to aligning modern language models, but it hinges on a critical component: the reward signal. High-quality preference data is expensive to collect and annotate [2], [3], [4]. Moreover, preference labels, whether from human annotators or strong LLM judges, are often subjective and noisy [1], [5], [6], [7], [8], which further amplifies reward-hacking behaviors such as deception or alignment faking. The table below shows the estimated candidate-completion and annotation costs for two representative datasets, UltraFeedback [2] and UltraInteract [3], using gpt-4o, a GPT-4-class model (dataset statistics from [9]).

| Dataset | #Pairs | Prompt Length | Chosen Length | Rejected Length | Completion Cost ($) | Judge Cost ($) |
|---|---|---|---|---|---|---|
| UltraFeedback | 340,025 | 161.5 | 279.5 | 211.1 | 1,444.36 | 725.83 |
| UltraInteract | 161,927 | 507.4 | 396.6 | 416.7 | 1,217.89 | 616.28 |
This motivates a key question: how much of the supervision that reward models typically require can instead be learned in an unsupervised manner? We propose Reward-Based Scaling (RBS), an unsupervised framework that converts web text into implicit preferences by treating natural continuations as "chosen" samples and mismatched in-batch continuations as "rejected" samples, producing online preference data at essentially zero annotation cost.


Method: Continuation-Based Online Preference Pairing

We present an online, continuation-based preference labeling approach that turns raw web text into pairwise supervision for training a reward model (RM)—no curated preference pairs or human labels required.

The key idea is to reuse next-token continuation as an implicit preference signal. For each text sequence, we sample a random breakpoint and split it into a prefix prompt \(p\) and its true continuation \(r\). In a minibatch of \(B\) prefix–continuation pairs \(\{(p_i, r_i)\}_{i=1}^{B}\), we treat \(r_i\) as the chosen response for \(p_i\) and use the other continuations \(\{r_j\}_{j \neq i}\) as rejected in-batch negatives. This creates an all-to-all set of preference comparisons on the fly.

Given an RM score \(s_\theta(p,r)\), we optimize a Bradley–Terry (pairwise logistic) objective with in-batch negatives:

\[ \mathcal{L}_{\text{BT}} = \frac{1}{B}\sum_{i=1}^{B}\frac{1}{B-1}\sum_{j\neq i} -\log \sigma\!\big(s_{\theta}(p_i, r_i) - s_{\theta}(p_i, r_j)\big). \]

We add a quadratic score-centering regularizer to prevent reward drift and overly confident, large-magnitude scores:

\[ \mathcal{L}_{\text{center}} = \mathbb{E}\Big[ s_{\theta}(p_i,r_i)^2 + \frac{1}{B-1}\sum_{j\neq i} s_{\theta}(p_i,r_j)^2 \Big]. \]

The final training objective is \( \mathcal{L}=\mathcal{L}_{\text{BT}} + c\,\mathcal{L}_{\text{center}} \), where \(c\) controls the strength of the centering term.
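As a concrete illustration, the combined objective can be sketched from a score matrix. This is a minimal NumPy version, where `scores[i][j]` stands in for \(s_\theta(p_i, r_j)\); the function name and the score-matrix interface are our assumptions, not the paper's implementation:

```python
import numpy as np

def rbs_loss(scores, c=0.01):
    """In-batch Bradley-Terry loss with quadratic score centering.

    scores: (B, B) matrix with scores[i, j] = s_theta(p_i, r_j);
    the diagonal holds the chosen (true-continuation) scores.
    """
    B = scores.shape[0]
    diag = np.diag(scores)                     # s(p_i, r_i), the chosen scores
    margins = diag[:, None] - scores           # s(p_i, r_i) - s(p_i, r_j)
    off = ~np.eye(B, dtype=bool)               # mask out the j == i terms
    # Bradley-Terry: -log sigmoid(margin), averaged over the B(B-1) pairs
    l_bt = np.mean(np.log1p(np.exp(-margins[off])))
    # Centering: mean squared score over chosen and rejected entries
    l_center = np.mean(diag**2) + np.mean(scores[off]**2)
    return l_bt + c * l_center
```

Averaging uniformly over the off-diagonal entries reproduces the nested \(\frac{1}{B}\sum_i \frac{1}{B-1}\sum_{j\neq i}\) weighting in the displayed loss.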

Workflow diagram showing prefix/suffix splitting and in-batch preference pairing

Protocol

We evaluate our RM training recipe with a unified protocol spanning training, preference-alignment benchmarking, and downstream utility evaluation (best-of-N selection and RL).

Data and RM Backbones

Training corpora. We train RMs on math-heavy web text from FineMath and InfiMM-WebMath under a fixed budget of 11M tokens. Data entries are concatenated into a continuous stream, chunked into fixed-length sequences of \(L\) tokens, and split into prefix–suffix pairs with \(L_1\) prompt tokens and \(L_2\) continuation tokens.
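The chunk-and-split step might look like the sketch below (function and argument names are ours; the breakpoint here is sampled per chunk, matching the random-breakpoint description in the Method section, with a fixed \(L_1/L_2\) split being the constant-breakpoint special case):

```python
import random

def stream_to_pairs(tokens, L, seed=0):
    """Chunk a concatenated token stream into length-L sequences, then split
    each chunk at a sampled breakpoint into a prefix (prompt) and its true
    continuation. A sketch under the assumptions stated above."""
    rng = random.Random(seed)
    pairs = []
    for start in range(0, len(tokens) - L + 1, L):
        chunk = tokens[start:start + L]
        # sample a breakpoint so both prefix and continuation are non-empty
        L1 = rng.randint(1, L - 1)
        pairs.append((chunk[:L1], chunk[L1:]))
    return pairs
```

Each minibatch of such pairs then supplies the chosen continuations and in-batch negatives described in the Method section.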

Backbones. To test robustness across model types and scales, we train RMs from two backbone families:

  • Base: Llama-3.2-1B, Llama-3.2-3B
  • Instruction-tuned: Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct

Preference-Alignment Benchmarks

We evaluate preference alignment on RewardBench v1 and RewardBench v2. As our training data is math-focused, we report performance on RewardBench subsets categorized as in-domain (ID) or out-of-domain (OOD), at the following four levels:

  • Overall: average across all subsets
  • ID: Reasoning subset for RewardBench v1 and Math subset for RewardBench v2
  • OOD-Safety: Safety subsets for both RewardBench v1 and v2
  • OOD-Others: other subsets aggregated

Downstream Utility: Best-of-N Selection

Tasks. ID: MATH500, GSM8K. OOD-Safety: Toxigen, IFEval.

Protocol. We evaluate two trained reward models (FineMath-RM-Llama-3.2-3B and FineMath-RM-Qwen-2.5-7B) using best-of-N: for each input, we sample N candidate solutions from an actor, score each candidate with the RM, and evaluate the highest-scoring selection. We sweep N ∈ {1, 2, 4, 8, 16, 32} using three actors: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct.

Metric. Accuracy as a function of N; we summarize with MAP, the maximum accuracy achieved along the best-of-N curve.
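The selection loop and the MAP summary can be sketched as follows (the helper names `rm_score` and `is_correct` are hypothetical stand-ins for the RM scorer and the task grader):

```python
def best_of_n_curve(candidates, rm_score, is_correct, ns=(1, 2, 4, 8, 16, 32)):
    """Best-of-N accuracy curve over a prompt set, plus its MAP summary.

    candidates: list of per-prompt candidate lists (each at least max(ns) long).
    A sketch of the protocol; helper names are assumptions.
    """
    accs = []
    for n in ns:
        correct = 0
        for cands in candidates:
            pick = max(cands[:n], key=rm_score)  # RM selects among first n samples
            correct += is_correct(pick)
        accs.append(correct / len(candidates))
    return accs, max(accs)  # MAP = maximum accuracy along the curve
```

Because candidates are drawn once and reused as prefixes, the curve for each \(N\) scores the same underlying samples, keeping the sweep cheap.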

Downstream Utility: Policy Optimization

Tasks. To assess how well the RM serves as a learning signal for actor training, we train actors on the MATH and GSM8K training splits and evaluate generalization on the corresponding test splits.

Protocol. We train actors with GRPO using FineMath-RM-Qwen-2.5-7B to provide the reward signal. We generate rollouts with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct actors under a fixed epoch budget.

Metric. mean@1 accuracy on the corresponding test sets.
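GRPO computes advantages relative to a group of rollouts for the same prompt, standardizing the RM rewards within the group instead of learning a value function. A minimal sketch of that advantage computation (the function name and the `eps` guard are ours, not from the training code):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each rollout's RM reward is standardized
    against the other rollouts for the same prompt."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts scored above the group mean get positive advantages and are reinforced; identical rewards across a group yield zero advantage and no gradient signal.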


Data Scaling

In a controlled study on Llama-3.2-3B, training with our method on FineMath shows steady performance gains on all subsets of RewardBench v1 and v2 over the 11M-token budget, indicating that raw math web text can provide a useful preference-learning signal even without curated comparison data.

Scalability with respect to data size: RewardBench v1/v2 vs token budget

Moreover, we find that the recipe transfers well across model families and scales. When compared within parameter-matched cohorts, all trained RMs rank strongly (top-10 on RewardBench v2 in their respective cohorts) and show similarly consistent competitiveness on RewardBench v1, indicating robust cross-backbone transfer.

| Backbone | RB v1 Score | RB v1 Rank | RB v2 Score | RB v2 Rank |
|---|---|---|---|---|
| Llama-3.2-1B | 60.0 | 3 | 33.2 | 3 |
| Llama-3.2-3B | 65.6 | 4 | 36.2 | 9 |
| Qwen2.5-3B-Inst. | 70.0 | 3 | 46.2 | 8 |
| Qwen2.5-7B-Inst. | 73.8 | 18 | 57.0 | 5 |

Recipe Ablations

We ablate key factors in the training pipeline: batch size, data quality, splitting format, and the centering loss.

Increasing the batch size in our online preference pairing scheme boosts reward-model performance: each update provides \(B\) positives and \(B(B-1)\) cross-pair negatives, so supervision grows nearly quadratically with \(B\). In experiments, larger batches yield higher and more stable RewardBench v2 gains; ID performance keeps improving with \(B\), while OOD benefits mostly saturate beyond \(B=16\).

| Batch size | ID | OOD-Safety | OOD-Others | RBv2 |
|---|---|---|---|---|
| 8 | +9.6 | +3.4 | +0.6 | +1.0 |
| 16 | +10.7 | +8.3 | +7.7 | +6.7 |
| 32 | +16.1 | +5.4 | +7.4 | +7.7 |
RewardBench v2 learning curves across 11M token budget vs. batch size
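The comparison count that drives this near-quadratic growth is easy to verify:

```python
def num_comparisons(B):
    # All-to-all in-batch pairing: each of the B chosen continuations is
    # contrasted against the other B - 1 continuations in the batch.
    return B * (B - 1)
```

Doubling the batch from 8 to 16 roughly quadruples the supervision per update (56 vs. 240 comparisons).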

Higher-quality raw math text yields larger reward-model gains. FineMath-4plus consistently outperforms Infiwebmath-4plus, with larger ID and OOD improvements and a higher average RewardBench v2 boost, while also learning more steadily over the 11M-token budget.

| Dataset | ID | OOD-Safety | OOD-Others | RBv2 |
|---|---|---|---|---|
| Infiwebmath-4plus | +14.2 | +4.4 | +5.1 | +5.3 |
| FineMath-4plus | +16.1 | +5.4 | +7.4 | +7.7 |
RewardBench v2 learning curves across 11M token budget vs. training datasets

Allowing mid-sentence splits (“break sentence”) produces substantially larger RewardBench v2 gains than splitting only at sentence boundaries, improving both ID and OOD performance. This is likely because break-sentence batching creates harder, more contextually confusable in-batch negatives, forcing the reward model to learn finer-grained semantic consistency.

| Splitting | ID | OOD-Safety | OOD-Others | RBv2 |
|---|---|---|---|---|
| preserve sentence | +4.6 | +1.7 | +3.5 | +0.8 |
| break sentence | +16.1 | +5.4 | +7.4 | +7.7 |
RewardBench v2 learning curves vs. data splitting design

Adding the score-centering regularizer stabilizes reward-model training on noisy web text and improves the overall RewardBench v2 peak gain (+7.7 vs. +6.4), whereas removing it yields uneven, less reliable improvements. Centering also makes best-of-\(N\) selection much more dependable, boosting GSM8K (ΔMAP +5.4 vs. +2.8) and avoiding the large-\(N\) degradation seen with uncentered rewards.

| Centering | ID | OOD-Safety | OOD-Others | RBv2 | BoN |
|---|---|---|---|---|---|
| w/o centering | +16.9 | +5.9 | +8.6 | +6.4 | +2.8 |
| w/ centering | +16.4 | +5.7 | +7.4 | +7.7 | +5.4 |
RewardBench v2 learning curves across 11M token budget vs. centering loss

Best-of-N Selection

RMs trained with our recipe provide strong best-of-N selection signals on both in-domain math tasks (GSM8K, MATH500) and out-of-domain settings (Toxigen, IFEval), with BoN improvements becoming clearer and more monotonic as the actor model grows larger and more capable. We also compare against two strong baselines, Skywork-Reward-V2-Llama-3.2-3B and Skywork-Reward-V2-Llama-3.1-8B, trained on 26M high-quality curated preference pairs. While the Skywork models are usually stronger in absolute terms, the gap shrinks with stronger actors, and our FineMath-RM-Qwen-2.5-7B is competitive with (and sometimes surpasses) a Skywork baseline on MATH500 and Toxigen despite using only noisy, unlabeled web text.

BoN scaling curves across tasks and actors

Policy Optimization with GRPO

FineMath-RM-Qwen-2.5-7B trained with our recipe provides a reliable reward signal for actor training, consistently improving mean@1 accuracy on GSM8K and MATH for both Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct actors. It often achieves the best or second-best results against strong off-the-shelf reward-model baselines. The untrained "-Init" control is consistently the weakest, showing that the gains come from the learned reward signal rather than from initialization. While the Skywork reward models remain stronger overall, especially on MATH, our web-trained RM is still competitive despite using no curated preference data.

| Reward model | MATH (Llama-3.2-3B-Instruct actor) | GSM8K (Llama-3.2-3B-Instruct actor) | MATH (Llama-3.1-8B-Instruct actor) | GSM8K (Llama-3.1-8B-Instruct actor) |
|---|---|---|---|---|
| Before training | 0.268 | 0.789 | 0.423 | 0.876 |
| Skywork-Reward-V2-Llama-3.1-8B | 0.419 | 0.833 | 0.447 | 0.882 |
| Skywork-Reward-V2-Llama-3.2-3B | 0.439 | 0.819 | 0.465 | 0.884 |
| FineMath-RM-Qwen-2.5-7B | 0.420 | 0.823 | 0.437 | 0.886 |
| FineMath-RM-Qwen-2.5-7B-Init | 0.399 | 0.812 | 0.428 | 0.879 |

References

  1. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html
  2. Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., Liu, Z., & Sun, M. (2024). UltraFeedback: Boosting Language Models with Scaled AI Feedback. arXiv:2310.01377. https://arxiv.org/abs/2310.01377
  3. Yuan, L., Cui, G., Wang, H., Ding, N., Wang, X., Deng, J., Shan, B., Chen, H., Xie, R., Lin, Y., Liu, Z., Zhou, B., Peng, H., Liu, Z., & Sun, M. (2024). Advancing LLM Reasoning Generalists with Preference Trees. arXiv:2404.02078. https://arxiv.org/abs/2404.02078
  4. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., & others. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862
  5. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35. https://proceedings.neurips.cc/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
  6. Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., & others. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv:2307.15217. https://arxiv.org/abs/2307.15217
  7. Zhang, M. J. Q., Wang, Z., Hwang, J. D., Dong, Y., Delalleau, O., Choi, Y., Choi, E., Ren, X., & Pyatkin, V. (2024). Diverging Preferences: When do Annotators Disagree and do Models Know? arXiv:2410.14632. https://arxiv.org/abs/2410.14632
  8. Gao, Y., Alon, D., & Metzler, D. (2024). Impact of preference noise on the alignment performance of generative language models. arXiv:2404.09824. https://arxiv.org/abs/2404.09824
  9. Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., & Zhang, T. (2024). RLHF Workflow: From Reward Modeling to Online RLHF. arXiv:2405.07863. https://arxiv.org/abs/2405.07863

BibTeX

@misc{anonymous2026scalingrewardmodeling,
  title   = {Scaling Reward Modeling without Human Supervision},
  author  = {Anonymous Authors},
  note    = {Under review},
  year    = {2026}
}
