Scaling Reward Modeling without Human Supervision

1Harvard University    2Cornell University    3Microsoft Research    4Kempner Institute

Motivation

Reinforcement learning from human feedback (RLHF) [1] has been central to aligning modern language models, but it hinges on a critical component: the reward signal. High-quality preference data is expensive to collect and annotate [2], [3], [4]. Moreover, preference labels, whether from human annotators or strong LLM judges, are often subjective and noisy [1], [5], [6], [7], [8], which further amplifies reward-hacking behaviors such as deception or alignment faking. The table below shows the estimated candidate-completion and annotation costs for two representative datasets, UltraFeedback [2] and UltraInteract [3], using gpt-4o, a GPT-4-class model (dataset statistics from [9]).

| Dataset | #Pairs | Prompt Length | Chosen Length | Rejected Length | Completion Cost ($) | Judge Cost ($) |
|---|---|---|---|---|---|---|
| UltraFeedback | 340,025 | 161.5 | 279.5 | 211.1 | 1,444.36 | 725.83 |
| UltraInteract | 161,927 | 507.4 | 396.6 | 416.7 | 1,217.89 | 616.28 |
This motivates a key question: how much of the supervision that reward models typically require can instead be learned in an unsupervised manner? We propose Reward-Based Scaling (RBS), an unsupervised framework that converts web text into implicit preferences by treating natural continuations as "chosen" samples and mismatched in-batch continuations as "rejected" samples, producing online preference data at essentially zero annotation cost.


Method: Continuation-Based Online Preference Pairing

We present an online, continuation-based preference labeling approach that turns raw web text into pairwise supervision for training a reward model (RM)—no curated preference pairs or human labels required.

The key idea is to reuse next-token continuation as an implicit preference signal. For each text sequence, we sample a random breakpoint and split it into a prefix prompt \(p\) and its true continuation \(r\). In a minibatch of \(B\) prefix–continuation pairs \(\{(p_i, r_i)\}_{i=1}^{B}\), we treat \(r_i\) as the chosen response for \(p_i\) and use the other continuations \(\{r_j\}_{j \neq i}\) as rejected in-batch negatives. This creates an all-to-all set of preference comparisons on the fly.

Given an RM score \(s_\theta(p,r)\), we optimize a Bradley–Terry (pairwise logistic) objective with in-batch negatives:

\[ \mathcal{L}_{\text{BT}} = \frac{1}{B}\sum_{i=1}^{B}\frac{1}{B-1}\sum_{j\neq i} -\log \sigma\!\big(s_{\theta}(p_i, r_i) - s_{\theta}(p_i, r_j)\big). \]

We add a quadratic score-centering regularizer to prevent reward drift and overly confident, large-magnitude scores:

\[ \mathcal{L}_{\text{center}} = \mathbb{E}\Big[ s_{\theta}(p_i,r_i)^2 + \frac{1}{B-1}\sum_{j\neq i} s_{\theta}(p_i,r_j)^2 \Big]. \]

The final training objective is \( \mathcal{L}=\mathcal{L}_{\text{BT}} + c\,\mathcal{L}_{\text{center}} \), where \(c\) controls the strength of the centering term.
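As a concrete illustration, the combined objective can be sketched from a score matrix. This is a minimal NumPy version, where `scores[i][j]` stands in for \(s_\theta(p_i, r_j)\); the function name and the score-matrix interface are our assumptions, not the paper's implementation:

```python
import numpy as np

def rbs_loss(scores, c=0.01):
    """In-batch Bradley-Terry loss with quadratic score centering.

    scores: (B, B) matrix with scores[i, j] = s_theta(p_i, r_j);
    the diagonal holds the chosen (true-continuation) scores.
    """
    B = scores.shape[0]
    diag = np.diag(scores)                     # s(p_i, r_i), the chosen scores
    margins = diag[:, None] - scores           # s(p_i, r_i) - s(p_i, r_j)
    off = ~np.eye(B, dtype=bool)               # mask out the j == i terms
    # Bradley-Terry: -log sigmoid(margin), averaged over the B(B-1) pairs
    l_bt = np.mean(np.log1p(np.exp(-margins[off])))
    # Centering: mean squared score over chosen and rejected entries
    l_center = np.mean(diag**2) + np.mean(scores[off]**2)
    return l_bt + c * l_center
```

Averaging uniformly over the off-diagonal entries reproduces the nested \(\frac{1}{B}\sum_i \frac{1}{B-1}\sum_{j\neq i}\) weighting in the displayed loss.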

Workflow diagram showing prefix/suffix splitting and in-batch preference pairing

Protocol

We evaluate our RM training recipe with a unified protocol spanning training, preference-alignment benchmarking, and downstream utility evaluation (best-of-N selection and RL).

Data and RM Backbones

Training corpora. We train RMs on math-heavy web text from FineMath and InfiMM-WebMath under a fixed budget of 11M tokens. Data entries are concatenated into a continuous stream, chunked into fixed-length sequences of \(L\) tokens, and split into prefix–suffix pairs with \(L_1\) prompt tokens and \(L_2\) continuation tokens.
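The chunk-and-split step might look like the sketch below (function and argument names are ours; the breakpoint here is sampled per chunk, matching the random-breakpoint description in the Method section, with a fixed \(L_1/L_2\) split being the constant-breakpoint special case):

```python
import random

def stream_to_pairs(tokens, L, seed=0):
    """Chunk a concatenated token stream into length-L sequences, then split
    each chunk at a sampled breakpoint into a prefix (prompt) and its true
    continuation. A sketch under the assumptions stated above."""
    rng = random.Random(seed)
    pairs = []
    for start in range(0, len(tokens) - L + 1, L):
        chunk = tokens[start:start + L]
        # sample a breakpoint so both prefix and continuation are non-empty
        L1 = rng.randint(1, L - 1)
        pairs.append((chunk[:L1], chunk[L1:]))
    return pairs
```

Each minibatch of such pairs then supplies the chosen continuations and in-batch negatives described in the Method section.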

Backbones. To test robustness across model types and scales, we train RMs from two backbone families:

  • Base: Llama-3.2-1B, Llama-3.2-3B
  • Instruction-tuned: Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct

Preference-Alignment Benchmarks

We evaluate preference alignment on RewardBench v1 and RewardBench v2. As our training data is math-focused, we report performance on RewardBench subsets categorized as in-domain (ID) or out-of-domain (OOD), at the following four levels:

  • Overall: average across all subsets
  • ID: Reasoning subset for RewardBench v1 and Math subset for RewardBench v2
  • OOD-Safety: Safety subsets for both RewardBench v1 and v2
  • OOD-Others: other subsets aggregated

Downstream Utility: Best-of-N Selection

Tasks. ID: MATH500, GSM8K. OOD-Safety: Toxigen, IFEval.

Protocol. We evaluate two trained reward models (FineMath-RM-Llama-3.2-3B and FineMath-RM-Qwen-2.5-7B) using best-of-N: for each input, we sample N candidate solutions from an actor, score each candidate with the RM, and evaluate the highest-scoring selection. We sweep N ∈ {1, 2, 4, 8, 16, 32} using three actors: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct.

Metric. Accuracy as a function of N; we summarize with MAP, the maximum accuracy achieved along the best-of-N curve.
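The selection loop and the MAP summary can be sketched as follows (the helper names `rm_score` and `is_correct` are hypothetical stand-ins for the RM scorer and the task grader):

```python
def best_of_n_curve(candidates, rm_score, is_correct, ns=(1, 2, 4, 8, 16, 32)):
    """Best-of-N accuracy curve over a prompt set, plus its MAP summary.

    candidates: list of per-prompt candidate lists (each at least max(ns) long).
    A sketch of the protocol; helper names are assumptions.
    """
    accs = []
    for n in ns:
        correct = 0
        for cands in candidates:
            pick = max(cands[:n], key=rm_score)  # RM selects among first n samples
            correct += is_correct(pick)
        accs.append(correct / len(candidates))
    return accs, max(accs)  # MAP = maximum accuracy along the curve
```

Because candidates are drawn once and reused as prefixes, the curve for each \(N\) scores the same underlying samples, keeping the sweep cheap.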

Downstream Utility: Policy Optimization

Tasks. To assess how well the RM serves as a learning signal for actor training, we train actors on the MATH and GSM8K training splits and evaluate generalization on the corresponding test splits.

Protocol. We train actors with GRPO using FineMath-RM-Qwen-2.5-7B to provide the reward signal. We generate rollouts with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct actors under a fixed epoch budget.

Metric. mean@1 accuracy on the corresponding test sets.
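GRPO computes advantages relative to a group of rollouts for the same prompt, standardizing the RM rewards within the group instead of learning a value function. A minimal sketch of that advantage computation (the function name and the `eps` guard are ours, not from the training code):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each rollout's RM reward is standardized
    against the other rollouts for the same prompt."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts scored above the group mean get positive advantages and are reinforced; identical rewards across a group yield zero advantage and no gradient signal.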


Data Scaling

In a controlled study on Llama-3.2-3B, training with our method on FineMath shows steady performance gains on all subsets of RewardBench v1 and v2 over the 11M-token budget, indicating that raw math web text can provide a useful preference-learning signal even without curated comparison data.

Scalability with respect to data size: RewardBench v1/v2 vs token budget

Moreover, we find that the recipe transfers well across model families and scales. When compared within parameter-matched cohorts, all trained RMs rank strongly (top-10 on RewardBench v2 in their respective cohorts) and show similarly consistent competitiveness on RewardBench v1, indicating robust cross-backbone transfer.

| Backbone | RB v1 Score | RB v1 Rank | RB v2 Score | RB v2 Rank |
|---|---|---|---|---|
| Llama-3.2-1B | 60.0 | 3 | 33.2 | 3 |
| Llama-3.2-3B | 65.6 | 4 | 36.2 | 9 |
| Qwen2.5-3B-Inst. | 70.0 | 3 | 46.2 | 8 |
| Qwen2.5-7B-Inst. | 73.8 | 18 | 57.0 | 5 |

Recipe Ablations

We ablate key factors in the training pipeline: batch size, data quality, splitting format, and the centering loss.

Increasing the batch size in our online preference pairing scheme boosts reward-model performance: each update provides \(B\) positives and \(B(B-1)\) cross-pair negatives, so supervision grows nearly quadratically with \(B\). In experiments, larger batches yield higher and more stable RewardBench v2 gains; ID performance keeps improving with \(B\), while OOD benefits mostly saturate beyond \(B=16\).

| Batch size | ID | OOD-Safety | OOD-Others | RBv2 |
|---|---|---|---|---|
| 8 | +9.6 | +3.4 | +0.6 | +1.0 |
| 16 | +10.7 | +8.3 | +7.7 | +6.7 |
| 32 | +16.1 | +5.4 | +7.4 | +7.7 |
RewardBench v2 learning curves across 11M token budget vs. batch size
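The comparison count that drives this near-quadratic growth is easy to verify:

```python
def num_comparisons(B):
    # All-to-all in-batch pairing: each of the B chosen continuations is
    # contrasted against the other B - 1 continuations in the batch.
    return B * (B - 1)
```

Doubling the batch from 8 to 16 roughly quadruples the supervision per update (56 vs. 240 comparisons).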

Higher-quality raw math text yields larger reward-model gains. FineMath-4plus consistently outperforms Infiwebmath-4plus, with larger ID and OOD improvements and a higher average RewardBench v2 boost, while also learning more steadily over the 11M-token budget.

| Dataset | ID | OOD-Safety | OOD-Others | RBv2 |
|---|---|---|---|---|
| Infiwebmath-4plus | +14.2 | +4.4 | +5.1 | +5.3 |
| FineMath-4plus | +16.1 | +5.4 | +7.4 | +7.7 |
RewardBench v2 learning curves across 11M token budget vs. training datasets

Allowing mid-sentence splits (“break sentence”) produces substantially larger RewardBench v2 gains than splitting only at sentence boundaries, improving both ID and OOD performance. This is likely because break-sentence batching creates harder, more contextually confusable in-batch negatives, forcing the reward model to learn finer-grained semantic consistency.

| Splitting | ID | OOD-Safety | OOD-Others | RBv2 |
|---|---|---|---|---|
| preserve sentence | +4.6 | +1.7 | +3.5 | +0.8 |
| break sentence | +16.1 | +5.4 | +7.4 | +7.7 |
RewardBench v2 learning curves vs. data splitting design

Adding the score-centering regularizer stabilizes reward-model training on noisy web text and improves the overall RewardBench v2 peak gain (+7.7 vs. +6.4), whereas removing it yields uneven, less reliable improvements. Centering also makes best-of-\(N\) selection much more dependable, boosting GSM8K (ΔMAP +5.4 vs. +2.8) and avoiding the large-\(N\) degradation seen with uncentered rewards.

| Centering | ID | OOD-Safety | OOD-Others | RBv2 | BoN |
|---|---|---|---|---|---|
| w/o centering | +16.9 | +5.9 | +8.6 | +6.4 | +2.8 |
| w/ centering | +16.4 | +5.7 | +7.4 | +7.7 | +5.4 |
RewardBench v2 learning curves across 11M token budget vs. centering loss

Best-of-N Selection

RMs trained with our recipe provide strong best-of-N selection signals on both in-domain math tasks (GSM8K, MATH500) and out-of-domain settings (Toxigen, IFEval), with BoN improvements becoming clearer and more monotonic as the actor model grows larger and more capable. We also compare against two strong baselines, Skywork-Reward-V2-Llama-3.2-3B and Skywork-Reward-V2-Llama-3.1-8B, trained on 26M high-quality curated preference pairs. While the Skywork models are usually stronger in absolute terms, the gap shrinks with stronger actors, and our FineMath-RM-Qwen-2.5-7B is competitive with (and sometimes surpasses) a Skywork baseline on MATH500 and Toxigen despite using only noisy, unlabeled web text.

BoN scaling curves across tasks and actors

Policy Optimization with GRPO

FineMath-RM-Qwen-2.5-7B trained with our recipe provides a reliable reward signal for actor training, consistently improving mean@1 accuracy on GSM8K and MATH for both Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct actors. It often achieves the best or second-best results against strong off-the-shelf reward-model baselines. The untrained "-Init" control is consistently the weakest, showing that the gains come from the learned reward signal rather than from initialization. While the Skywork reward models remain stronger overall, especially on MATH, our web-trained RM is still competitive despite using no curated preference data.

| Reward model | MATH (Llama-3.2-3B-Instruct actor) | GSM8K (Llama-3.2-3B-Instruct actor) | MATH (Llama-3.1-8B-Instruct actor) | GSM8K (Llama-3.1-8B-Instruct actor) |
|---|---|---|---|---|
| Before training | 0.268 | 0.789 | 0.423 | 0.876 |
| Skywork-Reward-V2-Llama-3.1-8B | 0.419 | 0.833 | 0.447 | 0.882 |
| Skywork-Reward-V2-Llama-3.2-3B | 0.439 | 0.819 | 0.465 | 0.884 |
| FineMath-RM-Qwen-2.5-7B | 0.420 | 0.823 | 0.437 | 0.886 |
| FineMath-RM-Qwen-2.5-7B-Init | 0.399 | 0.812 | 0.428 | 0.879 |

References

  1. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html
  2. Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., Liu, Z., & Sun, M. (2024). UltraFeedback: Boosting Language Models with Scaled AI Feedback. arXiv:2310.01377. https://arxiv.org/abs/2310.01377
  3. Yuan, L., Cui, G., Wang, H., Ding, N., Wang, X., Deng, J., Shan, B., Chen, H., Xie, R., Lin, Y., Liu, Z., Zhou, B., Peng, H., Liu, Z., & Sun, M. (2024). Advancing LLM Reasoning Generalists with Preference Trees. arXiv:2404.02078. https://arxiv.org/abs/2404.02078
  4. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., & others. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862
  5. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35. https://proceedings.neurips.cc/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
  6. Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., & others. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv:2307.15217. https://arxiv.org/abs/2307.15217
  7. Zhang, M. J. Q., Wang, Z., Hwang, J. D., Dong, Y., Delalleau, O., Choi, Y., Choi, E., Ren, X., & Pyatkin, V. (2024). Diverging Preferences: When do Annotators Disagree and do Models Know? arXiv:2410.14632. https://arxiv.org/abs/2410.14632
  8. Gao, Y., Alon, D., & Metzler, D. (2024). Impact of preference noise on the alignment performance of generative language models. arXiv:2404.09824. https://arxiv.org/abs/2404.09824
  9. Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., & Zhang, T. (2024). RLHF Workflow: From Reward Modeling to Online RLHF. arXiv:2405.07863. https://arxiv.org/abs/2405.07863

BibTeX

@misc{anonymous2026scalingrewardmodeling,
  title   = {Scaling Reward Modeling without Human Supervision},
  author  = {Anonymous Authors},
  note    = {Under review},
  year    = {2026}
}
