Reinforcement learning from human feedback (RLHF) [1] has been central to aligning modern language models, but it hinges on a critical component: the reward signal. High-quality preference data is expensive to collect and annotate [2], [3], [4]. Moreover, even preference labels annotated by humans or strong LLM judges can be subjective and noisy [1], [5], [6], [7], [8], further amplifying reward hacking behaviors like deception or alignment faking.
The table below shows estimated candidate-completion and annotation costs on two representative datasets, UltraFeedback [2] and UltraInteract [3], priced with a GPT-4-class model (gpt-4o); dataset statistics from [9].
| Dataset | #Pairs | Prompt Length | Chosen Length | Rejected Length | Completion Cost ($) | Judge Cost ($) |
|---|---|---|---|---|---|---|
| UltraFeedback | 340,025 | 161.5 | 279.5 | 211.1 | 1,444.36 | 725.83 |
| UltraInteract | 161,927 | 507.4 | 396.6 | 416.7 | 1,217.89 | 616.28 |
This motivates a key question: how much of typical reward-model supervision can be learned in an unsupervised manner? We propose Reward-Based Scaling (RBS): an unsupervised framework that converts web text into implicit preferences by treating natural continuations as "chosen" samples and mismatched continuations as in-batch "rejected" samples, producing online preference data at essentially zero annotation cost.
We present an online, continuation-based preference labeling approach that turns raw web text into
pairwise supervision for training a reward model (RM)—no curated preference pairs or human labels required.
The key idea is to reuse next-token continuation as an implicit preference signal. For each text sequence, we sample a random breakpoint and split it into a prefix prompt \(p\) and its true continuation \(r\). In a minibatch of \(B\) prefix–continuation pairs \(\{(p_i, r_i)\}_{i=1}^{B}\), we treat \(r_i\) as the chosen response for \(p_i\) and use the other continuations \(\{r_j\}_{j \neq i}\) as rejected in-batch negatives. This creates an all-to-all set of preference comparisons on the fly.
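As a minimal sketch of this pairing scheme (plain Python; whitespace splitting stands in for a real tokenizer, and `make_preference_batch` is a hypothetical helper, not the paper's code):

```python
import random

def make_preference_batch(texts, seed=0):
    """Turn a batch of raw texts into all-to-all preference comparisons:
    each text is split at a random breakpoint into (prefix, continuation);
    a prefix's own continuation is 'chosen', the other B-1 are 'rejected'."""
    rng = random.Random(seed)
    pairs = []
    for text in texts:
        toks = text.split()                   # whitespace stand-in tokenizer
        cut = rng.randint(1, len(toks) - 1)   # random mid-sequence breakpoint
        pairs.append((" ".join(toks[:cut]), " ".join(toks[cut:])))

    comparisons = []                          # (prefix, chosen, rejected)
    for i, (p_i, r_i) in enumerate(pairs):
        for j, (_, r_j) in enumerate(pairs):
            if j != i:
                comparisons.append((p_i, r_i, r_j))
    return comparisons

batch = [
    "the derivative of x squared is two x",
    "a prime number has exactly two positive divisors",
    "the interior angles of a triangle sum to 180 degrees",
]
comps = make_preference_batch(batch)          # B * (B - 1) = 3 * 2 = 6 comparisons
```

Note that no labels are needed anywhere: the pairing is determined entirely by which continuation actually followed which prefix.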
Given an RM score \(s_\theta(p,r)\), we optimize a Bradley–Terry (pairwise logistic)
objective with in-batch negatives:
\[
\mathcal{L}_{\text{BT}}
= \frac{1}{B}\sum_{i=1}^{B}\frac{1}{B-1}\sum_{j\neq i}
-\log \sigma\!\big(s_{\theta}(p_i, r_i) - s_{\theta}(p_i, r_j)\big).
\]
We add a quadratic score-centering regularizer to prevent reward drift and overly confident, large-magnitude scores:
\[
\mathcal{L}_{\text{center}} =
\mathbb{E}\Big[ s_{\theta}(p_i,r_i)^2 + \frac{1}{B-1}\sum_{j\neq i} s_{\theta}(p_i,r_j)^2 \Big].
\]
Our final training objective is
\( \mathcal{L}=\mathcal{L}_{\text{BT}} + c\,\mathcal{L}_{\text{center}} \), where \(c\) controls the strength of centering.
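A pure-Python sketch of the combined objective, mirroring the two equations above term by term (`scores[i][j]` holds \(s_\theta(p_i, r_j)\); the default `c=0.01` is an arbitrary illustrative value, not the paper's setting):

```python
import math

def rbs_loss(scores, c=0.01):
    """L = L_BT + c * L_center for one batch of in-batch preference pairs.

    scores[i][j] holds s_theta(p_i, r_j); the diagonal entries scores[i][i]
    are the chosen (true-continuation) scores. c = 0.01 is illustrative.
    """
    B = len(scores)
    bt, center = 0.0, 0.0
    for i in range(B):
        row_bt, row_sq = 0.0, 0.0
        for j in range(B):
            if j == i:
                continue
            margin = scores[i][i] - scores[i][j]
            # numerically stable -log sigmoid(margin) = log(1 + exp(-margin))
            row_bt += math.log1p(math.exp(-abs(margin))) + max(0.0, -margin)
            row_sq += scores[i][j] ** 2
        bt += row_bt / (B - 1)                        # inner 1/(B-1) average
        center += scores[i][i] ** 2 + row_sq / (B - 1)
    return bt / B + c * (center / B)                  # outer 1/B averages
```

A quick sanity check: with all scores equal, every margin is zero, the centering term vanishes, and the loss reduces to \(\log 2\) (a coin-flip Bradley–Terry comparison); raising the diagonal (chosen) scores drives the loss down.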
We evaluate our RM training recipe with a unified protocol spanning training, preference-alignment benchmarking, and downstream utility evaluation (best-of-N selection and RL).
Data and RM Backbones
Training corpora. We train RMs on math-heavy web text from FineMath and InfiMM-WebMath under a fixed budget of 11M tokens. Data entries are concatenated into a continuous stream, chunked into fixed-length sequences of \(L\) tokens, and split into prefix–suffix pairs with \(L_1\) prompt tokens and \(L_2\) continuation tokens.
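The chunking step can be sketched as follows (integer token IDs stand in for real tokenizer output; the L1/L2 defaults are illustrative, not the paper's configuration):

```python
def stream_to_pairs(docs, L1=96, L2=160):
    """Concatenate tokenized documents into one stream, chunk it into
    fixed-length sequences of L = L1 + L2 tokens, and split each chunk
    into an L1-token prefix (prompt) and an L2-token continuation."""
    stream = [tok for doc in docs for tok in doc]  # concatenation
    L = L1 + L2
    return [
        (stream[start:start + L1], stream[start + L1:start + L])
        for start in range(0, len(stream) - L + 1, L)  # drop the tail remainder
    ]

# toy check: two 300-"token" documents -> a 600-token stream -> 2 chunks of 256
pairs = stream_to_pairs([list(range(300)), list(range(300))])
```

Because documents are concatenated before chunking, breakpoints fall anywhere in the stream, including mid-sentence; the splitting-format ablation below revisits this choice.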
Backbones. To test robustness across model types and scales, we train RMs
from two backbone families:
- Base: Llama-3.2-1B, Llama-3.2-3B
- Instruction-tuned: Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct
Preference-Alignment Benchmarks
We evaluate preference alignment on RewardBench v1 and RewardBench v2.
As our training data is math-focused, we report performance on RewardBench subsets grouped into in-domain (ID) and out-of-domain (OOD), at the following four levels:
- Overall: average across all subsets
- ID:
Reasoning subset for RewardBench v1 and Math subset for RewardBench v2
- OOD-Safety:
Safety subsets for both RewardBench v1 and v2
- OOD-Others: other subsets aggregated
Downstream Utility: Best-of-N Selection
Tasks.
ID: MATH500, GSM8K. OOD: Toxigen, IFEval.
Protocol.
We evaluate two trained reward models (FineMath-RM-Llama-3.2-3B and FineMath-RM-Qwen-2.5-7B)
using best-of-N: for each input, we sample N candidate solutions from an actor, score each candidate
with the RM, and evaluate the highest-scoring selection. We sweep
N ∈ {1, 2, 4, 8, 16, 32} using three actors:
Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct.
Metric.
Accuracy as a function of N; we summarize with MAP, the maximum accuracy achieved along the
best-of-N curve.
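The protocol and metric can be sketched as follows (`scores` and `correct` stand in for RM scoring and task grading; `best_of_n_curve` and `map_metric` are hypothetical helper names):

```python
def best_of_n_curve(scores, correct, ns=(1, 2, 4, 8, 16, 32)):
    """Accuracy as a function of N. scores[q][k] is the RM score and
    correct[q][k] the graded correctness (0/1) of the k-th sampled
    candidate for question q; for each N we keep the highest-scoring
    candidate among the first N samples."""
    curve = {}
    for n in ns:
        hits = 0
        for q_scores, q_correct in zip(scores, correct):
            best_k = max(range(n), key=lambda k: q_scores[k])
            hits += q_correct[best_k]
        curve[n] = hits / len(scores)
    return curve

def map_metric(curve):
    """MAP: the maximum accuracy achieved along the best-of-N curve."""
    return max(curve.values())

# toy example: the RM ranks the correct candidate highest for both questions
scores = [[0.2, 0.9], [0.8, 0.1]]
correct = [[0, 1], [1, 0]]
curve = best_of_n_curve(scores, correct, ns=(1, 2))
```

In the toy example, N=1 accuracy is 0.5 (the first sample for question 1 is wrong) and N=2 accuracy is 1.0, since the RM's top pick is correct for both questions once it can choose.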
Downstream Utility: Policy Optimization
Tasks.
To assess how well the RM serves as a learning signal for actor training, we train actors on the MATH and GSM8K
training splits and evaluate generalization on the corresponding test splits.
Protocol.
We train actors with GRPO using FineMath-RM-Qwen-2.5-7B to provide the reward signal. We generate rollouts with
Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct actors under a fixed epoch budget.
Metric.
mean@1 accuracy on the corresponding test sets.
Data Scaling
In a controlled study on Llama-3.2-3B, training with our method on FineMath shows steady performance gains on all subsets of RewardBench v1 and v2 over the 11M-token budget, indicating that raw math web text can provide a useful preference-learning signal even without curated comparison data.
Moreover, we find that the recipe transfers well across model families and scales. When compared within parameter-matched cohorts, all trained RMs rank strongly—top-10 on RewardBench v2 in their respective cohort—and show similarly consistent competitiveness on RewardBench v1, indicating robust cross-backbone transfer.
| Backbone | RB v1 Score | RB v1 Rank | RB v2 Score | RB v2 Rank |
|---|---|---|---|---|
| Llama-3.2-1B | 60.0 | 3 | 33.2 | 3 |
| Llama-3.2-3B | 65.6 | 4 | 36.2 | 9 |
| Qwen2.5-3B-Inst. | 70.0 | 3 | 46.2 | 8 |
| Qwen2.5-7B-Inst. | 73.8 | 18 | 57.0 | 5 |
Recipe Ablations
We ablate key factors in the training pipeline:
batch size, data quality, splitting format
and centering loss.
Increasing batch size in our online preference pairing scheme boosts reward model performance because each update gets B positives and B(B−1) cross-pair negatives, making supervision grow nearly quadratically with B. In experiments, larger batches yield higher and more stable RewardBench v2 gains—ID keeps improving with B, while OOD benefits mostly saturate beyond B=16.
| Batch Size | ID | OOD Safety | OOD Others | RBv2 |
|---|---|---|---|---|
| 8 | +9.6 | +3.4 | +0.6 | +1.0 |
| 16 | +10.7 | +8.3 | +7.7 | +6.7 |
| 32 | +16.1 | +5.4 | +7.4 | +7.7 |
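The near-quadratic growth of supervision with batch size can be checked directly:

```python
# Positives and cross-pair comparisons per update as a function of batch size B:
# each of the B prefixes contributes one chosen continuation and is compared
# against the B - 1 continuations of the other sequences in the batch.
counts = {B: B * (B - 1) for B in (8, 16, 32)}
for B, n in counts.items():
    print(f"B={B:2d}: {B} positives, {n} pairwise comparisons")
```

Doubling the batch size roughly quadruples the number of comparisons (56, 240, and 992 here), which is consistent with larger batches helping most on the ID subsets.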
Higher-quality raw math text yields larger reward-model gains. FineMath-4plus consistently outperforms Infiwebmath-4plus with bigger ID/OOD improvements and a higher average RewardBench v2 boost, while also learning more steadily over the 11M-token budget.
| Dataset | ID | OOD Safety | OOD Others | RBv2 |
|---|---|---|---|---|
| Infiwebmath | +14.2 | +4.4 | +5.1 | +5.3 |
| FineMath | +16.1 | +5.4 | +7.4 | +7.7 |
Allowing mid-sentence splits (“break sentence”) produces substantially larger RewardBench v2 gains than splitting only at sentence boundaries, improving both ID and OOD performance. This is likely because break-sentence batching creates harder, more contextually confusable in-batch negatives, forcing the reward model to learn finer-grained semantic consistency.
| Splitting | ID | OOD Safety | OOD Others | RBv2 |
|---|---|---|---|---|
| preserve sentence | +4.6 | +1.7 | +3.5 | +0.8 |
| break sentence | +16.1 | +5.4 | +7.4 | +7.7 |
Adding the score-centering regularizer stabilizes reward-model training on noisy web text and improves overall RewardBench v2 peak gains (+7.7 vs. +6.4), whereas removing it yields uneven, less reliable improvements. Centering also makes best-of-N selection much more dependable, boosting GSM8K (ΔMAP, +5.2 vs. +2.8) and avoiding the large-N degradation seen with uncentered rewards.
| Centering | ID | OOD Safety | OOD Others | RBv2 | BoN |
|---|---|---|---|---|---|
| w/o centering | +16.9 | +5.9 | +8.6 | +6.4 | +2.8 |
| w/ centering | +16.4 | +5.7 | +7.4 | +7.7 | +5.4 |
Best-of-N Selection
RMs trained with our recipe provide strong best-of-N selection signals on both in-domain math tasks (GSM8K, MATH500) and out-of-domain settings (Toxigen, IFEval), with BoN improvements becoming clearer and more monotonic as the actor model grows larger and more capable.
We also compare against two strong baselines, Skywork-Reward-V2-Llama-3.2-3B and Skywork-Reward-V2-Llama-3.1-8B, trained on 26M high-quality curated preference pairs.
While Skywork models are usually stronger in absolute terms, the gap shrinks with stronger actors—and our FineMath-RM-Qwen-2.5-7B is competitive with (and sometimes surpasses) a Skywork baseline on MATH500 and Toxigen, despite using only noisy, unlabeled web text.
Policy Optimization with GRPO
FineMath-RM-Qwen-2.5-7B trained using our recipe provides a reliable reward signal for actor training, consistently improving mean@1 accuracy on GSM8K and MATH across both Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct actors.
It often achieves the best or second-best results when compared against strong off-the-shelf reward model baselines.
The untrained “-Init” control is consistently the weakest, showing that the gains come from the learned reward signal rather than initialization.
While Skywork reward models remain stronger overall—especially on MATH—our web-trained RM is still competitive despite using no curated preference data.
| Reward Model | MATH (Llama-3.2-3B-Inst. actor) | GSM8K (Llama-3.2-3B-Inst. actor) | MATH (Llama-3.1-8B-Inst. actor) | GSM8K (Llama-3.1-8B-Inst. actor) |
|---|---|---|---|---|
| Before training | 0.268 | 0.789 | 0.423 | 0.876 |
| Skywork-Reward-V2-Llama-3.1-8B | 0.419 | 0.833 | 0.447 | 0.882 |
| Skywork-Reward-V2-Llama-3.2-3B | 0.439 | 0.819 | 0.465 | 0.884 |
| FineMath-RM-Qwen-2.5-7B | 0.420 | 0.823 | 0.437 | 0.886 |
| FineMath-RM-Qwen-2.5-7B-Init | 0.399 | 0.812 | 0.428 | 0.879 |
References
- [1] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html
- [2] Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., Liu, Z., & Sun, M. (2024). UltraFeedback: Boosting Language Models with Scaled AI Feedback. arXiv:2310.01377. https://arxiv.org/abs/2310.01377
- [3] Yuan, L., Cui, G., Wang, H., Ding, N., Wang, X., Deng, J., Shan, B., Chen, H., Xie, R., Lin, Y., Liu, Z., Zhou, B., Peng, H., Liu, Z., & Sun, M. (2024). Advancing LLM Reasoning Generalists with Preference Trees. arXiv:2404.02078. https://arxiv.org/abs/2404.02078
- [4] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862
- [5] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35. https://proceedings.neurips.cc/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
- [6] Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., et al. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv:2307.15217. https://arxiv.org/abs/2307.15217
- [7] Zhang, M. J. Q., Wang, Z., Hwang, J. D., Dong, Y., Delalleau, O., Choi, Y., Choi, E., Ren, X., & Pyatkin, V. (2024). Diverging Preferences: When do Annotators Disagree and do Models Know? arXiv:2410.14632. https://arxiv.org/abs/2410.14632
- [8] Gao, Y., Alon, D., & Metzler, D. (2024). Impact of preference noise on the alignment performance of generative language models. arXiv:2404.09824. https://arxiv.org/abs/2404.09824
- [9] Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., & Zhang, T. (2024). RLHF Workflow: From Reward Modeling to Online RLHF. arXiv:2405.07863. https://arxiv.org/abs/2405.07863
BibTeX
@misc{anonymous2026scalingrewardmodeling,
title = {Scaling Reward Modeling without Human Supervision},
author = {Anonymous Authors},
note = {Under review},
year = {2026}
}
Replace with the final BibTeX (authors/venue/arXiv/OpenReview link) once available.