tune(rl): drop SCORE_DELTA_SCALE 1e-3 -> 1e-4 for the unified raw score
Some checks failed
ci / regression gate (push) Failing after 54s
score_estimate is now the unbounded unified score (~10-20x the old clamped [0,1000] magnitude); scale the per-turn score-delta reward down to keep it in range with the other reward terms. Empirical retune tracked for when the self-play stable resumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
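The scaling argument in the message can be checked with a little arithmetic. The sketch below is illustrative only: `score_delta_reward` and the sample delta of 50 points are assumptions made for the example, not code or data from this repository. Only the two scale values and the 10-20x magnitude estimate come from the commit.

```python
# Illustrative only: the reward helper and the sample delta are assumptions,
# not code from this repository.
OLD_SCALE = 1e-3  # SCORE_DELTA_SCALE before this commit
NEW_SCALE = 1e-4  # SCORE_DELTA_SCALE after this commit


def score_delta_reward(score_delta: float, scale: float) -> float:
    """Hypothetical per-turn reward term: a scaled change in score_estimate."""
    return scale * score_delta


# Assumed example: a turn that moved the old clamped score by 50 points.
old_delta = 50.0
old_reward = score_delta_reward(old_delta, OLD_SCALE)

# The commit message puts the unified score at roughly 10-20x the old magnitude.
for factor in (10, 20):
    new_reward = score_delta_reward(old_delta * factor, NEW_SCALE)
    print(
        f"{factor}x raw score: old={old_reward:.3f} new={new_reward:.3f} "
        f"ratio={new_reward / old_reward:.1f}"
    )
```

Under these assumptions the 10x drop restores the old magnitude exactly at the low end of the estimate and leaves the term about twice as large at the high end, which is consistent with the message flagging an empirical retune.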
parent d25b80cf98
commit e1f3a66a67
1 changed file with 7 additions and 1 deletion
@@ -82,7 +82,13 @@ OPPONENT_ELIMINATED = 0.50
 # the dense intra-turn gradient. The slow-game ramp adds linearly-
 # growing per-step pressure after SLOW_PENALTY_START turns, reaching
 # SLOW_PENALTY_PEAK per step at turn SLOW_PENALTY_START + SLOW_PENALTY_SPAN.
-SCORE_DELTA_SCALE = 1e-3
+#
+# NOTE: score_estimate is now the UNIFIED raw score (mc-score ScoreController,
+# unbounded) — ~10-20x larger magnitude than the old clamped [0,1000] scale, so
+# SCORE_DELTA_SCALE was dropped from 1e-3 to 1e-4 to keep the per-turn score
+# reward in the same range as the other terms. Retune empirically once the
+# self-play stable resumes training on the unified objective.
+SCORE_DELTA_SCALE = 1e-4
 STEP_PENALTY_BASE = 5e-4
 SLOW_PENALTY_PEAK = 1e-3
 SLOW_PENALTY_START = 500
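For readers without the rest of the file, the slow-game ramp described in the context comment could look roughly like the sketch below. This is a guess at the shape, not the repository's implementation: the value of `SLOW_PENALTY_SPAN` is not visible in this hunk and is assumed here, as are the helper names `slow_penalty` and `step_penalty`, the choice to hold the penalty flat after the peak, and the choice to add the ramp on top of `STEP_PENALTY_BASE`.

```python
STEP_PENALTY_BASE = 5e-4   # from the diff
SLOW_PENALTY_PEAK = 1e-3   # from the diff
SLOW_PENALTY_START = 500   # from the diff
SLOW_PENALTY_SPAN = 500    # ASSUMED value; not visible in this hunk


def slow_penalty(turn: int) -> float:
    """Assumed ramp: 0 until START, linear up to PEAK at START + SPAN, then flat."""
    progress = (turn - SLOW_PENALTY_START) / SLOW_PENALTY_SPAN
    # Clamp to [0, 1] so the penalty is zero before the ramp and capped after it.
    return SLOW_PENALTY_PEAK * min(1.0, max(0.0, progress))


def step_penalty(turn: int) -> float:
    """Assumed total per-step penalty: base cost plus the slow-game ramp."""
    return STEP_PENALTY_BASE + slow_penalty(turn)


if __name__ == "__main__":
    for turn in (0, 500, 750, 1000, 2000):
        print(turn, step_penalty(turn))
```

With these per-step terms on the order of 1e-3, a score-delta term scaled by 1e-4 stays comparable only if typical per-turn deltas of the unified score are in the tens; that is the kind of check the planned empirical retune would settle.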