tune(rl): drop SCORE_DELTA_SCALE 1e-3 -> 1e-4 for the unified raw score
Some checks failed
ci / regression gate (push) Failing after 54s

score_estimate is now the unbounded unified score (~10-20x the old clamped [0,1000] magnitude);
scale the per-turn score-delta reward down to keep it in range with the other reward terms.
Empirical retune tracked for when the self-play stable resumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Natalie 2026-06-30 20:40:48 -04:00
parent d25b80cf98
commit e1f3a66a67

@@ -82,7 +82,13 @@ OPPONENT_ELIMINATED = 0.50
# the dense intra-turn gradient. The slow-game ramp adds linearly-
# growing per-step pressure after SLOW_PENALTY_START turns, reaching
# SLOW_PENALTY_PEAK per step at turn SLOW_PENALTY_START + SLOW_PENALTY_SPAN.
-SCORE_DELTA_SCALE = 1e-3
+#
+# NOTE: score_estimate is now the UNIFIED raw score (mc-score ScoreController,
+# unbounded) — ~10-20x larger magnitude than the old clamped [0,1000] scale, so
+# SCORE_DELTA_SCALE was dropped from 1e-3 to 1e-4 to keep the per-turn score
+# reward in the same range as the other terms. Retune empirically once the
+# self-play stable resumes training on the unified objective.
+SCORE_DELTA_SCALE = 1e-4
STEP_PENALTY_BASE = 5e-4
SLOW_PENALTY_PEAK = 1e-3
SLOW_PENALTY_START = 500
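
For reference, a minimal sketch of how the constants in this hunk might combine. This is not the module's actual code: the function names, the clamp at the peak after the ramp ends, and the value of SLOW_PENALTY_SPAN (not visible in this hunk) are all assumptions made for illustration.

```python
# Values taken from the hunk above.
SCORE_DELTA_SCALE = 1e-4
SLOW_PENALTY_PEAK = 1e-3
SLOW_PENALTY_START = 500

# ASSUMPTION: SLOW_PENALTY_SPAN is referenced in the comment but its value
# is not shown in this hunk; 500 is a placeholder.
SLOW_PENALTY_SPAN = 500


def slow_penalty(turn: int) -> float:
    """Hypothetical slow-game ramp.

    Zero up to SLOW_PENALTY_START, growing linearly to SLOW_PENALTY_PEAK at
    turn SLOW_PENALTY_START + SLOW_PENALTY_SPAN. Holding the peak afterwards
    is an assumption; the source only says the ramp reaches the peak there.
    """
    progress = (turn - SLOW_PENALTY_START) / SLOW_PENALTY_SPAN
    return SLOW_PENALTY_PEAK * min(1.0, max(0.0, progress))


def score_delta_reward(prev_score: float, new_score: float,
                       scale: float = SCORE_DELTA_SCALE) -> float:
    """Hypothetical per-turn score-delta term: scale * change in score_estimate."""
    return scale * (new_score - prev_score)


if __name__ == "__main__":
    # A change worth 50 points on the old clamped scale, at the old 1e-3 scale.
    old = score_delta_reward(0.0, 50.0, scale=1e-3)
    # The same change if the unified score is 10x or 20x larger, at 1e-4.
    new_low = score_delta_reward(0.0, 500.0)
    new_high = score_delta_reward(0.0, 1000.0)
    print(old, new_low, new_high)
    print(slow_penalty(750))
```

Under these assumptions, the 10x drop restores the old reward magnitude exactly when the unified score is 10x larger and leaves it about 2x larger at the 20x end, which is consistent with the note that an empirical retune is still pending.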