
Gradient Descent

Watching a ball roll downhill — and feeling why momentum matters.

Optimization is the engine of machine learning. Gradient descent is how a model finds its way to a good solution — not by magic, but by walking downhill on a landscape of error. This visualization lets you watch that walk happen, see how step size affects the journey, and feel why momentum transforms a stumbling descent into a flowing path. The Rosenbrock function is the terrain: a deceptively simple valley that has frustrated optimizers for decades.


Learning rate: 0.002
Momentum: 0.9

Click anywhere on the surface to start a new optimization path from that point.

The Rosenbrock Function

The terrain comes from a classic optimization benchmark: the Rosenbrock function, defined as f(x, y) = (a - x)² + b(y - x²)². With parameters a=1 and b=100, it creates a narrow curved valley that winds toward the minimum at (1, 1). The global optimum sits inside a parabolic valley whose walls steepen dramatically away from the center. This geometry makes it a perfect test case: gradient descent must navigate not just downhill, but along a curved valley floor.

function rosenbrock(x, y) {
  const a = 1;
  const b = 100;
  return Math.pow(a - x, 2) + b * Math.pow(y - x * x, 2);
}

The function's deceptive simplicity hides why it's difficult. The minimum lies in a narrow groove that winds through parameter space. Approaching from most starting points, the gradient points steeply across the valley rather than along it — you can see this in how the contour lines squeeze together. A naive descent will oscillate across the valley walls, making slow progress along the valley floor.
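The asymmetry is easy to quantify. A minimal sketch (plain JavaScript; the sample points are my own choice) compares a 0.1 step along the valley floor with a 0.1 step up the wall, starting from the minimum at (1, 1):

```javascript
function rosenbrock(x, y) {
  const a = 1;
  const b = 100;
  return Math.pow(a - x, 2) + b * Math.pow(y - x * x, 2);
}

// Along the valley floor (y = x^2), a 0.1 step away from the minimum
// barely raises the error: only the (a - x)^2 term contributes.
const alongFloor = rosenbrock(1.1, 1.21); // ≈ 0.01

// Straight up the wall, the b-weighted term dominates.
const upWall = rosenbrock(1, 1.1); // ≈ 1.0

console.log(alongFloor, upWall, upWall / alongFloor);
```

The same displacement costs roughly 100 times more error across the valley than along it, which is exactly the squeeze the contour lines show.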

Gradient Computation

At each point, the optimizer needs to know which direction is downhill. The gradient — a vector of partial derivatives — points in the direction of steepest ascent; negating it gives the direction of steepest descent. For Rosenbrock, the gradient is computed analytically: ∂f/∂x = -2(a - x) - 4bx(y - x²) and ∂f/∂y = 2b(y - x²).

function gradient(x, y) {
  const a = 1;
  const b = 100;
  const dfdx = -2 * (a - x) - 4 * b * x * (y - x * x);
  const dfdy = 2 * b * (y - x * x);
  return [dfdx, dfdy];
}

Computing the gradient can be expensive in high dimensions, but it is essential. In this 2D visualization, the gradient vector is the direction the particle would move without momentum — straight down the steepest local slope. But raw gradient descent has a problem: it doesn't build velocity. Each step starts fresh, discarding the motion of the step before.
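The cost claim can be made concrete. A central finite-difference gradient needs two function evaluations per dimension, 2n in total, while the analytic gradient is a single closed-form pass. A sketch checking the two against each other (the helper name numericGradient and the test point are mine):

```javascript
function rosenbrock(x, y) {
  const a = 1, b = 100;
  return Math.pow(a - x, 2) + b * Math.pow(y - x * x, 2);
}

function gradient(x, y) {
  const a = 1, b = 100;
  return [-2 * (a - x) - 4 * b * x * (y - x * x), 2 * b * (y - x * x)];
}

// Central differences: 2 evaluations per coordinate, 2n in n dimensions.
function numericGradient(x, y, h = 1e-6) {
  return [
    (rosenbrock(x + h, y) - rosenbrock(x - h, y)) / (2 * h),
    (rosenbrock(x, y + h) - rosenbrock(x, y - h)) / (2 * h),
  ];
}

const [ax, ay] = gradient(-1, 1.5);      // analytic: [196, 100]
const [nx, ny] = numericGradient(-1, 1.5);
console.log(ax - nx, ay - ny);           // both differences near zero
```

The two agree to many digits, so in 2D the choice barely matters; in a million-dimensional model, only the analytic route is viable.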

Learning Rate Effects

The learning rate controls how far each step travels along the gradient. Set it too small, and optimization crawls — thousands of steps to reach the valley. Set it too large, and the optimizer overshoots, bouncing across the valley walls or diverging entirely. The sweet spot depends on the curvature of the surface. Rosenbrock's varying curvature makes a fixed rate problematic: what works on the steep walls fails on the flat valley floor.

const lr = parseFloat(document.getElementById('learningRate').value);
const [dx, dy] = gradient(x, y);
// Without momentum: direct step
vx = -lr * dx;
vy = -lr * dy;

Try lowering the learning rate to 0.0005 and watch the path crawl. Now raise it to 0.02 and watch it overshoot, oscillating wildly before settling. The rate is the optimizer's stride length, and a stride that doesn't adapt to the terrain is a liability on varied landscapes.
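Both failure modes are reproducible without the canvas. A minimal sketch (the start point (-1, 1), the step count, and the helper name descend are my choices; no safeguards the visualization might apply) runs plain gradient descent at the two rates:

```javascript
function rosenbrock(x, y) {
  return Math.pow(1 - x, 2) + 100 * Math.pow(y - x * x, 2);
}

function gradient(x, y) {
  const a = 1, b = 100;
  return [-2 * (a - x) - 4 * b * x * (y - x * x), 2 * b * (y - x * x)];
}

// Plain gradient descent from (-1, 1) for a fixed number of steps;
// returns the final loss.
function descend(lr, steps) {
  let x = -1, y = 1;
  for (let i = 0; i < steps; i++) {
    const [dx, dy] = gradient(x, y);
    x -= lr * dx;
    y -= lr * dy;
  }
  return rosenbrock(x, y);
}

const cautious = descend(0.0005, 50); // crawls, but the loss does drop below the start
const reckless = descend(0.02, 50);   // overshoots until the iterates blow up
console.log(cautious, reckless);
```

From this start the loss begins at 4: the cautious rate shaves it down slowly, while the large rate sends the iterates off to infinity within a handful of steps.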

Momentum Physics

Momentum changes everything. Instead of moving directly along the gradient, the optimizer accumulates velocity. Each step combines the new gradient with the previous momentum, building speed in consistent downhill directions while damping oscillations across them. The update becomes: velocity = momentum × velocity - learning_rate × gradient; position += velocity. This is how a ball rolls — it carries inertia.

function step() {
  const [dx, dy] = gradient(x, y);
  // Blend the carried velocity with the new gradient contribution.
  vx = momentum * vx - lr * dx;
  vy = momentum * vy - lr * dy;
  x += vx;
  y += vy;
}

On Rosenbrock's valley, momentum is transformative. Without it, the optimizer zigzags across the valley floor, fighting a gradient that points mostly sideways. With momentum, velocity accumulates along the valley direction, smoothing out the perpendicular oscillations. The particle accelerates through the valley like a ball rolling down a groove, picking up speed where the direction is consistent, slowing where the gradient shifts.
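The acceleration has a clean closed form. Under a constant gradient g, the update v ← momentum · v − lr · g is a geometric series whose velocity converges to −lr · g / (1 − momentum); with momentum 0.9, the steady-state step is ten times the plain-gradient step. A quick check (the constant g is illustrative):

```javascript
// Momentum velocity under a constant gradient converges to the
// geometric-series limit -lr * g / (1 - momentum).
const lr = 0.002;
const momentum = 0.9;
const g = 1; // a constant gradient component

let v = 0;
for (let i = 0; i < 200; i++) {
  v = momentum * v - lr * g;
}

const limit = -lr * g / (1 - momentum); // ≈ -0.02, ten times -lr * g
console.log(v, limit);
```

This is why velocity accumulates along the valley direction, where the gradient is consistent, while the alternating cross-valley components largely cancel.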

Trail Visualization

The trail shows history. Each position is stored in a bounded array, and the renderer draws line segments from oldest to newest with fading alpha. The color shifts along the path — teal near the start, amber near the end — encoding time visually. A bright dot marks the current position. The trail makes the optimization path tangible: you can see where momentum carried the optimizer through a flat region, where it bounced off a wall, where it accelerated smoothly.

trail.push({ x, y });
if (trail.length > MAX_TRAIL) trail.shift();

// In render:
for (let i = 1; i < trail.length; i++) {
  const alpha = i / trail.length;
  // toScreen maps surface coordinates to canvas pixels.
  const [sx0, sy0] = toScreen(trail[i - 1]);
  const [sx1, sy1] = toScreen(trail[i]);
  ctx.strokeStyle = `hsla(${160 - 140 * alpha}, 70%, 50%, ${alpha})`;
  ctx.beginPath();
  ctx.moveTo(sx0, sy0);
  ctx.lineTo(sx1, sy1);
  ctx.stroke();
}

The trail is not just aesthetic. It reveals optimization dynamics that the moment-by-moment view hides. Divergence shows as widening spirals. Slow convergence shows as a tight crawl along the valley. Efficient optimization leaves a smooth, accelerating arc toward the minimum.
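The bounded-history behavior can be checked in isolation. The sketch below (the MAX_TRAIL value and the stand-in positions are mine) pushes more points than the cap and confirms the oldest are dropped:

```javascript
const MAX_TRAIL = 300;
const trail = [];

// Record 500 positions; the array keeps only the newest 300.
for (let step = 0; step < 500; step++) {
  trail.push({ x: step, y: step }); // stand-in for optimizer positions
  if (trail.length > MAX_TRAIL) trail.shift();
}

console.log(trail.length, trail[0].x, trail[trail.length - 1].x);
```

Note that shift() is O(n) per call, which is fine for a few hundred points; at much larger trail sizes a ring buffer would avoid the repeated copying.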

What This Reveals

Gradient descent is not a search in the abstract. It is a walk through space — albeit a space of parameters rather than physical coordinates. Momentum is not a metaphor: it is literal inertia, carried forward from one step to the next. When you adjust the momentum slider, you are giving the particle mass. When you adjust the learning rate, you are setting its stride. The path is real, and watching it unfold makes the mathematics of optimization feel like what it is: mechanics.

The Rosenbrock valley exposes why optimization is hard. The gradient does not point toward the minimum — it points steeply down the nearest wall. To reach the optimum, the optimizer must accumulate motion along a direction the gradient only weakly suggests. Momentum is the mechanism that makes this possible: it filters out the noise of local gradient fluctuations and preserves signal in directions that matter. This is why momentum is ubiquitous in deep learning, and why watching descent on a curved surface makes the intuition obvious in a way equations never quite deliver.