
Chapter 2 · Lesson 2.5

Backprop gotchas on small graphs

The backward pass is simple enough on a five-node graph that wrong gradients become visible. That makes small graphs a good place to catch backprop gotchas before the system grows.

By the end of Lesson 2.4, the backward pass should feel mechanical:

  • seed the output gradient
  • walk backward in a valid order
  • apply each node's local rule
  • accumulate the totals on upstream nodes

That is enough to run backprop on a tiny graph. It is also enough to get the wrong answer if we mix up the bookkeeping.

This lesson uses the same five-node graph to surface the gotchas that produce wrong gradients and to make the distinction between data and grad feel concrete.

Start with a wrong result

Use the same graph:

a = Value(2.0)
b = Value(-3.0)
 
c = a * b
d = a + b
e = c + d

From Lessons 2.3 and 2.4, we expect:

$$a.\mathrm{grad} = -2, \qquad b.\mathrm{grad} = 3$$
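As a check that does not depend on the earlier lessons, the same values follow from differentiating the expression the graph computes. This is a short worked derivation using the values a = 2 and b = -3 from the graph above:

```latex
\begin{aligned}
e &= c + d = ab + a + b \\
\frac{\partial e}{\partial a} &= b + 1 = -3 + 1 = -2 \\
\frac{\partial e}{\partial b} &= a + 1 = 2 + 1 = 3
\end{aligned}
```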

Now imagine a backward pass that ends with:

$$a.\mathrm{grad} = 1, \qquad b.\mathrm{grad} = 1$$

Those numbers are wrong. They only include the contribution from the d = a + b branch, so the multiply branch has been lost. That is the first big backprop gotcha: some gradients need to accumulate more than one contribution. If we forget that, the final numbers can look clean and still be wrong.

Overwrite versus accumulate

In this graph, a receives gradient through two paths:

a -> c -> e
a -> d -> e

So a.grad needs to receive two contributions: -3 from c and +1 from d. Those contributions must be added together.

Wrong:

a.grad = b.data * c.grad   # contribution from c = a * b
a.grad = 1 * d.grad        # bug: replaces the value above

The second line overwrites the first contribution.

Right:

a.grad += b.data * c.grad  # adds -3
a.grad += 1 * d.grad       # adds +1

Now both contributions survive and the stored total becomes:

$$a.\mathrm{grad} = -3 + 1 = -2$$

The same idea applies to b:

$$b.\mathrm{grad} = 2 + 1 = 3$$

Backprop works because each node stores a running total, not just the most recent message it received.
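The difference between the two behaviours can be run end to end. The following is a minimal sketch, assuming a stripped-down `Value` class that supports only `+` and `*`. The `accumulate` switch, the `push_grad` method, and the `run` helper are hypothetical names introduced for illustration; they are not taken from the `Value` class used in the earlier lessons.

```python
class Value:
    """Tiny scalar node: supports only + and *, enough for the five-node graph."""

    accumulate = True  # hypothetical switch: False reproduces the overwrite bug

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # upstream nodes
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def push_grad(self):
        """Send this node's gradient to its parents via the local rules."""
        for parent, local in zip(self.parents, self.local_grads):
            contribution = local * self.grad
            if Value.accumulate:
                parent.grad += contribution
            else:
                parent.grad = contribution  # bug: the earlier contribution is lost


def run(accumulate):
    """Build the graph, run the backward pass, return (a.grad, b.grad)."""
    Value.accumulate = accumulate
    a, b = Value(2.0), Value(-3.0)
    c = a * b
    d = a + b
    e = c + d
    e.grad = 1.0            # seed the output gradient
    for node in (e, c, d):  # a valid backward order for this graph
        node.push_grad()
    return a.grad, b.grad


print(run(accumulate=True))   # (-2.0, 3.0)
print(run(accumulate=False))  # (1.0, 1.0)
```

With the overwrite bug, the result depends on the visiting order: here d is processed after c, so only the add branch survives.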

Data versus grad

Each node carries at least two different kinds of information:

  • data: the value at this node
  • grad: the derivative of the final output with respect to this value, i.e. how much the output changes when this value changes a little

For example:

a.data = 2
a.grad = -2

Those numbers are not two copies of the same thing. a.data = 2 says what a is. a.grad = -2 says how the final output responds if a changes: nudging a up by a small amount h moves e by -2h.

The same distinction holds for intermediate nodes:

c.data = -6
c.grad = 1

c.data is the forward value of the multiply node. c.grad is the backward influence that reached that node from the output.

If data and grad get blurred together, the backward pass stops making sense very quickly.
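To keep the two apart, it can help to print them side by side for every node. This is a small sketch that uses plain floats and two dictionaries, `data` and `grad`, as stand-ins for the per-node fields; the dictionaries are an illustrative assumption, not how the lesson's `Value` class stores them.

```python
# Forward pass: plain floats stand in for each node's .data
a, b = 2.0, -3.0
c = a * b  # -6.0
d = a + b  # -1.0
e = c + d  # -7.0
data = {"a": a, "b": b, "c": c, "d": d, "e": e}

# Backward pass by hand: each entry is d(e)/d(node)
grad = {"e": 1.0}                            # seed the output gradient
grad["c"] = 1.0 * grad["e"]                  # e = c + d
grad["d"] = 1.0 * grad["e"]
grad["a"] = b * grad["c"] + 1.0 * grad["d"]  # two paths, summed
grad["b"] = a * grad["c"] + 1.0 * grad["d"]

for name in "abcde":
    print(f"{name}: data={data[name]:5.1f}  grad={grad[name]:5.1f}")
```

No node has matching data and grad here, which makes it easier to notice when one has been used in place of the other.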

Why small graphs matter

On a tiny graph, wrong gradients are visible. We can inspect:

  • the paths
  • the local rules
  • the order of the backward pass
  • the running totals on each node

That makes small graphs a good place to catch bugs like:

  • forgetting to accumulate
  • mixing up data and grad

If the same mistakes happen in a large system, they are much harder to see.

So the practical lesson is simple:

Debug the idea on a small graph before trusting it on a large one.
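One way to do that debugging on a small graph is to compare the backward pass against a finite-difference estimate. The sketch below is an assumption about how such a check could look, not a technique introduced in this lesson; `forward`, `numerical_grads`, and `check` are hypothetical helper names, and the step size and tolerance are arbitrary choices.

```python
def forward(a, b):
    """Forward pass of the five-node graph, using plain floats."""
    c = a * b
    d = a + b
    return c + d


def numerical_grads(a, b, h=1e-6):
    """Central-difference estimates of de/da and de/db."""
    de_da = (forward(a + h, b) - forward(a - h, b)) / (2 * h)
    de_db = (forward(a, b + h) - forward(a, b - h)) / (2 * h)
    return de_da, de_db


def check(backprop_grads, a=2.0, b=-3.0, tol=1e-4):
    """Return True when the backprop result matches the numerical estimate."""
    expected = numerical_grads(a, b)
    return all(abs(got - want) < tol for got, want in zip(backprop_grads, expected))


print(check((-2.0, 3.0)))  # accumulate result: True
print(check((1.0, 1.0)))   # overwrite-bug result: False
```

The check only needs the forward computation, so it stays valid even when the backward bookkeeping is the part under suspicion.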

Tiny checkpoint

Suppose a backward pass on the familiar graph ends like this:

a.grad = 1
b.grad = 1

Answer before revealing:

  1. Which branch contribution is still present?
  2. Which branch contribution has been lost?
  3. What is the most likely bookkeeping bug?
  4. What should the correct final values be?

Answers:
  1. The add branch contribution.
  2. The multiply branch contribution.
  3. Overwrite instead of accumulate.
  4. a.grad = -2, b.grad = 3.

That is the kind of diagnosis this lesson wants to make feel routine.