By the end of Lesson 2.4, the backward pass should feel mechanical:
- seed the output gradient
- walk backward in a valid order
- apply each node's local rule
- accumulate the totals on upstream nodes
That is enough to run backprop on a tiny graph. It is also enough to get the wrong answer if we mix up the bookkeeping.
This lesson uses the same five-node graph to surface the gotchas that produce wrong gradients and to make the distinction between data and grad feel concrete.
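The four steps listed above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation from the earlier lessons: the `Value` class shown here, its `_parents` and `_local_grads` fields, and the `backward` function are assumptions made for illustration.

```python
class Value:
    """Hypothetical scalar node: a forward value plus a running gradient total."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data                  # forward value at this node
        self.grad = 0.0                   # running total, filled in by backward()
        self._parents = parents           # upstream nodes this one was built from
        self._local_grads = local_grads   # d(self)/d(parent), one per parent

    def __add__(self, other):
        # Local rule for addition: derivative 1 with respect to each input.
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # Local rule for multiplication: derivative is the other input's value.
        return Value(self.data * other.data, (self, other), (other.data, self.data))


def backward(output):
    # Build a topological order: every node appears after its parents.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node._parents:
                visit(parent)
            order.append(node)

    visit(output)

    output.grad = 1.0                     # 1. seed the output gradient
    for node in reversed(order):          # 2. walk backward in a valid order
        for parent, local in zip(node._parents, node._local_grads):
            # 3. apply the local rule, 4. accumulate on the upstream node
            parent.grad += local * node.grad


a = Value(2.0)
b = Value(-3.0)
c = a * b
d = a + b
e = c + d
backward(e)
print(a.grad, b.grad)  # -2.0 3.0
```

The `+=` in the inner loop is the line the rest of this lesson is about: replacing it with `=` is enough to produce clean-looking but wrong gradients.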
Start with a wrong result
Use the same graph:
a = Value(2.0)
b = Value(-3.0)
c = a * b
d = a + b
e = c + d
From Lessons 2.3 and 2.4, we expect:
a.grad = -2
b.grad = 3
Now imagine a backward pass that ends with:
a.grad = 1
b.grad = 1
Those numbers are wrong. They only include the contribution from the d = a + b branch, so the multiply branch has been lost. That is the first big backprop gotcha: some gradients need to accumulate more than one contribution. If we forget that, the final numbers can look clean and still be wrong.
Overwrite versus accumulate
In this graph, a receives gradient through two paths:
a -> c -> e
a -> d -> e
So a.grad needs to receive two contributions: -3 through c (the local derivative of c = a * b with respect to a is b = -3, and c.grad = 1) and +1 through d. Those contributions must be added together.
Wrong:
a.grad = b.data * c.grad
a.grad = 1 * d.grad
The second line overwrites the first contribution.
Right:
a.grad += b.data * c.grad
a.grad += 1 * d.grad
Now both contributions survive and the stored total becomes:
a.grad = -3 + 1 = -2
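The same idea applies to b, which receives +2 through c and +1 through d:
b.grad += a.data * c.grad
b.grad += 1 * d.grad
The stored total becomes b.grad = 2 + 1 = 3.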
Backprop works because each node stores a running total, not just the most recent message it received.
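The difference between the two versions can be reproduced with plain floats. This is a small sketch under stated assumptions: the variable names (`a_grad_overwrite`, `a_grad_accumulate`, and so on) are made up for illustration, and the gradients that reached c and d are written in by hand instead of being computed by a backward pass.

```python
# Forward values, plus the gradients that have already reached c and d.
a_data, b_data = 2.0, -3.0
c_grad, d_grad = 1.0, 1.0

# Buggy version: each assignment replaces whatever was stored before.
a_grad_overwrite = 0.0
a_grad_overwrite = b_data * c_grad    # contribution through c: -3.0
a_grad_overwrite = 1.0 * d_grad       # replaced by the contribution through d

b_grad_overwrite = 0.0
b_grad_overwrite = a_data * c_grad    # contribution through c: 2.0
b_grad_overwrite = 1.0 * d_grad       # replaced by the contribution through d

# Correct version: each contribution is added to the running total.
a_grad_accumulate = 0.0
a_grad_accumulate += b_data * c_grad  # -3.0
a_grad_accumulate += 1.0 * d_grad     # -3.0 + 1.0

b_grad_accumulate = 0.0
b_grad_accumulate += a_data * c_grad  # 2.0
b_grad_accumulate += 1.0 * d_grad     # 2.0 + 1.0

print(a_grad_overwrite, b_grad_overwrite)      # 1.0 1.0
print(a_grad_accumulate, b_grad_accumulate)    # -2.0 3.0
```

The buggy totals match the wrong result from the start of the lesson: only the last message to arrive, the one through d, is still stored.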
Data versus grad
Each node carries at least two different kinds of information:
- data: the value at this node
- grad: how the final output changes if this value changes a little
For example:
a.data = 2
a.grad = -2
Those numbers are not two copies of the same thing. a.data = 2 says what a is. a.grad = -2 says how the final output responds if a changes.
The same distinction holds for intermediate nodes:
c.data = -6
c.grad = 1
c.data is the forward value of the multiply node. c.grad is the backward influence that reached that node from the output.
If data and grad get blurred together, the backward pass stops making sense very quickly.
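One way to see that grad describes a response and not a value is to nudge an input slightly and watch the output move. The sketch below is an illustration under assumptions: the `forward` function and the step size `h` are not part of the lesson's code, and the graph is rewritten with plain floats so that only the forward values are involved.

```python
def forward(a, b):
    # Same graph as in the lesson, written with plain floats.
    c = a * b
    d = a + b
    return c + d


h = 1e-6                              # small nudge (assumed step size)
base = forward(2.0, -3.0)             # e.data for the original inputs

# Nudge one input at a time and measure how much the output moves per unit.
slope_a = (forward(2.0 + h, -3.0) - base) / h
slope_b = (forward(2.0, -3.0 + h) - base) / h

print(base)      # -7.0
print(slope_a)   # approximately -2, matching a.grad
print(slope_b)   # approximately 3, matching b.grad
```

The input a stays at 2 throughout. The slope of about -2 says that raising a slightly lowers e by roughly twice that amount, which is exactly what a.grad records.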
Why small graphs matter
On a tiny graph, wrong gradients are visible. We can inspect:
- the paths
- the local rules
- the order of the backward pass
- the running totals on each node
That makes small graphs a good place to catch bugs like:
- forgetting to accumulate
- mixing up data and grad
If the same mistakes happen in a large system, they are much harder to see.
So the practical lesson is simple:
Debug the idea on a small graph before trusting it on a large one.
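On a graph this small, every message of the backward pass can be written out and inspected. The sketch below logs each contribution and the running total it produces. The `contributions` list and the trace format are assumptions for illustration; the entries are filled in by hand from the local rules, not produced by an autodiff engine.

```python
from collections import defaultdict

a_data, b_data = 2.0, -3.0

# Gradients in backward order: e is seeded, then c and d receive from e.
e_grad = 1.0
c_grad = 1.0 * e_grad
d_grad = 1.0 * e_grad

# Each entry: (receiving node, node the message came from, contribution).
contributions = [
    ("c", "e", 1.0 * e_grad),
    ("d", "e", 1.0 * e_grad),
    ("a", "c", b_data * c_grad),   # multiply rule: the other input's value
    ("b", "c", a_data * c_grad),
    ("a", "d", 1.0 * d_grad),      # add rule: derivative 1
    ("b", "d", 1.0 * d_grad),
]

totals = defaultdict(float)
for target, source, amount in contributions:
    totals[target] += amount       # accumulate, never overwrite
    print(f"{target}.grad += {amount:+.1f} (from {source}), total = {totals[target]:+.1f}")
```

If a node's final total equals just one of its logged contributions even though it received several, an overwrite is the likely cause.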
Tiny checkpoint
Suppose a backward pass on the familiar graph ends like this:
a.grad = 1
b.grad = 1
Answer before revealing:
- Which branch contribution is still present?
- Which branch contribution has been lost?
- What is the most likely bookkeeping bug?
- What should the correct final values be?
Reveal answers
- The add branch contribution.
- The multiply branch contribution.
- Overwrite instead of accumulate.
- a.grad = -2, b.grad = 3.
That is the kind of diagnosis this lesson wants to make feel routine.