Reverse-mode autodiff

Lesson 2.3 got total derivatives by hand:

find every path
multiply along each path
add the path contributions

That works on a tiny graph. But if many earlier values feed one final scalar output, doing that bookkeeping separately for each value becomes wasteful.

Reverse-mode autodiff avoids that repetition. It starts at the output once, moves backward once, and stores running totals on the nodes.

Why the backward pass starts at the output

Use the same graph:

a = Value(2.0)
b = Value(-3.0)
 
c = a * b
d = a + b
e = c + d

In Lesson 2.3, we found:

\frac{\partial e}{\partial a} = -2, \qquad \frac{\partial e}{\partial b} = 3

To get ∂e/∂a, we traced every path from a to e, multiplied along each path, and added the contributions. To get ∂e/∂b, we did the same kind of bookkeeping again for b.

That is fine for a five-node graph. It is not fine if the graph has tens or hundreds of earlier values and we still care about just one final scalar output. On a larger graph, we keep solving the same path problem over and over:

the graph is the same
the output is the same
only the starting input changes

So instead of starting from each input and asking how it affects e, we start from e and send influence backward once. As that signal moves backward, each node passes local contributions to its parents, and the graph accumulates the total effect on every upstream value at the same time.

That is the computational reason to start at the output. There is also a simple mathematical reason for the first step:

\frac{\partial e}{\partial e} = 1

The output changes one-for-one with itself, so the backward pass begins with e.grad = 1. The output node begins with the amount of influence it already has on itself. From there, each node uses its local rule to pass part of that influence to its parents.

Walk backward in reverse topological order

Now we need an order for the backward pass.

For this graph, e has to go before c and d, because c and d need to receive gradient from e first. So one valid order for the internal nodes is e -> c -> d. You could also swap the order of c and d. They are siblings. What matters is that e is processed before either of them.

A standard name for this processing order is reverse topological order. Operationally, it just means: start at the output, then move backward through valid dependencies.

The output pushes back to its parents

Start with e.grad = 1. The node e was created by the addition e = c + d. For an add node, the local sensitivities are:

\frac{\partial e}{\partial c} = 1, \qquad \frac{\partial e}{\partial d} = 1

So e passes its gradient back like this:

c.\mathrm{grad} \mathrel{+{=}} 1 \cdot e.\mathrm{grad}, \qquad d.\mathrm{grad} \mathrel{+{=}} 1 \cdot e.\mathrm{grad}

Since e.grad = 1, the result is c.grad = 1 and d.grad = 1. At this point the graph state is:

e.grad = 1
c.grad = 1
d.grad = 1
a.grad = 0
b.grad = 0

That is the whole job of e: pass its current influence to its parents.

The multiply node pushes back

Now process c. It was created by the multiplication c = a * b. From Lesson 2.2, its local rules are:

\frac{\partial c}{\partial a} = b, \qquad \frac{\partial c}{\partial b} = a

The backward update is:

a.\mathrm{grad} \mathrel{+{=}} b \cdot c.\mathrm{grad}, \qquad b.\mathrm{grad} \mathrel{+{=}} a \cdot c.\mathrm{grad}

At our current values, a is 2, b is -3, and c.grad is 1, so:

a.\mathrm{grad} \mathrel{+{=}} -3 \cdot 1 = -3, \qquad b.\mathrm{grad} \mathrel{+{=}} 2 \cdot 1 = 2

Now the graph state is:

e.grad = 1
c.grad = 1
d.grad = 1
a.grad = -3
b.grad = 2

Now one branch has written its contribution onto a and b.

The add node pushes back

Now process d. It was created by the addition d = a + b. Its local rules are:

\frac{\partial d}{\partial a} = 1, \qquad \frac{\partial d}{\partial b} = 1

So the backward update is:

a.\mathrm{grad} \mathrel{+{=}} 1 \cdot d.\mathrm{grad}, \qquad b.\mathrm{grad} \mathrel{+{=}} 1 \cdot d.\mathrm{grad}

Since d.grad = 1, this adds 1 to both a.grad and b.grad. Now the graph state becomes:

e.grad = 1
c.grad = 1
d.grad = 1
a.grad = -2
b.grad = 3

These are exactly the totals we computed by hand in Lesson 2.3. The graph did not discover a new answer. It reproduced the same path logic from Lesson 2.3, but it did not make us restart from each input separately.

Why node.grad Matters

Each node needs a place to store the total influence that has reached it so far: node.grad.

At the start, every node's gradient is 0, except the output node, which is seeded with 1. Then each backward step adds new contributions into the totals that are already there.

That is why a.grad ends at -2:

first contribution from c: -3
second contribution from d: +1
total stored on the node: -2

This is Lesson 2.3 stored on a node: multiply within each path, add across paths.

A tiny algorithm sketch

Here is the whole idea in compact form:

topo = [a, b, c, d, e]   # any valid topological order
e.grad = 1
 
for node in reversed(topo):
    node._backward()

topo is the graph listed in forward order, inputs first. reversed(topo) walks that list backward, so the loop visits e before c and d, then reaches a and b last.

The line node._backward() means: use this node's local rule, push contributions to its parents, and add those contributions into parent.grad.

That short loop works because the derivative logic was already broken into local rules.

Tiny checkpoint

Use the same structure with new values:

x = Value(4.0)
y = Value(-1.0)
 
p = x * y
q = x + y
r = p + q

Answer before revealing:

What is the starting gradient?
After processing r, what are p.grad and q.grad?
After processing p, what contributions reach x and y?
After processing q, what are the final values of x.grad and y.grad?

Reveal answers

r.grad = 1.
p.grad = 1, q.grad = 1.
x gets -1, y gets 4.
x.grad = 0, y.grad = 5.

The important check is x.grad = 0.

The multiply branch contributes -1. The add branch contributes +1. The stored total becomes 0.