
Chapter 2 · Lesson 2.2

Local derivatives

A graph tells us what depends on what. A local derivative tells us how strongly one direct dependency matters.

Lesson 2.1 showed that ordinary code can build a graph of dependencies. This lesson adds the next piece: each operation in that graph carries its own local rule.

If this direct input moves a little, how does this operation's output move?

That is the question a local derivative answers. Later lessons will connect these local facts across longer paths. For now, we stay local.

The graph gives us edges; derivatives give those edges meaning

Consider the graph from the previous lesson:

a = Value(2.0)
b = Value(-3.0)
 
c = a * b
d = a + b
e = c + d

The graph tells us that c depends on a and b, but not how strongly. For c = a * b, if a changes a tiny bit, what happens to c? What if b changes instead? Those are derivative questions, and each edge into c carries its own local sensitivity.

We are not yet asking how a affects e. We are only asking how a and b affect c directly.

[Figure: inputs a and b feeding the node c = a · b]

Sensitivity means "what happens if I nudge this?"

A derivative is often introduced as a formula. Here the simpler intuition is enough: a derivative is local sensitivity. Suppose c = a * b with a at 2.0 and b at -3.0, so c is -6.0. Now nudge a slightly upward and watch what happens to c:

[Interactive Python cell: nudges a from 2.0 to 2.001 with b held at -3.0, then recomputes c.]
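The cell's code is not preserved on this page. A minimal sketch of the nudge it describes might look like the following, assuming plain floats instead of the course's Value class; the names h, c_before, c_after, change_in_c, and rate are illustrative and not taken from the lesson.

```python
# Nudge experiment for c = a * b, using plain floats (assumption: no Value class).
a, b = 2.0, -3.0
h = 0.001  # size of the nudge applied to a

c_before = a * b        # forward value at the starting point
c_after = (a + h) * b   # same operation after nudging a; b stays fixed

change_in_c = c_after - c_before
rate = change_in_c / h  # observed change in c per unit change in a

print(f"a moved by {h}")
print(f"c moved by {change_in_c:.6f}")
print(f"observed rate: {rate:.6f}")
```

Up to floating-point rounding, the change in c is -0.003 and the observed rate is -3.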

So a increased by 0.001 and c decreased by 0.003. With b fixed at -3, the relation becomes c = -3a. That makes c a straight-line function of a. Every change in a gets scaled by exactly -3, whether the step is small or large. In this example, the derivative is not just a local approximation. It gives the exact change in c for any step in a. So the local derivative of c with respect to a is:

$$\frac{\partial c}{\partial a} = b = -3$$

A derivative is a local conversion rate. In this example, that local rate is exact because the dependence on a is linear.

Addition is the pass-through operation

Start with the simplest local rule, d = a + b. If a increases by a tiny amount, d increases by the same amount. If b increases by a tiny amount, d also increases by the same amount.

$$\frac{\partial d}{\partial a} = 1, \qquad \frac{\partial d}{\partial b} = 1$$

Addition passes sensitivity through unchanged. If a changes by +0.001, then d changes by +0.001. If b changes by +0.001, then d also changes by +0.001. A local derivative of 1 means the output changes one-for-one with that direct input.
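A one-line worked equation makes the pass-through explicit. Here h is a symbol introduced only for illustration: a small nudge applied to a while b is held fixed.

$$(a + h) + b = (a + b) + h = d + h$$

The output moves by exactly h, whatever the values of a and b, which is why the local derivative is 1. The same argument applies to a nudge in b.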

Multiplication swaps the other input into the local rule

Multiplication is only slightly more interesting. For c = a * b, the local derivatives are:

$$\frac{\partial c}{\partial a} = b, \qquad \frac{\partial c}{\partial b} = a$$

The derivative with respect to one input is the other input. At a = 2.0 and b = -3.0:

$$\frac{\partial c}{\partial a} = -3, \qquad \frac{\partial c}{\partial b} = 2$$

So near this point:

  • nudging a upward pushes c downward
  • nudging b upward pushes c upward
  • a's local effect is scaled by b
  • b's local effect is scaled by a
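The multiplication rule can be sketched as a tiny function. This is an illustration only: mul_local_grads is a hypothetical helper name, and it works on plain floats, not on the course's Value objects.

```python
def mul_local_grads(a, b):
    """Local derivatives of c = a * b, returned as (dc/da, dc/db)."""
    # Each input's local derivative is the *other* input.
    return b, a

dc_da, dc_db = mul_local_grads(2.0, -3.0)
print(dc_da, dc_db)  # -3.0 2.0

# Local derivative times nudge gives the predicted change in c.
h = 0.001
print(dc_da * h)  # about -0.003: nudging a upward pushes c downward
print(dc_db * h)  # about +0.002: nudging b upward pushes c upward
```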

Power, exp, and log are just more local rules

Addition and multiplication show the pattern. Power, exp, and log follow it too: each operation supplies its own direct sensitivity rule.

For power, y = x ** n with fixed n, the local derivative is:

$$\frac{\partial y}{\partial x} = n \cdot x^{n-1}$$

At x = 3.0 with y = x ** 2, the local derivative is 2 · 3.0 = 6.0. Near x = 3, a tiny movement in x gets scaled by about 6.

For y = exp(x), the local derivative equals the output itself:

$$\frac{\partial y}{\partial x} = \exp(x)$$

For y = log(x), the local derivative is:

$$\frac{\partial y}{\partial x} = \frac{1}{x}$$

For log(x), the input must be positive. If x is 0 or negative, log(x) is undefined over the real numbers, so the operation is not valid there. That restriction is part of what log means, not an extra technicality added later.
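The three rules above can be written as small functions on plain floats. The function names are hypothetical, and raising ValueError for a non-positive log input is one possible way to enforce the restriction, not necessarily what the course's code does.

```python
import math

def pow_local_grad(x, n):
    """Local derivative of y = x ** n with respect to x, for fixed n."""
    return n * x ** (n - 1)

def exp_local_grad(x):
    """Local derivative of y = exp(x): the same number as the output."""
    return math.exp(x)

def log_local_grad(x):
    """Local derivative of y = log(x); only defined for positive x."""
    if x <= 0:
        raise ValueError("log(x) requires x > 0")
    return 1.0 / x

print(pow_local_grad(3.0, 2))  # 6.0
print(exp_local_grad(0.0))     # 1.0
print(log_local_grad(4.0))     # 0.25
```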

A derivative can be exact or approximate

A derivative tells us how fast the output is changing at the starting point. Multiply that derivative by a step size, and you get a prediction for how much the output should change.

Sometimes that prediction is exact. That happened in our multiplication example. With b fixed at -3, the relation is c = -3a. This is a straight line in a, so the rate of change stays -3 everywhere. If a increases by 1, then c changes by exactly -3. If a increases by 0.001, then c changes by exactly -0.003.

But not every function is a straight line. For y = x ** 2, the derivative at x = 2 is:

$$\frac{\partial y}{\partial x} = 4$$

If we call the step size h, then the actual change is:

$$\Delta y = (2 + h)^2 - 2^2 = 4h + h^2$$

The derivative predicts only the linear part near the starting point:

$$\Delta y \approx 4h$$

If x increases by 0.01, the derivative predicts 0.04 while the actual change is 0.0401. That is close, but not exact.

If x increases by 1, the derivative predicts 4 while the actual change is 5. Now the miss is larger.

That is the distinction to keep straight:

  • if the output depends on the input like a straight line, derivative times step gives the exact change
  • if the function bends, derivative times step is a local approximation and works best for small steps
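To watch the approximation degrade as the step grows, here is a small sketch for y = x ** 2 at x = 2. It uses plain floats, and the step sizes and variable names are chosen for illustration.

```python
x = 2.0
derivative = 2 * x  # local derivative of x ** 2 at x = 2

rows = []
for h in (0.01, 0.1, 1.0):
    predicted = derivative * h      # linear part only
    actual = (x + h) ** 2 - x ** 2  # true change in y
    rows.append((h, predicted, actual, actual - predicted))

for h, predicted, actual, miss in rows:
    print(f"h={h}: predicted={predicted:.4f} actual={actual:.4f} miss={miss:.4f}")
```

The miss equals h squared up to rounding, so it shrinks much faster than the step itself as h gets small.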

A node does not need to understand the whole graph

This is the core design principle. Look at the multiplication node c = a * b. It only needs two local facts: dc/da = b and dc/db = a.

Nothing about the rest of the graph is required, which is why the Value object can stay small. It needs:

  • the forward value
  • the parent links
  • the operation that created it
  • the local rule for that operation
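One possible shape for such an object is sketched below. This is a hypothetical minimal class, not the course's actual Value implementation: the attribute names parents, op, and local_grads are assumptions, and storing the local rule as already-evaluated numbers is just one of several ways to do it.

```python
class Value:
    """Hypothetical minimal node; the course's own Value class may differ."""

    def __init__(self, data, parents=(), op="", local_grads=()):
        self.data = data                # forward value
        self.parents = parents          # direct inputs to this node
        self.op = op                    # operation that created this node
        self.local_grads = local_grads  # one local derivative per parent

    def __add__(self, other):
        # Addition passes sensitivity through: both local derivatives are 1.
        return Value(self.data + other.data, (self, other), "+", (1.0, 1.0))

    def __mul__(self, other):
        # Multiplication: each parent's local derivative is the other parent's value.
        return Value(self.data * other.data, (self, other), "*",
                     (other.data, self.data))

a = Value(2.0)
b = Value(-3.0)
c = a * b
print(c.data, c.op, c.local_grads)  # -6.0 * (-3.0, 2.0)
```

Note that c stores nothing about nodes beyond its direct parents; that is the locality the list above describes.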

Forward value and backward influence are different things

This is the confusion to kill early. When we compute c = a * b, we get a forward value, c.data = -6.0. The local derivatives are different facts:

$$\frac{\partial c}{\partial a} = -3, \qquad \frac{\partial c}{\partial b} = 2$$

They are sensitivity rules attached to the operation, not extra forward outputs. The data field answers what value this node produced. The local derivative answers how this node's output would respond if one direct input moved slightly. These are different questions.

Tiny checkpoint

Consider these direct operations:

a = Value(2.0)
b = Value(-3.0)
 
c = a + b
d = a * b
p = a ** 2

Answer before revealing:

  1. What is c.data?
  2. What is the local derivative dc/da?
  3. What is the local derivative dc/db?
  4. What is d.data?
  5. What is the local derivative dd/da?
  6. What is the local derivative dd/db?
  7. What is p.data?
  8. What is the local derivative dp/da?
Answers:
  1. c.data = -1.0
  2. dc/da = 1.0
  3. dc/db = 1.0
  4. d.data = -6.0
  5. dd/da = b = -3.0
  6. dd/db = a = 2.0
  7. p.data = 4.0
  8. dp/da = 2 * a = 4.0

The point is not memorizing the table. The point is seeing the pattern: each operation has a local rule that describes direct sensitivity at that operation.
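The checkpoint answers can also be checked numerically. The sketch below estimates each local derivative with a central difference on plain floats; central_diff is a hypothetical helper, and the results are estimates that agree with the exact rules only up to rounding.

```python
def central_diff(f, x, h=1e-6):
    """Estimate df/dx at x from two nearby evaluations (central difference)."""
    return (f(x + h) - f(x - h)) / (2 * h)

a, b = 2.0, -3.0

estimates = {
    "dc/da": central_diff(lambda t: t + b, a),   # c = a + b, vary a
    "dc/db": central_diff(lambda t: a + t, b),   # c = a + b, vary b
    "dd/da": central_diff(lambda t: t * b, a),   # d = a * b, vary a
    "dd/db": central_diff(lambda t: a * t, b),   # d = a * b, vary b
    "dp/da": central_diff(lambda t: t ** 2, a),  # p = a ** 2
}

for name, value in estimates.items():
    print(f"{name} ~ {value:.4f}")
```

Each estimate should land close to the matching answer: 1, 1, -3, 2, and 4.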