# basic imports
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Spelled-out intro to neural networks and backpropagation
A blog implementation of Andrej Karpathy’s micrograd video
Contents:
1. Derivative of a function
2. Core Value Object
3. Manual Backpropagation
4. Backward function
5. More operations
6. PyTorch
Let's get an intuitive understanding of what a derivative is.
Let's use a simple quadratic function as an example:
\(f(x) = 3x^2 - 4x + 5\)
def f(x):
    return 3*x**2 - 4*x + 5

f(3.0)
20.0
Take a look at the shape of the function: we can expect a parabola, since we know it's a quadratic.
# creating a range of x values from -5 to 5, incrementing by 0.25, to pass into our function to generate our y values
xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)
We now want to think through what the derivative of this function is at different points x. Let's refresh with the definition of a derivative:
\(f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}\)
You are basically measuring how sensitive the function is at a point by bumping its x value slightly by the small number h.
Intuitively, how would you expect this function to respond if we nudged x = 3.0 by a small positive number h? The amount the output responds by tells you the strength of the slope.
h = 0.0001
x = 3.0
print(f'slope of function at x = {x}, slope = {(f(x + h) - f(x)) / h}')
slope of function at x = 3.0, slope = 14.000300000063248
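As a quick sanity check, we can compare this against the analytic derivative from basic calculus, \(f'(x) = 6x - 4\). The helper df below is just for illustration:

def df(x):
    return 6*x - 4 # analytic derivative of 3*x**2 - 4*x + 5

print(df(3.0)) # 14.0, matching the numerical estimate above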
Let's do a hacky implementation with more variables.
Look at the function a*b + c in relation to the variables we assign below. Imagine nudging each variable by a tiny amount: would our output increase or decrease?
If we slightly nudge each input variable by the tiny amount h (an amount approaching 0), we can approximate the instantaneous rate of change by taking the difference in the output before and after, divided by the amount we nudged by. This gives us the slope.
# lets get more complex
h = 0.0001

# inputs
a = 2.0
b = -3.0
c = 10.0

# we wanna find the derivative of d with respect to a, b, c
d1 = a*b + c
a += h
d2 = a*b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)
d1 4.0
d2 3.999699999999999
slope -3.000000000010772
The slope is -3.0, which is exactly the value of b: since d = a*b + c, the derivative of d with respect to a is b. Let's do it with b now.
d1 = a*b + c
b += h
d2 = a*b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)
d1 3.999699999999999
d2 3.99990001
slope 2.0001000000124947
Similarly, the slope 2.0 is just the value of a. And c…
d1 = a*b + c
c += h
d2 = a*b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)
d1 3.99990001
d2 4.00000001
slope 0.9999999999976694
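All three slopes line up with the analytic partial derivatives of d = a*b + c: \(\frac{\partial d}{\partial a} = b\), \(\frac{\partial d}{\partial b} = a\), \(\frac{\partial d}{\partial c} = 1\). A quick check (resetting a, b, and c first, since each was nudged above):

a, b, c = 2.0, -3.0, 10.0
print('dd/da =', b)   # -3.0, matches our first slope estimate
print('dd/db =', a)   # 2.0, matches the second
print('dd/dc =', 1.0) # 1.0, matches the third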
Hopefully this has helped build an intuitive sense of what the derivative tells you about a function. Now we want to move on to neural networks, which will be massive mathematical expressions, so we need some structure to maintain those expressions. We will build out a Value object that can keep track of state and let us compose expressions.
Core Value Object
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data)
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data)
        return out

a = Value(2.0)
b = Value(-3.0)
a + b
a * b
Value(data=-6.0)
We use double underscore (dunder) methods so Python knows what to do internally when we use operators and built-ins such as print (which falls back to __repr__), +, -, and *. So when we call a + b above, what is happening internally is a.__add__(b), with self as a and b as other. Similarly, when we do a * b, Python calls a.__mul__(b).
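A quick check that the two spellings really are the same call, reusing the a and b we just defined:

print(a + b)        # Value(data=-1.0)
print(a.__add__(b)) # Value(data=-1.0), the operator is just sugar for this call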
Cool, so now we can do basic arithmetic. Next we need to add pointers so we can see which values produce other values, and by what operation.
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L
Value(data=-8.0)
Now that we have a way to store the parents, labels, and operations, we can visualize the expression graph. The code below does that using a library called graphviz.
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in a graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for any value in the graph, create a rectangular ('record') node for it
        dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f}" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            # if this value is a result of some operation, create an op node for it
            dot.node(name = uid + n._op, label = n._op)
            # and connect this node to it
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot

draw_dot(L)
Let's recap:
- We are able to build out mathematical expressions using + and *
- We added and track grad so we can calculate and update this state later when we do backpropagation
- We ran a forward pass that produces the output L, and visualized it
- Now we want to do backpropagation
Manual Backpropagation
Let's manually nudge a variable and calculate the derivative of L with respect to it, wrapping everything in a throwaway function lol so we don't pollute the global scope. We can do this for each variable to calculate its derivative with respect to L. Let's start with d:
\(L = d * f\)
\(\frac{dL}{dd} = ?\)
\(\lim_{h \to 0} \frac{(d+h)*f - d*f}{h}\)
\(= \lim_{h \to 0} \frac{d*f + h*f - d*f}{h}\)
\(= \lim_{h \to 0} \frac{h*f}{h}\)
\(= f\)
So we can see that d.grad is just the value of f, which is -2.0, and by symmetry f.grad is just the value of d, which is 4.0. The base case L.grad is 1, since \(\frac{dL}{dL} = 1\). Let's go ahead and manually set these:
f.grad = 4.0
d.grad = -2.0
L.grad = 1
def lol():
    h = 0.0001

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    d.data += h # this is the variable we are nudging by h
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data

    print((L2 - L1)/h)

lol()
-1.9999999999953388
So we have just derived the derivatives of L with respect to f and d in a step-by-step way. Next we will uncover the core of backpropagation: we want to derive the derivative of L with respect to c and e.
We know how L is sensitive to d, and we know how d is sensitive to e and c; we can now put those together to figure out how L is sensitive to e and c.
If a variable z depends on the variable y, which itself depends on the variable x (that is, y and z are dependent variables), then z depends on x as well, via the intermediate variable y. In this case, the chain rule is expressed as
\(\frac{dz}{dx} = \frac{dz}{dy} * \frac{dy}{dx}\)
The chain rule fundamentally tells us how to chain these derivatives together correctly: to differentiate through a function composition, we multiply the local derivatives.
The intuitive explanation is that knowing the instantaneous rate of change of z relative to y, and of y relative to x, allows one to calculate the instantaneous rate of change of z relative to x as the product of the two rates of change.
If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car is 2 * 4 = 8 times faster than the man.
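Here is a minimal numerical sketch of that analogy, with toy functions made up for illustration (bicycle position y = 4x, car position z = 2y):

eps = 1e-6
y = lambda x: 4*x   # bicycle: 4 times the walker
z = lambda y_: 2*y_ # car: 2 times the bicycle
x0 = 3.0
print((z(y(x0 + eps)) - z(y(x0))) / eps) # ~8.0 = 2 * 4, as the chain rule predicts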
We know \(\frac{dL}{dd}\) and \(\frac{dd}{dc}\), and want to find \(\frac{dL}{dc}\). The chain rule tells us that \(\frac{dL}{dc} = \frac{dL}{dd} * \frac{dd}{dc}\):
\(\frac{dL}{dc} = -2.0 * 1.0 = -2.0\)
For plus nodes "+", the chain rule tells us we are just routing the gradient backwards, because the local derivative of addition is 1.0:
e.grad = -2.0
c.grad = -2.0

draw_dot(L)
Let's manually check our work by nudging c:
def lol():
    h = 0.0001

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    c.data += h # this is the variable we are nudging by h
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data

    print((L2 - L1)/h)

lol()
-1.9999999999953388
As we expected, c.grad equals -2.0.
Now we will recurse our way backwards and apply the chain rule a second time:
\(\frac{dL}{de} = -2.0\)
\(\frac{de}{da} = b\)
\(\frac{dL}{da} = \frac{dL}{de} * \frac{de}{da}\)
We are multiplying the derivative of L with respect to e by the local gradients:
a.grad = -2.0 * -3.0
b.grad = -2.0 * 2.0

draw_dot(L)
Let's verify:
def lol():
    h = 0.0001

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    a = Value(2.0, label='a')
    a.data += h # this is the variable we are nudging by h
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data

    print((L2 - L1)/h)

lol()
6.000000000021544
Checks out: \(-2.0 * -3.0 = 6.0\).
We now know what backpropagation is: a recursive application of the chain rule backwards through the computational graph.
Neuron Example
One step optimization
# nudge every leaf node in the direction of its gradient
a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad

# rerun the forward pass; moving with the gradient should increase L
e = a * b
d = e + c
L = d * f

print(L.data)
-7.286496
For our model of a neuron, we have inputs x and synapses that have weights w on them. The synapse interacts with the input multiplicatively, so what flows to the cell body is w times x; since there are multiple inputs, there are many w times x terms flowing in. The cell body also has some bias b, a sort of trigger happiness of the neuron, making it more or less prone to firing. We then pass the result through an activation function, generally some kind of squashing function like a sigmoid or tanh, so the neuron computes \(o = \tanh(\sum_i w_i x_i + b)\). Let's go over an example of a tanh activation function:
plt.plot(np.arange(-5, 5, 0.2), np.tanh(np.arange(-5, 5, 0.2))); plt.grid()
You can see that the inputs get squashed on the y axis: the function caps out at 1.0 and -1.0.
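For instance, even fairly large inputs land just inside those caps; a quick check (outputs noted approximately in the comments):

print(np.tanh(5.0))  # ~0.9999
print(np.tanh(-5.0)) # ~-0.9999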
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'

draw_dot(n)
We need to add more operations to our Value class to be able to calculate our activation function. Let's just do a cheeky implementation of tanh directly on our Value class for now.
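It leans on the exponential form of tanh, which only needs math.exp:
\(\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}\)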
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')
        return out
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
draw_dot(n)

o = n.tanh(); o.label = 'o'
draw_dot(o)
Awesome, n goes through tanh to produce the final output o. Our activation function is working great; now all we need is the derivative of tanh and we can use backpropagation.
o.grad = 1.0
Let's calculate the gradient of n. The local derivative of tanh is \(\frac{do}{dn} = 1 - \tanh(n)^2 = 1 - o^2\):
1 - o.data**2
0.4999999999999999
n.grad = 0.5
Now we can easily get the gradients for x1w1x2w2, b, x1w1, and x2w2: since these were produced by addition, the local derivatives are just 1, so we simply route the value 0.5 through:
x1w1x2w2.grad = 0.5
b.grad = 0.5
x1w1.grad = 0.5
x2w2.grad = 0.5
We can now calculate the gradients for x2, w2, x1, and w1. Unlike the last gradients, multiplication was the operation here, so the local derivative is just the other term used in the operation.
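In other words, the local derivative of a product with respect to each factor is simply the other factor:
\(\frac{\partial (x*y)}{\partial x} = y, \quad \frac{\partial (x*y)}{\partial y} = x\)
Let's calculate the gradients: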
x2.grad = w2.data * x2w2.grad
w2.grad = x2.data * x2w2.grad
x1.grad = w1.data * x1w1.grad
w1.grad = x1.data * x1w1.grad
draw_dot(o)
Nice! We have manually used backpropagation to calculate our gradients. Now let's implement a backward function for each operation.
Backward Function
Let's add a _backward method to our Value object for each operation, which we can call to calculate our gradients for us using backpropagation.
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')

        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward

        return out
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'

draw_dot(o)
We initialize the gradient of o to 1.0 (the base case, since \(\frac{do}{do} = 1\)), then call _backward on each node in turn to propagate the gradients:
o.grad = 1.0
o._backward()
draw_dot(o)

n._backward()
b._backward()
x1w1x2w2._backward()
x1w1._backward()
x2w2._backward()
draw_dot(o)
Awesome, this works great. But we never want to call _backward on a node before we have calculated the gradients of everything after it, since it depends on their gradients. We will use a topological sort, which lays out the graph so that all edges go only from left to right.
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'

draw_dot(o)
# topological sort
topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)
build_topo(o)
topo
[Value(data=1.0),
Value(data=0.0),
Value(data=0.0),
Value(data=2.0),
Value(data=-3.0),
Value(data=-6.0),
Value(data=-6.0),
Value(data=6.881373587019543),
Value(data=0.8813735870195432),
Value(data=0.7071067811865476)]
Our Value objects are now ordered properly, with the output o (data=0.7071) last; iterating in reverse lets us call _backward on each node only after all the nodes that depend on it.
o.grad = 1.0
for node in reversed(topo):
    node._backward()

draw_dot(o)
# Let's implement this as a method in the Value object and add more operations
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other) # wrap plain numbers so Value + 2 works
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1

    def __neg__(self): # -self
        return self * -1

    def __sub__(self, other): # self - other
        return self + (-other)

    def __radd__(self, other): # other + self
        return self + other

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward

        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')

        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward

        return out

    def exp(self):
        x = self.data
        out = Value(math.exp(x), (self, ), 'exp')

        def _backward():
            self.grad += out.data * out.grad # d/dx e**x = e**x, which is out.data
        out._backward = _backward

        return out

    # Now we can call backward on our Value object
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
a = Value(2.0)
b = Value(4.0)
a - b
Value(data=-2.0)
a = Value(2.0)
a.exp()
Value(data=7.38905609893065)
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'

o.backward()
draw_dot(o)
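As a test of the new operations, we can also build tanh out of them instead of treating it as a single primitive, using the identity \(\tanh(n) = \frac{e^{2n} - 1}{e^{2n} + 1}\). Running backward through this longer graph should produce exactly the same gradients: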
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
# ----
e = (2*n).exp()
o = (e - 1) / (e + 1)
# ----
o.label = 'o'

o.backward()
draw_dot(o)
PyTorch comparison
import torch

x1 = torch.Tensor([2.0]).double() ; x1.requires_grad = True
x2 = torch.Tensor([0.0]).double() ; x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double() ; w1.requires_grad = True
w2 = torch.Tensor([1.0]).double() ; w2.requires_grad = True
b = torch.Tensor([6.8813735870195432]).double() ; b.requires_grad = True
n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())
o.backward()

print('---')
print('x2', x2.grad.item())
print('w2', w2.grad.item())
print('x1', x1.grad.item())
print('w1', w1.grad.item())
0.7071066904050358
---
x2 0.5000001283844369
w2 0.0
x1 -1.5000003851533106
w1 1.0000002567688737
o
tensor([0.7071], dtype=torch.float64, grad_fn=<TanhBackward0>)
o.item()
0.7071066904050358

The forward value and all four gradients match what our Value implementation produced, confirming it agrees with PyTorch's autograd.
import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1,1))

    def __call__(self, x):
        # w * x + b
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        return out

    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
        # params = []
        # for neuron in self.neurons:
        #     ps = neuron.parameters()
        #     params.extend(ps)
        # return params

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])
n(x)
Value(data=-0.7028959990425087)
len(n.parameters())
41
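That count checks out: with one bias per neuron, the layers contribute \((3 \cdot 4 + 4) + (4 \cdot 4 + 4) + (4 \cdot 1 + 1) = 16 + 20 + 5 = 41\) parameters.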
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]
ypred = [n(x) for x in xs]
ypred
[Value(data=-0.7028959990425087),
Value(data=0.1757758058598642),
Value(data=0.39624177470697325),
Value(data=0.16264413106842704)]
So how do we tune the weights to better predict the desired targets? We calculate a single number that measures the total performance of the neural net; this is called the loss.
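Concretely, we'll use the sum of squared errors between each prediction and its target:
\(\text{loss} = \sum_i (y_{out,i} - y_{gt,i})^2\)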
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss
Value(data=6.932959473871423)
loss.backward()
Now that we have called backward on the loss, we can look at the gradient of a single weight in one of our layers to see how it impacts our loss. This will be useful soon, when we update our weights to decrease the loss.
n.layers[0].neurons[0].w[0].data
0.28438888706081467
The gradient for this weight turns out to be positive, so this weight is increasing our loss:
n.layers[0].neurons[0].w[0].grad
1.711566891777295
n.layers[0].neurons[0].b.grad
1.0849775541180198
We can now also call draw_dot on our loss; the DAG has grown tremendously in complexity.
draw_dot(loss)
We can update our parameters by subtracting from each its gradient multiplied by a learning rate, which sets how big an impact the gradient has on the update. Here we use 0.01 somewhat arbitrarily; there are many techniques to find an optimal learning rate and to decay it as training continues (0.1 is also a common rule of thumb). Too big and you can overstep; too small and training is costly.
for p in n.parameters():
    p.data -= 0.01 * p.grad

n.layers[0].neurons[0].w[0].data
0.2672732181430417
After updating our parameters (the weight above moved from 0.2844 down to 0.2673, against its positive gradient), we can confirm our loss decreased:
ypred = [n(x) for x in xs]
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss
Value(data=6.317575661771169)
What we have done is gradient descent: forward pass -> backward pass -> update the parameters. Now we just have to iterate this process; let's turn this manual sequence into a training loop. One gotcha: since our _backward functions accumulate with +=, we must reset each grad to zero between iterations, which the loop below does after each update.
for k in range(10):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    loss.backward()

    # update, then zero each grad so the next backward pass starts fresh
    for p in n.parameters():
        p.data -= 0.05 * p.grad
        p.grad = 0

    print(k, loss.data)
0 0.10345239945960188
1 0.0941558588278075
2 0.0862952014023131
3 0.07956846088984432
4 0.07375215882362336
5 0.06867739459504289
6 0.06421419054843136
7 0.06026093580621215
8 0.056737086525177276
9 0.05357800634860667
ypred
[Value(data=0.8741761793258574),
Value(data=-0.8786120732704685),
Value(data=-0.8952118169896238),
Value(data=0.8903150856076475)]