```
#basic imports
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

Spelled-out intro to neural networks and backpropagation

### A blog implementation of Andrej Karpathy's micrograd video

# Contents:

### 1. Derivative of a function

### 2. Core Value Object

### 3. Manual Backpropagation

### 4. Backward function

### 5. More operations

### 6. PyTorch

# Let's get an intuitive understanding of what a derivative is

Let's use a simple quadratic function as an example

## \(f(x) = 3x^2 - 4x + 5\)

```
def f(x):
    return 3*x**2 - 4*x + 5
```

`f(3.0)`

`20.0`

Take a look at the shape of the function. We can expect a parabola since we know it's a quadratic function

```
# create a range of x values from -5 to 5 in steps of 0.25 and pass them into our function to generate the y values
xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)
```

We now want to think through what the derivative of this function is at different points x. Let's refresh with the definition of a derivative

## \(f'(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}\)

You are basically measuring how sensitively the function responds when you bump x at any point by a small number h

Intuitively, how would you expect this function to respond if we nudged x = 3.0 by this small positive number h? The amount the function's output changes, relative to h, tells you the strength of the slope

```
h = 0.0001
x = 3.0
print(f'slope of function at x = {x}, slope = {(f(x + h) - f(x)) / h}')
```

`slope of function at x = 3.0, slope = 14.000300000063248`
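As a sanity check, the finite-difference estimate can be compared with the analytic derivative \(f'(x) = 6x - 4\) that the power rule gives for this quadratic. The helper names `numerical_slope` and `analytic_slope` are illustrative and not part of the original notebook.

```python
def f(x):
    return 3*x**2 - 4*x + 5

def numerical_slope(func, x, h=0.0001):
    # forward-difference approximation of the derivative at x
    return (func(x + h) - func(x)) / h

def analytic_slope(x):
    # derivative of 3x^2 - 4x + 5 via the power rule
    return 6*x - 4

for x in (-3.0, 2/3, 3.0):
    print(x, numerical_slope(f, x), analytic_slope(x))
```

At x = 2/3 both values are close to zero, which is the bottom of the parabola where nudging x barely changes the output.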

Let's do a hacky implementation with more variables

Look at the function a*b + c in relation to the variables we assigned. If you nudged each variable by a tiny amount, would our output increase or decrease?

If we slightly nudge one of our input variables by the tiny amount h (an amount approaching 0), we can approximate the instantaneous rate of change by dividing the difference in the output before and after by the amount we nudged by. This gives us the slope with respect to that variable.

```
# let's get more complex
h = 0.0001
# inputs
a = 2.0
b = -3.0
c = 10.0

# we want to find the derivative of d with respect to a, b, c
d1 = a*b + c
a += h
d2 = a*b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)
```

```
d1 4.0
d2 3.999699999999999
slope -3.000000000010772
```

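Let's do it with b now. Note that a still carries its nudge from the previous cell, which is why the slope comes out as 2.0001 rather than exactly 2.0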

```
d1 = a*b + c
b += h
d2 = a*b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)
```

```
d1 3.999699999999999
d2 3.99990001
slope 2.0001000000124947
```

And c…

```
d1 = a*b + c
c += h
d2 = a*b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)
```

```
d1 3.99990001
d2 4.00000001
slope 0.9999999999976694
```
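The three cells above repeat the same pattern, so it can be folded into one small helper. This is only a sketch: `partial` is a hypothetical name, and each call starts again from the original inputs instead of carrying nudges over from one variable to the next.

```python
def d(a, b, c):
    return a*b + c

def partial(func, args, i, h=0.0001):
    # nudge only the i-th input by h and measure the change in the output
    nudged = list(args)
    nudged[i] += h
    return (func(*nudged) - func(*args)) / h

inputs = (2.0, -3.0, 10.0)
for i, name in enumerate("abc"):
    print(f"slope with respect to {name}: {partial(d, inputs, i)}")
```

The slopes come out close to b, a and 1 respectively, which matches differentiating a*b + c by hand.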

Hopefully this has helped build an intuitive sense of what a derivative tells you about a function. Now we want to move to neural networks, which are massive mathematical expressions, so we need some data structures to maintain these expressions. We will build a Value object that can keep track of state and let us write expressions

# Core Value Object

```
class Value:

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data)
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data)
        return out

a = Value(2.0)
b = Value(-3.0)
a + b

a * b
```

`Value(data=-6.0)`

We use double underscore (dunder) methods so Python knows what to do internally when we display a value or use operators such as +, - and *.

So when we call `a + b` above, what happens internally is `a.__add__(b)`, with a as self and b as other. Similarly, when we do `a * b`, Python calls `a.__mul__(b)`.
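To make the dispatch concrete, the sketch below calls the operator form and the explicit dunder method side by side. It redefines a minimal `Value` with only the methods discussed so far, so it runs on its own.

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(2.0)
b = Value(-3.0)

# the operator form and the explicit method call produce the same result
print(a + b, a.__add__(b))
print(a * b, a.__mul__(b))

# __repr__ controls how the object is displayed
print(repr(a))
```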

Cool, so now we can do basic arithmetic. Next we need to add pointers that record which values produce other values, and by what operations

```
class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)  # the values this one was computed from
        self._op = _op               # the operation that produced this value
        self.label = label
        self.grad = 0.0

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L
```

`Value(data=-8.0)`

Now that we have a way to store the values each result was computed from, along with labels and operations, we can visualize them as an expression graph. The code below does that using a library called graphviz

```
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in a graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for any value in the graph, create a rectangular ('record') node for it
        dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f}" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            # if this value is a result of some operation, create an op node for it
            dot.node(name = uid + n._op, label = n._op)
            # and connect this node to it
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot
```

`draw_dot(L)`

Let's recap:

- We are able to build out mathematical expressions using + and *
- We added and tracked grad so we can calculate and update this state later when we do backpropagation
- We ran a forward pass that produces the output L and visualized it
- Now we want to do backpropagation

# Manual Backpropagation

Let's manually nudge a variable by a small amount h and estimate the derivative of L with respect to it. We'll wrap this in a throwaway function, lol, so we don't pollute the global scope. We can do this for each variable to estimate the derivative of L with respect to it

### \(L = d * f\)

### \(\frac{dL}{dd} =? f\)

### \(\lim_{h \to 0} \frac{(d+h)*f - d*f}{h}\)

### \(\lim_{h \to 0} \frac{d*f + h*f - d*f}{h}\)

### \(\lim_{h \to 0} \frac{h*f}{h}\)

### \(f\)

So we can see that d.grad is just the value of f, which is -2.0, and by symmetry f.grad is just the value of d, which is 4.0. Let's go ahead and manually set these

```
f.grad = 4.0
d.grad = -2.0
```

`L.grad = 1`

```
def lol():
    h = 0.0001

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    # this is the variable we are nudging by h
    d.data += h
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data

    print((L2-L1)/h)

lol()
```

`-1.9999999999953388`

So we have derived the derivatives of L with respect to f and d step by step. Next we will uncover the core of backpropagation: we want to derive the derivative of L with respect to c and e.

We now know how sensitive L is to d, and we know how sensitive d is to e and c. We can now put that together to figure out how sensitive L is to e and c.

### If a variable z depends on the variable y, which itself depends on the variable x (that is, y and z are dependent variables), then z depends on x as well, via the intermediate variable y. In this case, the chain rule is expressed as

# \(\frac{dz}{dx} = \frac{dz}{dy} * \frac{dy}{dx}\)

The chain rule tells us how to chain these derivatives together correctly: to differentiate through a function composition, we multiply the derivatives of its parts

`The intuitive explanation here is that knowing the instantaneous rate of change of z with respect to y and that of y with respect to x allows one to calculate the instantaneous rate of change of z with respect to x`

### If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 * 4 = 8 times as fast as the man
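The chain rule can also be checked numerically. The sketch below uses two made-up functions, `y_of_x` and `z_of_y` (they do not appear in the original notebook), and compares the product of the two local slopes with the slope of the composed function.

```python
def y_of_x(x):
    return 3*x + 1          # dy/dx = 3

def z_of_y(y):
    return y**2             # dz/dy = 2*y

def slope(func, x, h=1e-6):
    # central-difference approximation of the derivative at x
    return (func(x + h) - func(x - h)) / (2*h)

x = 2.0
y = y_of_x(x)

dz_dy = slope(z_of_y, y)
dy_dx = slope(y_of_x, x)
dz_dx = slope(lambda t: z_of_y(y_of_x(t)), x)

# the product of the local slopes matches the slope of the composition
print(dz_dy * dy_dx, dz_dx)
```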

We know \(\frac{dL}{dd} = -2.0\) and \(\frac{dd}{dc} = 1.0\) and want to find \(\frac{dL}{dc}\). The chain rule tells us that \(\frac{dL}{dc} = \frac{dL}{dd} * \frac{dd}{dc}\)

\(-2.0 * 1.0 = -2.0\)

For plus nodes ("+") the chain rule just routes the gradient through, because the local derivative is 1.0. The same holds for e, so both e and c receive -2.0

```
e.grad = -2.0
c.grad = -2.0
```

`draw_dot(L)`

### Let's manually check our work by nudging c

```
def lol():
    h = 0.0001

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    # this is the variable we are nudging by h
    c = Value(10.0, label='c')
    c.data += h
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data

    print((L2-L1)/h)

lol()
```

`-1.9999999999953388`

As we expected, c.grad equals -2.0

## Now we will recurse our way backwards again and do our second application of the chain rule

### \(\frac{dL}{de} = -2.0\)

### \(\frac{de}{da} = b\)

### \(\frac{dL}{da} = \frac{dL}{de} * \frac{de}{da}\)

We are multiplying the derivative of L with respect to e by the local gradients. For a the local gradient is b = -3.0, and by symmetry \(\frac{de}{db} = a = 2.0\)

```
a.grad = -2.0 * -3.0
b.grad = -2.0 * 2.0
```

`draw_dot(L)`

Let's verify

```
def lol():
    h = 0.0001

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    # this is the variable we are nudging by h
    a = Value(2.0, label='a')
    a.data += h
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data

    print((L2-L1)/h)

lol()
```

`6.000000000021544`

Checks out

## We now know what backpropagation is: a recursive application of the chain rule backwards through the computational graph
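The whole manual pass can be condensed into a few lines of plain floats. This is a sketch that assumes the same expression, L = (a*b + c) * f, with the same input values; the `dL_d*` variable names and the `forward` helper are illustrative only. Each analytic gradient is compared with a finite-difference estimate.

```python
def forward(a, b, c, f):
    # same expression as the graph above: L = (a*b + c) * f
    e = a*b
    d = e + c
    return d*f

a, b, c, f = 2.0, -3.0, 10.0, -2.0
e = a*b
d = e + c

# backward pass: start at the output and apply the chain rule node by node
dL_dd = f                  # L = d*f, so dL/dd = f
dL_df = d                  # and dL/df = d
dL_de = 1.0 * dL_dd        # d = e + c passes the gradient through unchanged
dL_dc = 1.0 * dL_dd
dL_da = b * dL_de          # e = a*b, so de/da = b
dL_db = a * dL_de          # and de/db = a

analytic = {'a': dL_da, 'b': dL_db, 'c': dL_dc, 'f': dL_df}

# numerical check: nudge one input at a time by h
h = 0.0001
base = forward(a, b, c, f)
numeric = {
    'a': (forward(a + h, b, c, f) - base) / h,
    'b': (forward(a, b + h, c, f) - base) / h,
    'c': (forward(a, b, c + h, f) - base) / h,
    'f': (forward(a, b, c, f + h) - base) / h,
}

for name in analytic:
    print(name, analytic[name], numeric[name])
```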

## Neuron Example

One step of optimization: nudge each leaf value a small step in the direction of its gradient, and L should increase

```
a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad

# re-run the forward pass with the nudged values
e = a * b
d = e + c
L = d * f

print(L.data)
```

`-7.286496`

In our model of a neuron we have inputs x, and synapses that carry weights w. A synapse interacts with its input multiplicatively, so what flows to the cell body is w times x. There are multiple inputs, so many w times x products flow into the cell body. The cell body also has a bias, a sort of trigger-happiness of the neuron that makes it more or less prone to firing. We then pass the sum through an activation function, which is generally some kind of squashing function like a sigmoid or tanh. Let's go over an example of a tanh activation function

`plt.plot(np.arange(-5, 5, 0.2), np.tanh(np.arange(-5, 5, 0.2))); plt.grid();`

You can see that the inputs get squashed on the y axis: the function levels off towards 1.0 for large positive inputs and towards -1.0 for large negative inputs
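Before wiring this up with Value objects, here is the same neuron model on plain floats. It is a minimal sketch: the `neuron` function is a hypothetical helper, and the inputs, weights and bias are the values used in the example that follows.

```python
import math

def neuron(xs, ws, b):
    # weighted sum of the inputs plus the bias, squashed by tanh
    n = sum(w*x for w, x in zip(ws, xs)) + b
    return math.tanh(n)

out = neuron([2.0, 0.0], [-3.0, 1.0], 6.8813735870195432)
print(out)

# a very large pre-activation saturates close to 1.0
print(neuron([10.0], [1.0], 0.0))
```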

```
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
#6.8813735870195432
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
draw_dot(n)
```

We need to add more operations to our Value class to be able to calculate our activation function tanh. Let's just do a cheeky implementation of tanh directly on our Value class for now

```
class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

    def tanh(self):
        x = self.data
        # tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')
        return out
```

```
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
#6.8813735870195432
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
draw_dot(n)
```

`o = n.tanh(); o.label = 'o'`

`draw_dot(o)`

Awesome, n goes through tanh to produce the final output, so our activation function is working. Now all we need is the derivative of tanh and we can use backpropagation. Since \(o = \tanh(n)\), the local derivative is \(\frac{do}{dn} = 1 - \tanh^2(n) = 1 - o^2\).

`o.grad = 1.0`

Let's calculate the gradient of n

`1 - o.data**2`

`0.4999999999999999`

`n.grad = 0.5`
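The expression `1 - o.data**2` relies on the identity that the derivative of tanh(x) is 1 - tanh(x)**2. The sketch below checks that identity numerically at the value of n from this example; `tanh_grad` and `numeric_grad` are illustrative helper names.

```python
import math

def tanh_grad(x):
    # analytic derivative: d/dx tanh(x) = 1 - tanh(x)**2
    return 1 - math.tanh(x)**2

def numeric_grad(x, h=1e-6):
    # central-difference approximation of the same derivative
    return (math.tanh(x + h) - math.tanh(x - h)) / (2*h)

n = 0.8813735870195432   # the pre-activation value of the neuron above
print(tanh_grad(n), numeric_grad(n))
```

Both numbers land close to 0.5, which is the value assigned to n.grad.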

Now we can easily get the gradients for x1w1x2w2, b, x1w1 and x2w2. Since these were combined with addition, the local derivatives are just 1, so each one simply receives the value 0.5

```
x1w1x2w2.grad = 0.5
b.grad = 0.5
x1w1.grad = 0.5
x2w2.grad = 0.5
```

We can now calculate the gradients for x2, w2, x1 and w1. Unlike the previous nodes, these were combined with multiplication, so the local derivative is just the other term in the product. Let's calculate the gradients

```
x2.grad = w2.data * x2w2.grad
w2.grad = x2.data * x2w2.grad
x1.grad = w1.data * x1w1.grad
w1.grad = x1.data * x1w1.grad
```

`draw_dot(o)`

Nice! We have manually used backpropagation to calculate our gradients. Now let's implement a backward function for each operation

# Backward Function

Let's add a _backward function to each operation on our Value object that we can call to calculate the gradients for us using backpropagation

```
class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        # leaf nodes have nothing to propagate, so the default does nothing
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')

        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward

        return out
```

```
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
#6.8813735870195432
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
draw_dot(o)
```

We initialize the gradient of o to 1.0, then call _backward on each node in turn, working from the output back towards the inputs

`o.grad = 1.0`

`o._backward()`

`draw_dot(o)`

```
n._backward()
b._backward()
x1w1x2w2._backward()
x2w2._backward()
x1w1._backward()
```

`draw_dot(o)`

Awesome, this works great. We never want to call _backward() on a node before we have calculated the gradients of everything after it, since it depends on their gradients. We will use a topological sort, which lays out the graph so that all the edges go only from left to right.

```
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
#6.8813735870195432
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
draw_dot(o)
```

```
# topological sort
topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        # a node is appended only after all of its children
        topo.append(v)
build_topo(o)
topo
```

```
[Value(data=1.0),
Value(data=0.0),
Value(data=0.0),
Value(data=2.0),
Value(data=-3.0),
Value(data=-6.0),
Value(data=-6.0),
Value(data=6.881373587019543),
Value(data=0.8813735870195432),
Value(data=0.7071067811865476)]
```

Our Value objects are now ordered properly
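To see the ordering property without Value objects, the sketch below runs the same depth-first idea on a plain dictionary. The `children` mapping is a hand-written stand-in for the `_prev` sets of the neuron graph, and `topo_sort` is a hypothetical helper name.

```python
# each key maps a node to the nodes it was computed from (its children)
children = {
    'o': ['n'],
    'n': ['x1w1x2w2', 'b'],
    'x1w1x2w2': ['x1w1', 'x2w2'],
    'x1w1': ['x1', 'w1'],
    'x2w2': ['x2', 'w2'],
}

def topo_sort(root):
    order, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in children.get(v, []):
                build(child)
            order.append(v)   # appended only after all of its children
    build(root)
    return order

order = topo_sort('o')
print(order)

# every node appears after all of its children
position = {name: i for i, name in enumerate(order)}
assert all(position[c] < position[p] for p, cs in children.items() for c in cs)
```

Because the output node comes last, walking the list in reverse visits each node only after everything that depends on it, which is the order backpropagation needs.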

`o.grad = 1.0`

```
for node in reversed(topo):
    node._backward()
```

`draw_dot(o)`

# Let's implement this as a method in the Value object and add more operations

```
class Value:

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # wrap plain numbers so that Value + 1 works
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1

    def __neg__(self): # -self
        return self * -1

    def __sub__(self, other): # self - other
        return self + (-other)

    def __radd__(self, other): # other + self
        return self + other

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward

        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')

        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward

        return out

    def exp(self):
        x = self.data
        out = Value(math.exp(x), (self, ), 'exp')

        def _backward():
            self.grad += out.data * out.grad
        out._backward = _backward

        return out

    # Now we can call backward on our Value object
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
```

```
a = Value(2.0)
b = Value(4.0)
a - b
```

`Value(data=-2.0)`

```
a = Value(2.0)
a.exp()
```

`Value(data=7.38905609893065)`

```
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'

o.backward()
draw_dot(o)
```

```
# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
# ---- tanh built from exp, subtraction and division
e = (2*n).exp()
o = (e - 1) / (e + 1)
# ----
o.label = 'o'

o.backward()
draw_dot(o)
```

# PyTorch comparison

```
import torch

x1 = torch.Tensor([2.0]).double() ; x1.requires_grad = True
x2 = torch.Tensor([0.0]).double() ; x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double() ; w1.requires_grad = True
w2 = torch.Tensor([1.0]).double() ; w2.requires_grad = True
b = torch.Tensor([6.8813735870195432]).double() ; b.requires_grad = True
n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())
o.backward()

print('---')
print('x2', x2.grad.item())
print('w2', w2.grad.item())
print('x1', x1.grad.item())
print('w1', w1.grad.item())
```

```
0.7071066904050358
---
x2 0.5000001283844369
w2 0.0
x1 -1.5000003851533106
w1 1.0000002567688737
```
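The PyTorch output differs from our 0.7071067811865476 around the seventh decimal place. A likely explanation is that `torch.Tensor([...])` builds a float32 tensor, so the bias is rounded to single precision before `.double()` widens it again. The sketch below reproduces that rounding with the standard library only; `to_float32` is a hypothetical helper and the snippet does not use PyTorch at all.

```python
import math
import struct

def to_float32(x):
    # round a 64-bit Python float to the nearest 32-bit float
    return struct.unpack('f', struct.pack('f', x))[0]

b64 = 6.8813735870195432
b32 = to_float32(b64)

n64 = 2.0 * -3.0 + 0.0 * 1.0 + b64
n32 = 2.0 * -3.0 + 0.0 * 1.0 + b32

print(math.tanh(n64))  # bias kept in double precision
print(math.tanh(n32))  # bias rounded to float32 first
```

If this explanation holds, creating the tensors directly in double precision, for example with `torch.tensor(..., dtype=torch.float64)`, should bring the two results into closer agreement.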

`o`

`tensor([0.7071], dtype=torch.float64, grad_fn=<TanhBackward0>)`

`o.item()`

`0.7071066904050358`

```
import random

class Neuron:

    def __init__(self, nin):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1,1))

    def __call__(self, x):
        # w * x + b
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        return out

    def parameters(self):
        return self.w + [self.b]

class Layer:

    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
        # equivalent to:
        # params = []
        # for neuron in self.neurons:
        #     ps = neuron.parameters()
        #     params.extend(ps)
        # return params

class MLP:

    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```

```
x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])
n(x)
```

`Value(data=-0.7028959990425087)`

`len(n.parameters())`

`41`

```
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0]
]
ys = [1.0, -1.0, -1.0, 1.0] # desired targets
ypred = [n(x) for x in xs]
ypred
```

```
[Value(data=-0.7028959990425087),
Value(data=0.1757758058598642),
Value(data=0.39624177470697325),
Value(data=0.16264413106842704)]
```

So how do we tune the weights to better predict the desired targets? We calculate a single number that measures the total performance of the neural net. This is called the loss; here we use the sum of squared errors.

```
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss
```

`Value(data=6.932959473871423)`
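The same number can be reproduced with plain floats, which makes it easy to see how much each example contributes. This sketch copies the predictions printed above; the `errors` list is an illustrative name.

```python
ys = [1.0, -1.0, -1.0, 1.0]
ypred = [-0.7028959990425087, 0.1757758058598642,
         0.39624177470697325, 0.16264413106842704]

# squared error per example, then the total
errors = [(yout - ygt)**2 for ygt, yout in zip(ys, ypred)]
loss = sum(errors)

print(errors)
print(loss)
```

The first example contributes the most because the network predicts roughly -0.70 where the target is 1.0.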

`loss.backward()`

Now that we have called backward on the loss, we can look at the gradient of a single weight in one of our layers to see how it impacts the loss. This will be useful soon when we update our weights to decrease the loss

`n.layers[0].neurons[0].w[0].data`

`0.28438888706081467`

The gradient for this weight is positive, so increasing the weight would increase our loss

`n.layers[0].neurons[0].w[0].grad`

`1.711566891777295`

`n.layers[0].neurons[0].b.grad`

`1.0849775541180198`

We can now also call draw_dot on our loss. We can see the DAG has grown tremendously in complexity

`draw_dot(loss)`

We can update our parameters by subtracting their gradients multiplied by a learning rate, which controls how much each gradient changes its parameter. Here we use 0.01 arbitrarily, but there are many techniques for finding a good learning rate and for decaying it as training continues, and 0.1 is also a common rule of thumb. Too big and you can overstep; too small and training becomes costly

```
for p in n.parameters():
    p.data -= 0.01 * p.grad
```

`n.layers[0].neurons[0].w[0].data`

`0.2672732181430417`

After updating our parameters we can confirm our loss decreased

```
ypred = [n(x) for x in xs]
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss
```

`Value(data=6.317575661771169)`
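The update rule can also be tried in isolation on the quadratic from the start of this post, where the gradient is known analytically. This is a sketch with an assumed starting point and learning rate, and `df` is written by hand and not computed by backpropagation.

```python
def f(x):
    return 3*x**2 - 4*x + 5

def df(x):
    # analytic derivative of f
    return 6*x - 4

x = 3.0       # assumed starting point
lr = 0.05     # assumed learning rate
history = []
for k in range(20):
    history.append(f(x))      # evaluate the function (forward pass)
    grad = df(x)              # gradient at the current x
    x -= lr * grad            # step against the gradient

print(x, f(x))
```

Each step moves x towards 2/3, the minimum of the parabola, and the recorded values of f shrink from one step to the next.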

What we have done is gradient descent: forward pass -> backward pass -> update the parameters. Now we just have to iterate this process. Let's turn this manual process into a training loop

```
for k in range(10):

    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    loss.backward()

    # update, then reset each gradient so it does not accumulate into the next iteration
    for p in n.parameters():
        p.data -= 0.05 * p.grad
        p.grad = 0

    print(k, loss.data)
```

```
0 0.10345239945960188
1 0.0941558588278075
2 0.0862952014023131
3 0.07956846088984432
4 0.07375215882362336
5 0.06867739459504289
6 0.06421419054843136
7 0.06026093580621215
8 0.056737086525177276
9 0.05357800634860667
```

`ypred`

```
[Value(data=0.8741761793258574),
Value(data=-0.8786120732704685),
Value(data=-0.8952118169896238),
Value(data=0.8903150856076475)]
```