Commit

Update lecture_en.md
yqyq-w committed Jun 3, 2024
1 parent 6436c4e commit 10d6916
Showing 1 changed file with 7 additions and 7 deletions.
course12/lecture_en.md (14 changes: 7 additions & 7 deletions)
@@ -28,7 +28,7 @@ This way, we can gradually approach zero and get an approximate solution. We wil

![height:600px](../pics/geogebra-export%20(8).png)

Today, we will look at the following simple combination of functions, involving only addition and multiplication. For example, when calculating 5 times $x_0$ squared plus $x_1$, if $x_0$ is 10 and $x_1$ is 100, we need to calculate the value of the function, 600, the partial derivative with respect to $x0$, 100, and the partial derivative with respect to $x_1$, 1.
Today, we will look at the following simple combination of functions, involving only addition and multiplication. For example, when calculating 5 times $x_0$ squared plus $x_1$, if $x_0$ is 10 and $x_1$ is 100, we need to calculate the value of the function, 600, the partial derivative with respect to $x_0$, 100, and the partial derivative with respect to $x_1$, 1.

*Example:* $f(x_0, x_1) = 5{x_0}^2 + {x_1}$
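
Substituting $x_0 = 10$ and $x_1 = 100$ makes these three numbers concrete:

$$
f(10, 100) = 5 \times 10^2 + 100 = 600, \qquad
\frac{\partial f}{\partial x_0} = 10 x_0 = 100, \qquad
\frac{\partial f}{\partial x_1} = 1
$$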

@@ -38,7 +38,7 @@ Today, we will look at the following simple combination of functions, involving

# Differentiation

There are several ways to differentiate a function. The first method is manual differentiation where we use a piece of paper and a pen as a natural calculator. The drawback is that it's easy to make mistakes with complex expressions and we can't just manually calculate 24 hours a day. The second method is numerical differentiation: $\frac{ \texttt{f}(x + \delta x) - \texttt{f}(x) }{ \delta x }$, where we add a small value (approaching zero) to the point we want to differentiate, calculate the difference, and divide it by the small value. The issue here is that computers cannot accurately represent decimals, and the larger the absolute value, the less accurate it is. Also, we cannot fully solve infinite series. The third method is symbolic differentiation, where we convert the function into an expression tree and then operate on the tree to get the derivative. Take $Mul(Const(2), Var(1)) \to Const(2)$ for example: here the differentiation result of constant 2 multiplied by x will be constant 2. The problem with symbolic differentiation is that the calculation results may not be simplified enough, and there may be redundant calculations. In addition, it's hard to directly use native control flow like conditionals and loops. If we want to define a function to find the larger value, we have to define an operator instead of simply comparing the current values.
There are several ways to differentiate a function. The first method is manual differentiation where we use a piece of paper and a pen as a natural calculator. The drawback is that it's easy to make mistakes with complex expressions and we can't just manually calculate 24 hours a day. The second method is numerical differentiation: $\frac{ \texttt{f}(x + \delta x) - \texttt{f}(x) }{ \delta x }$, where we add a small value (approaching zero) to the point we want to differentiate, calculate the difference, and divide it by the small value. The issue here is that computers cannot accurately represent decimals, and the larger the absolute value, the less accurate it is. Also, we cannot fully solve infinite series. The third method is symbolic differentiation, where we convert the function into an expression tree and then operate on the tree to get the derivative. Take $\textit{Mul(Const(2), Var(1))} \to \textit{Const(2)}$ for example: here the differentiation result of constant 2 multiplied by x will be constant 2. The problem with symbolic differentiation is that the calculation results may not be simplified enough, and there may be redundant calculations. In addition, it's hard to directly use native control flow like conditionals and loops. If we want to define a function to find the larger value, we have to define an operator instead of simply comparing the current values.
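
As a side note, the finite-difference formula above is straightforward to write down; the sketch below is a hypothetical helper (not part of the lecture's code) illustrating why its accuracy is limited by the step size and floating-point rounding:

```moonbit no-check
// Hypothetical helper, not from the lecture: approximate f'(x) with a small
// forward step. The step cannot truly "approach zero", so the result carries
// truncation error and floating-point rounding error.
fn numeric_diff(f : (Double) -> Double, x : Double) -> Double {
  let delta = 0.000001
  (f(x + delta) - f(x)) / delta
}

// numeric_diff(fn(x) { x * x }, 10.0) evaluates to roughly 20.000001,
// not the exact derivative 20.
```

The lecture's own snippet below then illustrates the last point about symbolic differentiation: native operations such as comparison have to be re-defined as operators on expressions.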

```moonbit no-check
// Need to define additional native operators for the same effect
@@ -225,12 +225,12 @@ inspect(Forward::var(10.0, false) * Forward::var(100.0, true), ~content="{value:

### Backward Differentiation

Backward differentiation utilizes the chain rule for calculation. Suppose we have a function $w$ of $x, y, z$, etc., and $x, y, z$, etc. are functions of $t$. Then the partial derivative of $w$ with respect to $t$ is the partial derivative of $w$ with respect to $x$ times the partial derivative of $x$ with respect to $t$, plus the partial derivative of $w$ with respect to $y$ times the partial derivative of $y$ with respect to $t$, plus the partial derivative of $w$ with respect to $z$ times the partial derivative of $z$ with respect to $t$, and so on.
Backward differentiation utilizes the chain rule for calculation. Suppose we have a function $w$ of $x$, $y$, $z$, etc., and $x$, $y$, $z$, etc. are functions of $t$. Then the partial derivative of $w$ with respect to $t$ is the partial derivative of $w$ with respect to $x$ times the partial derivative of $x$ with respect to $t$, plus the partial derivative of $w$ with respect to $y$ times the partial derivative of $y$ with respect to $t$, plus the partial derivative of $w$ with respect to $z$ times the partial derivative of $z$ with respect to $t$, and so on.

- Given $w = f(x, y, z, \cdots), x = x(t), y = y(t), z = z(t), \cdots$
$\frac{\partial w}{\partial t} = \frac{\partial w}{\partial x} \frac{\partial x}{\partial t} + \frac{\partial w}{\partial y} \frac{\partial y}{\partial t} + \frac{\partial w}{\partial z} \frac{\partial z}{\partial t} + \cdots$

For example, for $f(x0, x1) = x0 ^ 2 \times x1$, we can consider $f$ as a function of $g$ and $h$, where $g$ and $h$ are $x0 ^ 2$ and $x1$ respectively. We differentiate each component: the partial derivative of $f$ with respect to $g$ is $h$; the partial derivative of $f$ with respect to $h$ is $g$; the partial derivative of $g$ with respect to $x_0$ is $2x_0$, and the partial derivative of $h$ with respect to $x_0$ is 0. Lastly, we combine them using the chain rule to get the result $2x_0x_1$. Backward differentiation is the process where we start with the partial derivative of $f$ with respect to $f$, followed by calculating the partial derivatives of $f$ with respect to the intermediate functions and their partial derivatives with respect to the intermediate functions, until we reach the partial derivatives with respect to the input parameters. This way, by tracing backward and creating the computation graph of $f$ in reverse order, we can compute the derivative of each input node. This is suitable for cases where there are more input parameters than output parameters.
For example, for $f(x_0, x_1) = x_0 ^ 2 \times x_1$, we can consider $f$ as a function of $g$ and $h$, where $g$ and $h$ are $x_0 ^ 2$ and $x_1$ respectively. We differentiate each component: the partial derivative of $f$ with respect to $g$ is $h$; the partial derivative of $f$ with respect to $h$ is $g$; the partial derivative of $g$ with respect to $x_0$ is $2x_0$, and the partial derivative of $h$ with respect to $x_0$ is 0. Lastly, we combine them using the chain rule to get the result $2x_0x_1$. Backward differentiation is the process where we start with the partial derivative of $f$ with respect to $f$, followed by calculating the partial derivatives of $f$ with respect to the intermediate functions and their partial derivatives with respect to the intermediate functions, until we reach the partial derivatives with respect to the input parameters. This way, by tracing backward and creating the computation graph of $f$ in reverse order, we can compute the derivative of each input node. This is suitable for cases where there are more input parameters than output parameters.

- Example: $f(x_0, x_1) = {x_0} ^ 2 x_1$
- Decomposition: $f = g h, g(x_0, x_1) = {x_0} ^ 2, h(x_0, x_1) = x_1$
@@ -259,7 +259,7 @@ fn Backward::backward(b : Backward, d : Double) -> Unit { (b.backward)(d) }
fn Backward::value(backward : Backward) -> Double { backward.value }
```
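
For readers viewing only this diff, the two accessors above imply a record of roughly the following shape; this is a sketch inferred from those accessors, and the actual definition sits in the collapsed part of the file:

```moonbit no-check
// Inferred sketch, not the verbatim definition from the lecture: each node
// stores its computed value plus a callback that receives the accumulated
// derivative of the final output with respect to this node and propagates it.
struct Backward {
  value : Double
  backward : (Double) -> Unit
}
```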

Next, let's look at addition and multiplication. Suppose the functions $g$ and $h$ are involved in computation, the current function is $f$, and the final result is $y$, with $x$ as a parameter. We've previously mentioned the partial derivatives of $f$ with respect to $g$ and $h$ and will omit them here. For the accumulated partial derivative of $y$ with respect to $x$, the partial derivative through the path of $f$ and $g$ is the partial derivative of $y$ with respect to $f$ times the partial derivative of $f$ with respect to $g$ times the partial derivative of $g$ with respect to $x$. Here, the partial derivative of $y$ with respect to $f$ corresponds to the parameter $diff$ in the `backward` function. So we can see in line 4 that the parameter we pass to $g$ is $diff \times 1.0$, which corresponds to the partial derivative of $y$ with respect to $f$ times the partial derivative of $f$ with respect to $g$. We'll pass a similar parameter to $h$. In line 11, according to the derivative rules, the parameter passed to $g$ is $diff$ times the current value of $h$, and the parameter passed to $h$ is $diff$ times the current value of $g$.
Next, let's look at addition and multiplication. Suppose the functions $g$ and $h$ are involved in computation, the current function is $f$, and the final result is $y$, with $x$ as a parameter. We've previously mentioned the partial derivatives of $f$ with respect to $g$ and $h$ and will omit them here. For the accumulated partial derivative of $y$ with respect to $x$, the partial derivative through the path of $f$ and $g$ is the partial derivative of $y$ with respect to $f$ times the partial derivative of $f$ with respect to $g$ times the partial derivative of $g$ with respect to $x$. Here, the partial derivative of $y$ with respect to $f$ corresponds to the parameter $\textit{diff}$ in the `backward` function. So we can see in line 4 that the parameter we pass to $g$ is $\textit{diff} \times 1.0$, which corresponds to the partial derivative of $y$ with respect to $f$ times the partial derivative of $f$ with respect to $g$. We'll pass a similar parameter to $h$. In line 11, according to the derivative rules, the parameter passed to $g$ is $\textit{diff}$ times the current value of $h$, and the parameter passed to $h$ is $\textit{diff}$ times the current value of $g$.

```moonbit
fn Backward::op_add(g : Backward, h : Backward) -> Backward {
@@ -291,7 +291,7 @@ test "Backward differentiation" {
}
```

Now with backward differentiation, we can try to write a neural network. However, due to time constraints, we'll only demonstrate automatic differentiation and Newton's method to approximate zeros. Let's use the interface to define the functions we saw at the beginning.
Now with backward differentiation, we can try to write a neural network. In this lecture, we'll only demonstrate automatic differentiation and Newton's method to approximate zeros. Let's use the interface to define the functions we saw at the beginning.

Then, we'll use Newton's method to find the value. Since there is only one parameter, we'll use forward differentiation.
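
The actual iteration is in the collapsed portion of the file; as a rough sketch, Newton's method with forward differentiation can look like the following. The `Forward::var` constructor and the `value`/`derivative` fields are taken from the snippets above, while the helper name `newton_zero` and the fixed iteration count are assumptions made for illustration:

```moonbit no-check
// Sketch only: repeatedly apply x <- x - f(x) / f'(x), obtaining the value
// and the derivative in a single pass via the Forward type shown earlier.
fn newton_zero(f : (Forward) -> Forward) -> Double {
  let mut x = 1.0                      // initial guess, as in the lecture
  let mut i = 0
  while i < 20 {                       // a fixed step count keeps the sketch simple
    let r = f(Forward::var(x, true))   // differentiate with respect to x
    x = x - r.value / r.derivative     // Newton's update
    i = i + 1
  }
  x
}
```

Each call to `f` evaluates the function and its derivative at the current `x` in one pass, which is what makes forward mode convenient when there is only a single input parameter.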

@@ -325,7 +325,7 @@ Let's define $x$ as the iteration variable with an initial value of 1.0. Since $

# Summary

To summarize, in this lecture we introduced the concept of automatic differentiation. We presented symbolic differentiation and two different implementations of automatic differentiation. For students interested in learning more, we recommend the *3Blue1Brown* series on deep learning (including topics like gradient descent, backpropagation algorithms), and try to write your own neural network.
To summarize, in this lecture we introduced the concept of automatic differentiation. We presented symbolic differentiation and two different implementations of automatic differentiation. For students interested in learning more, we recommend the *3Blue1Brown* series on deep learning (including topics like [gradient descent](https://www.youtube.com/watch?v=IHZwWFHWa-w), [backpropagation algorithms](https://www.youtube.com/watch?v=Ilg3gGewQ5U)), and try to write your own neural network.


