I began with the Math of DL book and picked up some NumPy tools that can generate values from given probability distributions. That changed my perspective, and I moved on to learning about neural networks and back-propagation directly from Karpathy's videos.
- The idea that simple connected neurons can learn and reproduce any kind of input is fascinating, and one should pay more attention to it.
- Inspired by a model of the brain, back-propagation is one of the most powerful ideas of the last century. If you understand manifolds, derivatives, and computation, you can grasp how they can be used to learn almost anything and to create intelligent systems that understand the world. The potential applications are many, and it is exciting that some of the smartest humans are working on this problem. It is similar to general relativity: according to Hilbert, the math could have been solved by anyone at the University of Göttingen, but it was Einstein's thinking that revolutionized our perception of reality.
Neural network — Simple understanding
- Basic network — Neural networks are fundamentally simple in their basic concept. You have a multitude of inputs, which we can denote as x1, x2, …, xn (what these inputs represent does not matter: they could be English letters, sound signals, or image pixels). These input values are transmitted to the first-level neurons through synapses, which are essentially electrical threads that carry the information along. Each neuron has weights, numerical values that determine the strength of the connection between the neuron and each of its inputs. Now imagine billions of neurons and synapses, and the intricate series of transformations the values go through. Ultimately, these transformed values yield outputs that are correlated with the inputs. This phenomenon can be referred to as “Understanding”.
- Network equation (W & b) — Let’s say a synapse (W) has the value 0.314. Any input (X) passed through this synapse outputs 0.314*X. If X passes through multiple neurons, the output is still just a constant times X. That doesn’t seem intelligent; it’s just a guess based on the brain’s structure. The information flowing through a neuron isn’t only a multiple of X: it also picks up an additional value called the bias. This is a reasonable neuron model. However, the entire input then goes through a linear transformation, and no matter the values of W & b, a linear map can’t be intelligence; it’s more like rote learning. To solve this, we add non-linearity to the neurons so that the values pass through a curve such as sigmoid or tanh. These are best guesses, and they don’t solve “The Big Problem”: we haven’t achieved AGI, but they provide great utility for solving complex problems. (A minimal sketch of this forward pass appears right after this list.)
- Initialization of W & b — We first initialize them as random values, run the inputs through the network (the layered neurons), and compare the network’s output with the input’s label to see how different the two are. (The input labels are probability distributions, and the output values from the network are probability distributions as well.) We have many statistical tools for comparing such distributions. You can call it a loss function or cross-entropy, but it measures the difference between two probability distributions. (Entropy, by definition, measures randomness; in this case, the randomness in the network.) We need to minimize this entropy across all input values by adjusting the W & b values. Achieving a reasonable entropy at the outset is itself a complex challenge. Fortunately, our understanding of probability distributions, particularly the normal (Gaussian) distribution, provides effective methods for initialization. The normal distribution is ubiquitous in nature and offers a robust starting point for the parameters. Combined with normalization techniques applied across the different layers of the network, this helps establish a well-calibrated entropy at the beginning of training. (The second sketch after this list shows such an initialization together with a cross-entropy calculation.)
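To make the W, b, and non-linearity story concrete, here is a minimal NumPy sketch of a single layer of neurons. The layer sizes, the input values, and the choice of tanh are made up for illustration; this is a sketch of the idea, not anyone's actual implementation.

```python
import numpy as np

# One layer: n_in inputs feeding n_out neurons.
# Each neuron has one weight per input (its "synapse strength") plus a bias.
n_in, n_out = 3, 2                  # illustrative sizes
x = np.array([0.5, -1.2, 3.0])      # some inputs x1, x2, x3

rng = np.random.default_rng(0)
W = rng.normal(size=(n_out, n_in))  # weights: one row per neuron
b = rng.normal(size=n_out)          # biases: one per neuron

linear = W @ x + b                  # the purely linear transformation
activated = np.tanh(linear)         # the non-linearity applied on top

print("linear output:", linear)
print("after tanh   :", activated)
```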
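The initialization and “difference between two probability distributions” ideas can be sketched the same way: draw W from a Gaussian, turn the network’s output into a probability distribution, and measure the cross-entropy against the label distribution. The softmax step, the 1/sqrt(n_in) scaling, and the one-hot label are assumptions added here for illustration, not details from the text above.

```python
import numpy as np

rng = np.random.default_rng(42)

n_in, n_classes = 4, 3
x = rng.normal(size=n_in)              # a single input vector
label = np.array([0.0, 1.0, 0.0])      # one-hot label: itself a probability distribution

# Gaussian initialization; scaling by 1/sqrt(n_in) is one common convention.
W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_classes, n_in))
b = np.zeros(n_classes)

logits = W @ x + b
probs = np.exp(logits - logits.max())  # softmax turns raw outputs into a distribution
probs /= probs.sum()

# Cross-entropy: how different the predicted distribution is from the label.
cross_entropy = -np.sum(label * np.log(probs + 1e-12))
print("predicted distribution:", probs)
print("cross-entropy         :", cross_entropy)
```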
3. Adjust W & b so that we reduce the entropy — Model training.
For each input value, calculate its cross-entropy and adjust W & b so that the network's outputs move closer to the input labels.
Consider the entire equation as a manifold; training changes that manifold so it fits the data better. This can require significant transformations, and the manifold may diverge or become unstable from one iteration to the next.
Use gradient descent to adjust W & b: calculate the gradient at the current point and nudge W & b a little in the direction that reduces the entropy.
Example: 2x1 + 5x2 + 6, written as (2x1 + 4) + (5x2 + 2), i.e. w1 = 2, b1 = 4, w2 = 5, b2 = 2. For a given input, say (0.4, 7), this produces some entropy X. To reduce it, we change the equation slightly, for example to 2.1x1 + 5.01x2 + 5.34.
The updates differ from one input to the next, but once we complete a pass over the whole set, the parameters end up roughly the same regardless of the order in which the inputs were processed.
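As a rough sketch of this update, here is gradient descent on the toy equation above. The target value, the squared-error loss, and the learning rate are assumptions made up for illustration (the text does not specify them), so the adjusted coefficients will not match 2.1 and 5.01 exactly, but the mechanism is the same: compute the gradient and nudge W & b a little.

```python
# Toy model from the example above: y = w1*x1 + w2*x2 + (b1 + b2)
w1, b1, w2, b2 = 2.0, 4.0, 5.0, 2.0
x1, x2 = 0.4, 7.0

# Hypothetical target and squared-error loss, chosen only for illustration.
target = 40.0
lr = 0.001                              # small learning rate: "slightly change W & b"

for step in range(3):
    y = w1 * x1 + w2 * x2 + (b1 + b2)   # forward pass
    loss = (y - target) ** 2            # how far the output is from the label

    # Gradient of the loss with respect to each parameter (chain rule).
    dy = 2.0 * (y - target)
    w1 -= lr * dy * x1
    w2 -= lr * dy * x2
    b1 -= lr * dy
    b2 -= lr * dy

    print(f"step {step}: loss={loss:.4f}, "
          f"equation: {w1:.3f}*x1 + {w2:.3f}*x2 + {b1 + b2:.3f}")
```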
For complex neural networks, the full equation can get very complicated. In principle we can print it out in terms of the input dimensions; our tools don't support this out of the box yet, but it would be cool to do.
One simple example is: tanh(((x1 * -3.876549626386776) + (x2 * 1.0239637844356497)) + 6.853873587019543)
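For a single neuron this kind of printed equation is easy to sketch, for example with sympy; the weight and bias values below are simply the numbers from the example expression above, and the use of sympy is my own choice here, not a tool mentioned in the text.

```python
import sympy as sp

# Symbolic inputs and (illustrative) trained parameter values.
x1, x2 = sp.symbols("x1 x2")
w1, w2, b = -3.876549626386776, 1.0239637844356497, 6.853873587019543

# A single tanh neuron written out as an explicit equation in its inputs.
neuron = sp.tanh(w1 * x1 + w2 * x2 + b)
print(neuron)   # prints something like: tanh(-3.876...*x1 + 1.024...*x2 + 6.854...)
```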
For GPT-2, the vocabulary is about 50K tokens and there are over a hundred million parameters, so the written-out equation would be enormous.
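To get a feel for “enormous”, here is a back-of-the-envelope parameter count for the smallest GPT-2 configuration (vocabulary 50,257 tokens, context 1,024, width 768, 12 layers). It is a rough estimate that ignores layer norms and biases, not an exact figure.

```python
# Rough parameter count for the smallest GPT-2 model.
n_vocab, n_ctx, d, n_layer = 50_257, 1_024, 768, 12

token_emb = n_vocab * d                  # token embedding table
pos_emb = n_ctx * d                      # position embeddings
per_layer = 4 * d * d + 2 * d * (4 * d)  # attention (Q, K, V, out) + MLP (up and down)
total = token_emb + pos_emb + n_layer * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # roughly 124M
```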
In neural networks, the number of input dimensions matters, but small changes to it usually have minimal impact; only a significant increase (say, an order of magnitude) noticeably affects computational cost and model performance. This gives anyone building models a lot of flexibility.
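One way to see why only large jumps in input dimension matter much: the size of the first layer grows linearly with the number of inputs, so a handful of extra dimensions is negligible while a 10x increase is not. A tiny sketch with made-up layer widths:

```python
# Parameter count of the first layer for different input dimensions
# (hidden width fixed at 512; all sizes are illustrative).
hidden = 512
for n_in in (100, 105, 1_000):
    params = n_in * hidden + hidden      # weights + biases
    print(f"{n_in:>5} inputs -> {params:,} first-layer parameters")
```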