So this paper is intentionally technical. I wanted to strip AI down to its mechanics. What is it mathematically? Where are its limits? What is substance and what is performance? If this technology is positioned as foundational to the next wave of economic transformation, then I need to understand more than the narrative. I want to be able to distinguish depth from noise.
The mathematics referenced here is something I now understand conceptually and can explain. I may not remember every formula line by line without notes, and in one week I have certainly not become an AI expert. But I have put in the effort to move beyond surface familiarity. I have read, deconstructed, questioned, and rebuilt the foundations in my own words. That matters. There is a difference between quoting terms like backpropagation or universal approximation and actually understanding what they imply.
Artificial intelligence is no longer peripheral. It is embedded in financial markets, consumer platforms, governance systems, and national strategy. If this is the frontier shaping capital and policy, then foundational literacy is not optional. It is necessary.

In classic fashion, I began by building a graphic that attempts to visually deconstruct what AI is, with the help of AI itself. From there, this paper moves from perceptrons to deep learning, from function approximation to ontology. I am not positioning myself as an authority after one week. I am positioning myself as someone who chose to engage deeply rather than repeat what everyone else is saying.
|
“Artificial intelligence is the science and engineering of making intelligent machines, especially intelligent computer programs.”
John McCarthy, 1956[1]
|
Let’s start simple.
Artificial Intelligence is a system that simulates human cognitive capabilities such as learning, reasoning, and decision making using computational models. It runs on algorithms, data, and processing power.[2] That is the baseline definition. But then I ask myself, simulate how?
AI systems perceive input, learn patterns from data, and take goal-oriented actions. Neural networks in chatbots and image recognition systems are not “thinking.” They are analyzing patterns at scale. Deep learning is, at its core, layered statistical modelling. Many of these topics (neural networks, deep learning, machine learning more broadly) are too deep to cover in full in this report, so I stick to the foundations. So, if AI can learn and decide, what makes it artificial?
Artificial intelligence is engineered. Humans design the architecture, define the objectives, and feed the data. It mimics cognition without biological evolution, consciousness, or emotion.
Natural intelligence, on the other hand, emerges from evolutionary biology. It is embodied. It feels. It integrates empathy, memory, intuition, and self awareness.
AI processes data probabilistically. Humans experience reality subjectively.
So the next logical question becomes: can we engineer something that crosses that gap?
When I strip everything back, the real divide between artificial and natural intelligence is origin. Not performance, or speed, or the benchmark scores that OpenAI boasts about (congratulations, your LLM passed the SATs).

Source: AI Index
Natural intelligence emerges from evolutionary biology. It is the result of roughly 3.8 billion years of selection pressures shaping nervous systems for survival, reproduction, and environmental adaptation. It is embodied in carbon-based “wetware”/biomass[3]. There is no separation between hardware and software in the human brain. Learning physically rewires neural tissue, memory is stored and retrieved through biochemical processes, emotion is regulated through hormonal systems, and cognition is fully integrated within the embodied structure of a living organism.
Artificial intelligence, by contrast, is engineered. It is built on silicon-based hardware running software architectures designed over decades, from perceptrons (the first artificial neurons that could be trained from data) in the 1950s to transformers in 2017[4]. Its hardware and software are separable. Its skills can be copied instantly across machines. Its objective functions are defined externally. Its optimization targets are specified by humans.
Let’s touch upon perceptrons for a minute. A perceptron is basically the simplest version of a machine that can learn from its mistakes.
Imagine you are trying to decide whether an email is spam or not. You look at a few things: how many links it has, how many suspicious words, who sent it, and maybe how long it is. A perceptron does the same thing. It takes these pieces of information as inputs.[5]
Each input is given a weight, which just means how important that factor is. For example, maybe lots of suspicious words matter more than email length. The perceptron multiplies each input by its weight, adds everything together, and gets a total score. If the score is above a certain number, it says “Spam.” If it is below that number, it says “Not Spam.”
That is the decision part.
The learning happens when it makes a mistake. If it guessed wrong, it adjusts the weights using a simple rule:
Change in weight = learning rate × (correct answer - prediction) × input [6]

Source: Medium
The learning rate just controls how big the adjustment is. Over thousands of examples, these small corrections move the decision boundary until the categories are separated properly.
To put the rule in words: if the perceptron makes the right prediction, nothing changes. But if it makes a mistake, it adjusts the weights. It increases the importance of inputs that should have mattered more or decreases the importance of ones that misled it. Over time, by correcting mistakes again and again, it gets better at separating spam from not spam. Visually, you can think of it like drawing a line between two groups of dots. If the line is in the wrong place, you move it. Eventually, the line separates the groups properly.
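To make the score, threshold, and update rule concrete, here is a minimal sketch in Python. The two features, the four example emails, the threshold of zero, and the learning rate of 0.1 are all made-up values for illustration, not real spam data.

```python
# Toy perceptron for the spam example. Features, labels, and the learning
# rate are illustrative assumptions, not real data.

def predict(weights, bias, features):
    """Weighted sum of the inputs, then a hard threshold: 1 = spam, 0 = not spam."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def update(weights, bias, features, target, learning_rate):
    """Change in weight = learning rate * (correct answer - prediction) * input."""
    error = target - predict(weights, bias, features)
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias + learning_rate * error  # bias behaves like a weight on a constant input of 1
    return new_weights, new_bias

# Hypothetical features: [number of links, number of suspicious words]
emails = [([5, 8], 1), ([4, 6], 1), ([0, 1], 0), ([1, 0], 0)]

weights, bias = [0.0, 0.0], 0.0
for epoch in range(20):
    mistakes = 0
    for features, label in emails:
        if predict(weights, bias, features) != label:
            mistakes += 1
            weights, bias = update(weights, bias, features, label, learning_rate=0.1)
    if mistakes == 0:  # a full pass with no errors: the line now separates the groups
        break

predictions = [predict(weights, bias, f) for f, _ in emails]
print("weights:", weights, "bias:", bias, "predictions:", predictions)
```

Weights only change on a wrong guess, which is exactly the behaviour described above. On this tiny dataset the loop stops after a handful of passes because the two groups can be split by a straight line.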
Modern AI systems are just much bigger versions of this idea. Instead of one simple decision line, they use many layers of these tiny decision units stacked together to handle more complicated problems. In essence, a perceptron is the basic building block of modern neural networks. It was the first real algorithm that allowed a machine to learn patterns from data instead of just following fixed rules.
Before the perceptron, computers were like calculators. You told them exactly what to do. The perceptron changed that. It allowed a machine to look at examples, make a guess, check if it was wrong, and then adjust itself. The original perceptron in 1958 had around 400 adjustable weights and was trained on punched cards to recognize things like aircraft in photos during the Cold War. It ran on hardware that could only handle millions of operations. Then in 1969, Marvin Minsky (co-founder of the field of artificial intelligence and the MIT AI Lab, known for proving the limits of early perceptrons and shaping modern AI theory) showed that a single-layer perceptron could not solve certain problems, like XOR logic. Funding dropped. This period became known as the “AI winter.”

Source: Build Electronic Circuit
Why is XOR logic so important?
Let’s first understand what XOR is. XOR means “exclusive OR”: the output is 1 only when exactly one of the two inputs is 1, never when both are. If we take a look at the truth table above, if A and B are both 0, then the output of the XOR gate is 0. If A and B are both 1, then the output is also 0. However, if A is 0 and B is 1 (or vice versa), then the output is 1.
When you plot these four XOR points on a graph, you quickly notice something important: the outputs labelled “1” sit diagonally opposite each other, and the outputs labelled “0” also sit diagonally opposite each other. Because of this arrangement, there is no single straight line that can cleanly separate the 1s from the 0s. That is the core issue.

Source: Research gate
A single perceptron makes decisions by computing a weighted sum of its inputs and then applying a threshold. Geometrically, this is equivalent to drawing one straight line (in two dimensions) to divide data into two categories. This works for functions like AND or OR because their outputs can be separated using one straight boundary. However, XOR cannot be separated this way, which means it is not linearly separable. Since a single-layer perceptron can only create linear boundaries, it structurally cannot solve XOR.
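The difference between AND and XOR can be checked directly. The sketch below runs the same perceptron learning rule on both truth tables; the learning rate and the cap of 1000 passes are arbitrary choices of mine. A pass with zero mistakes means a separating line was found.

```python
# Perceptron learning rule applied to AND (linearly separable) and XOR (not).
# The epoch limit and learning rate are arbitrary illustrative choices.

def train_perceptron(samples, epochs=1000, learning_rate=0.1):
    """Return (weights, bias, converged) after running the perceptron rule."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (a, b), target in samples:
            prediction = 1 if weights[0] * a + weights[1] * b + bias > 0 else 0
            error = target - prediction
            if error != 0:
                mistakes += 1
                weights[0] += learning_rate * error * a
                weights[1] += learning_rate * error * b
                bias += learning_rate * error
        if mistakes == 0:  # one clean pass: the current line separates the classes
            return weights, bias, True
    return weights, bias, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND separable by one line:", and_converged)
print("XOR separable by one line:", xor_converged)
```

For AND the loop ends after a few passes. For XOR it never ends cleanly however long it runs, because no straight line exists for the weights to settle on.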
In 1969, Marvin Minsky and Seymour Papert (a mathematician and MIT AI researcher who helped shape early artificial intelligence theory) formally proved this limitation in their book Perceptrons. They showed that single-layer perceptrons have strict mathematical limits and cannot compute functions like XOR or certain visual connectedness tasks. At the time, this proof had major consequences. Rosenblatt’s 1958 perceptron had generated optimism that machines could soon achieve human-like learning, but Minsky and Papert demonstrated that the model was far more limited than many had believed. As a result, confidence in neural networks declined. Funding agencies such as DARPA (the U.S. Defense Advanced Research Projects Agency, a government body that funds high-risk, high-impact technological research) reduced support, and neural network research slowed significantly for nearly a decade, contributing to what became known as the first AI winter. DARPA’s significance lies in shaping modern computing and AI by funding foundational breakthroughs such as early internet infrastructure, robotics, and artificial intelligence research.
The impact was not only technical but also psychological. If perceptrons could not even compute a simple logical function like XOR, critics argued that ambitions such as machine vision or artificial consciousness were unrealistic. Research shifted toward symbolic AI approaches that relied on explicit rules rather than learning from data. Neural network publications dropped sharply after 1969.[7]

| Dimension | Symbolic AI (Rules-Based) | Neural Networks (Data-Driven) |
|---|---|---|
| Core Idea | Intelligence through explicit logic and symbols | Intelligence through learning patterns from data |
| How It Works | Humans write rules such as “if X then Y” | Model adjusts internal weights through optimization |
| Learning Method | No learning from data; rules are manually encoded | Learns from datasets by minimizing prediction error |
| Knowledge Source | Expert-defined knowledge base | Statistical inference from examples |
| Transparency | High explainability; reasoning steps visible | Often opaque; logic distributed across parameters |
| Data Requirement | Low data requirement; high expert input | High data requirement; low manual rule writing |
| Strength | Structured reasoning in clearly defined domains | Pattern recognition in complex, noisy environments |
| Weakness | Brittle, struggles with ambiguity and scaling | Data-hungry and harder to interpret |

The revival began in the mid-1980s with the rediscovery and formalization of backpropagation, which allowed multiple layers of perceptrons to be trained together. In simple terms, it is the process of sending the error backward through the network so each connection knows how much it contributed to the mistake. So backpropagation literally means “spreading the error backward.” Without backpropagation, multi-layer neural networks would not know how to improve their hidden layers. With it, they can update millions or billions of parameters efficiently. By stacking perceptrons into multi-layer networks and adding non-linear activation functions, researchers could model complex patterns, including XOR.
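“Spreading the error backward” can be shown on a very small network. The sketch below is my own minimal illustration, not a production method: a network with two inputs, three hidden sigmoid neurons, and one output, trained on XOR with a squared-error loss. The network size, the random seed, the learning rate, and the number of steps are all assumptions. It also compares one backpropagated gradient against a brute-force finite-difference estimate to check that the chain rule was applied correctly.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
HIDDEN = 3  # number of hidden neurons (arbitrary illustrative choice)

def init_params(seed=0):
    rng = random.Random(seed)
    return {"W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(HIDDEN)],
            "b1": [0.0] * HIDDEN,
            "W2": [rng.uniform(-1, 1) for _ in range(HIDDEN)],
            "b2": 0.0}

def forward(p, x):
    hidden = [sigmoid(p["W1"][j][0] * x[0] + p["W1"][j][1] * x[1] + p["b1"][j])
              for j in range(HIDDEN)]
    output = sigmoid(sum(p["W2"][j] * hidden[j] for j in range(HIDDEN)) + p["b2"])
    return hidden, output

def loss(p):
    return sum((forward(p, x)[1] - y) ** 2 for x, y in XOR) / len(XOR)

def backward(p):
    """Backpropagation: pass the output error back through each layer via the chain rule."""
    g = {"W1": [[0.0, 0.0] for _ in range(HIDDEN)], "b1": [0.0] * HIDDEN,
         "W2": [0.0] * HIDDEN, "b2": 0.0}
    for x, y in XOR:
        hidden, out = forward(p, x)
        # Error signal at the output neuron (derivative of loss w.r.t. its pre-activation)
        delta_out = 2.0 * (out - y) / len(XOR) * out * (1.0 - out)
        g["b2"] += delta_out
        for j in range(HIDDEN):
            g["W2"][j] += delta_out * hidden[j]
            # Share of the error attributed to hidden neuron j
            delta_h = delta_out * p["W2"][j] * hidden[j] * (1.0 - hidden[j])
            g["b1"][j] += delta_h
            g["W1"][j][0] += delta_h * x[0]
            g["W1"][j][1] += delta_h * x[1]
    return g

params = init_params()

# Sanity check: one backprop gradient versus a finite-difference estimate.
eps = 1e-6
analytic = backward(params)["W1"][0][0]
params["W1"][0][0] += eps
loss_plus = loss(params)
params["W1"][0][0] -= 2 * eps
loss_minus = loss(params)
params["W1"][0][0] += eps  # restore the original value
gradient_gap = abs(analytic - (loss_plus - loss_minus) / (2 * eps))

# Plain gradient descent using the backpropagated gradients.
learning_rate = 0.5
initial_loss = loss(params)
for _ in range(2000):
    g = backward(params)
    params["b2"] -= learning_rate * g["b2"]
    for j in range(HIDDEN):
        params["W2"][j] -= learning_rate * g["W2"][j]
        params["b1"][j] -= learning_rate * g["b1"][j]
        for i in range(2):
            params["W1"][j][i] -= learning_rate * g["W1"][j][i]
final_loss = loss(params)
print(f"gradient gap: {gradient_gap:.2e}, loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The loss should fall as training proceeds. Whether this particular run reaches a perfect XOR solution depends on the random starting weights, so I make no claim about that here. The point is only that every weight, including those in the hidden layer, receives its own share of the error.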

Later theoretical work, such as the universal approximation theorem in 1989, proved that multi-layer neural networks could approximate a wide range of functions. In this way, the XOR problem did not end neural networks; it revealed their limitations and pushed the field toward deeper architectures, ultimately laying the foundation for modern deep learning.
The Universal Approximation Theorem says something very specific and very powerful: a neural network with at least one hidden layer and a non-linear activation function can approximate any continuous function, as closely as you want, if you give it enough neurons.[8] Now, what does this even mean?
It does not say the network will automatically learn the function. It says the network has the capacity to represent it. Now let’s make this intuitive.
Think of a complicated curve, like a wavy sine function.
A single perceptron can only draw a straight line. That means it can only represent linear relationships. So it fails on anything curved, and on anything that is not linearly separable, including XOR (as seen previously with the XOR gate). Now imagine building that curved shape using small blocks.
Each neuron in the hidden layer creates a small “bump” or step-like shape. If you add enough of these bumps together, you can approximate any smooth curve. The more neurons you add, the closer the approximation becomes. [9]

Source: AI
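The “stack enough bumps” intuition can be tried out numerically. The sketch below builds a one-hidden-layer sigmoid network by hand, with no training, to imitate sin(x) on the interval from 0 to 2π. Each hidden neuron is a smoothed step, and its output weight is the rise of the curve across that step. The construction, the `sharpness` parameter, and the neuron counts are my own illustrative assumptions. It is not Cybenko's proof, only a picture of why more neurons give a closer fit.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_network(f, lo, hi, n_neurons, sharpness=20.0):
    """Hand-pick weights for a one-hidden-layer sigmoid network that mimics f on [lo, hi]."""
    width = (hi - lo) / n_neurons
    steepness = sharpness / width  # hidden-layer weight, shared by all neurons
    neurons = []
    for i in range(n_neurons):
        left, right = lo + i * width, lo + (i + 1) * width
        center = (left + right) / 2.0  # where this neuron's step switches on
        neurons.append((center, f(right) - f(left)))  # output weight = rise of f over the step
    output_bias = f(lo)

    def network(x):
        return output_bias + sum(height * sigmoid(steepness * (x - center))
                                 for center, height in neurons)
    return network

def max_error(f, net, lo, hi, samples=1000):
    """Largest gap between f and the network on an evenly spaced grid."""
    points = [lo + (hi - lo) * k / samples for k in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in points)

errors = {}
for n in (5, 20, 80, 320):
    net = build_network(math.sin, 0.0, 2 * math.pi, n)
    errors[n] = max_error(math.sin, net, 0.0, 2 * math.pi)
    print(f"{n:4d} hidden neurons -> worst-case error {errors[n]:.4f}")
```

The worst-case error shrinks as neurons are added, which is the theorem's promise in miniature: pick a tolerance, and a large enough hidden layer meets it.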
Formally, the theorem states:
For any continuous function f(x) and any error tolerance ε > 0, there exists a neural network whose output differs from f(x) by less than ε for every input in a closed, bounded region.[10]
In plain English:
ε is just a tiny number. It represents how much error you are willing to tolerate.
If ε = 0.1, you are okay being off by 0.1.
If ε = 0.0001, you want to be extremely close.
ε > 0 means epsilon is strictly greater than zero: a very small positive number, never zero itself. So the theorem allows some error, but you get to choose how small that allowed error is, and some network will still meet it.
f(x) just means “some function.” Think of it as any rule that takes an input and gives an output.
Example:
If f(x) = x², then when x = 2, f(2) = 4.
If f(x) = sin(x), then it gives you a wave.
A neural network does not “find functions through f(x).” Instead, f(x) represents the true underlying relationship in the world, meaning the real mapping from inputs to outputs that exists whether we know it precisely or not. The neural network does not have direct access to f(x). It only sees examples of inputs paired with outputs. From these examples, it attempts to learn its own function, usually written as ŷ(x), that behaves similarly to f(x).
Structurally, we can think of it like this:
f(x) = the true relationship
ŷ(x) = the network’s approximation
The network adjusts its internal parameters until ŷ(x) is close to f(x) on the data it has observed. It does not explicitly discover the true function. It builds a mathematical approximation of it based on patterns in the data.
More precisely, a neural network is itself a mathematical function. It takes an input x, applies a sequence of weighted transformations and non-linear activation functions, and produces an output ŷ[11]. Formally, we can write this as:
ŷ = NN(x; θ)
Here, x is the input, θ represents all the weights and biases in the network, NN denotes the network’s transformation, and ŷ is the predicted output.
When we say the “function of a neural network,” we mean that it defines a flexible mathematical mapping from input space to output space. During training, the goal is to adjust θ so that:
NN(x; θ) ≈ f(x)
In plain English, the neural network’s job is to learn a rule that converts inputs into accurate outputs[12]. It is not uncovering truth in an absolute sense. It is fitting a function that best explains the data it has been given.
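The idea that a network is just a function NN(x; θ) whose parameters are nudged until it tracks f(x) can be written out in a few lines. In this sketch f(x) = x² stands in for the “true relationship”, and the network only ever sees sampled input and output pairs. The hidden layer size, random seed, learning rate, and step count are arbitrary assumptions. To keep the code short, the gradient is estimated by finite differences; real training would use backpropagation instead.

```python
import math
import random

H = 4  # hidden units (arbitrary illustrative choice)

def f(x):
    """Stand-in for the true relationship; in practice it is unknown."""
    return x ** 2

def NN(x, theta):
    """y_hat = NN(x; theta): one hidden layer of tanh units, then a weighted sum."""
    w1, b1, w2, b2 = theta[:H], theta[H:2 * H], theta[2 * H:3 * H], theta[3 * H]
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2

def mean_squared_error(theta, data):
    return sum((NN(x, theta) - y) ** 2 for x, y in data) / len(data)

def numerical_gradient(theta, data, eps=1e-5):
    """Finite-difference gradient of the loss with respect to every parameter."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((mean_squared_error(up, data) - mean_squared_error(down, data)) / (2 * eps))
    return grad

rng = random.Random(0)
data = [(i / 10, f(i / 10)) for i in range(-10, 11)]    # the only view of f the network gets
theta = [rng.uniform(-1, 1) for _ in range(3 * H + 1)]  # all weights and biases in one list

loss_before = mean_squared_error(theta, data)
for _ in range(500):
    grad = numerical_gradient(theta, data)
    theta = [t - 0.05 * g for t, g in zip(theta, grad)]  # adjust theta to reduce the error
loss_after = mean_squared_error(theta, data)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
print(f"f(0.5) = {f(0.5):.3f}, NN(0.5; theta) = {NN(0.5, theta):.3f}")
```

Nothing in the loop ever calls f directly during the update. The network only fits the sampled pairs, which is why it is an approximation of the relationship and not a discovery of it.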
Fast forward to today. Modern neural networks still use the same core idea: weighted inputs, prediction, error correction. But instead of one layer with a few hundred weights, we now have models with hundreds of layers and billions or even trillions of parameters. Instead of thousands of images, we train on billions of data points. Even transformers and large language models are built on error-driven weight updates using gradient descent. Most neural network systems in production today still rely on this fundamental principle inherited from the perceptron.

So what are the applications of neural networks?
At the most basic level, neural networks are used anywhere we need to map inputs to outputs and detect complex patterns that are too messy for simple rules. In stock market forecasting, neural networks analyse inputs such as historical prices, trading volumes, and economic indicators to predict short-term price direction. Even small improvements in directional accuracy, such as 55 to 65 percent[13] over short time horizons, can become economically significant when deployed at scale in algorithmic trading systems.
In fraud detection, banks use neural networks to evaluate transaction features like amount, location, timing, IP address, and spending velocity. Instead of relying on fixed thresholds, the model produces a risk score based on learned patterns, significantly reducing missed fraudulent transactions compared to rule-based systems.
In recommendation systems, neural networks process user behavior data and item characteristics to rank products or content. By learning which patterns of past behavior predict future interest, platforms can increase click-through rates and overall engagement by substantial margins.
In health risk prediction, neural networks analyze structured medical inputs such as age, BMI, glucose levels, and other biomarkers to estimate the probability of disease onset. These models often achieve high performance metrics and support preventive decision-making in clinical contexts[14].
Across all these applications, the structure is the same: input features go in, weighted transformations occur across layers, and a prediction comes out. Neural networks are powerful because they learn complex, non-linear relationships directly from data rather than relying on manually written rules.
What was most interesting to me was the similarity of the perceptron to a neuron: the perceptron is to AI what the neuron is to the brain. We have to dive deep into the individual node of connection before we can explore the expanse of intelligence.

Source: Medium


| Dimension | Perceptron (1958 Model) | Modern Artificial Neuron | Biological Neuron |
|---|---|---|---|
| Core Computation | z = sum of (weight × input) + bias | Same weighted sum formula | Electrical signals summed across dendrites |
| Activation Mechanism | Hard step function (outputs 0 or 1 only) | Smooth functions like sigmoid, rectified linear unit, or hyperbolic tangent | Fires when membrane potential crosses threshold |
| Output Type | Binary only | Continuous value (for example between 0 and 1, or negative to positive range) | Electrical spike signal |
| Decision Boundary | Only straight line separation | Can form curved and complex boundaries when layered | Not a geometric boundary; dynamic signal propagation |
| Ability to Solve Complex Patterns | Cannot solve problems like exclusive OR | Can solve complex non-linear relationships when stacked in layers | Handles complex temporal and spatial patterns |
| Learning Mechanism | Simple perceptron update rule based on error | Gradient descent using backpropagation across layers | Synaptic plasticity, long-term potentiation |
| Gradient Availability | Not differentiable, no smooth gradient | Differentiable activations allow gradient flow | No mathematical gradient; biochemical adaptation |
| Network Role | Standalone linear classifier | Basic unit inside deep multi-layer networks | Fundamental unit of the nervous system |
| Memory | No memory | No memory by default, but extended architectures allow sequence processing | Intrinsic temporal behavior and memory mechanisms |
| Scale | Tens to hundreds of weights | Millions to trillions of parameters | Roughly 86 billion neurons in human brain |
| Biological Accuracy | Very simplified inspiration | Even more abstract mathematical model | Real electrochemical system |
| Energy Efficiency | Extremely low computational load | High computational load in large models | Operates at roughly 20–25 watts in entire brain |
| Purpose | Early experiment in machine learning | Foundation of modern artificial intelligence systems | Enables perception, thought, emotion, and behavior |

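The “Activation Mechanism” and “Gradient Availability” rows are easy to see numerically. The short sketch below evaluates the hard step function next to sigmoid, hyperbolic tangent, and rectified linear unit, and estimates the slope of each at a few points. The sample points and the finite-difference step are arbitrary choices of mine.

```python
import math

def step(z):
    """Hard threshold used by the 1958 perceptron."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def slope(fn, z, eps=1e-6):
    """Central finite-difference estimate of the derivative of fn at z."""
    return (fn(z + eps) - fn(z - eps)) / (2 * eps)

activations = {"step": step, "sigmoid": sigmoid, "tanh": math.tanh, "relu": relu}
points = (-2.0, -0.5, 0.5, 2.0)
for name, fn in activations.items():
    outputs = [round(fn(z), 3) for z in points]
    slopes = [round(slope(fn, z), 3) for z in points]
    print(f"{name:8s} outputs={outputs} slopes={slopes}")
```

Away from its jump at zero, the step function has a slope of exactly zero, so it gives gradient descent nothing to follow. The smooth activations return a usable slope at most points, which is what lets the error flow back through many layers.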
A 2021 analysis in Frontiers in Artificial Intelligence argues that equating digital reasoning with biological reasoning is “probably unwarranted.” Digital systems operate with a “completely different operating system (digital vs biological)” and possess “correspondingly different cognitive qualities and abilities than biological creatures.” The authors go further: digital reasoning and problem-solving agents “only compare very superficially to their biological counterparts.”
Superficially. That word matters.
So how should intelligence be defined?
If we adopt an anthropocentric definition, one that treats humankind as the central reference point, intelligence is measured against human cognition. We define it in terms of reasoning, creativity, planning, emotion, and language as humans express them. Human cognition becomes the benchmark.

Source: ScienceDirect
But that raises a harder question.
Why do we define intelligence around ourselves? Our arrogance as a species about our own capabilities is hard to overstate.
Because we are the only system with direct access to subjective experience. Our internal awareness becomes the reference point. We confuse familiarity with universality.
A non-anthropocentric definition shifts the frame. Intelligence becomes “the capacity to realize complex goals.” Under this framing, intelligence is functional. It is about goal achievement across environments. Narrow AI realizes restricted goals. AGI would realize complex goals across broad domains. Human intelligence is one instance of this capacity, not the template.
But even with that broader definition, origin still matters.
A 2024 study in BioSystems emphasizes that natural intelligence is rooted in self-organization and embodiment. It is not optimized for a single metric. It is shaped by ecological integration and subjective unity. Artificial intelligence, by contrast, is stylized optimization. It minimizes loss functions. It processes probability distributions. It does not possess unified experience.
Quantitatively, AI systems can process data at volumes and speeds no human can match and execute repetitive tasks with near-perfect consistency. The human brain runs on roughly 20 to 25 watts yet outperforms AI in abstract creativity, contextual adaptation, and emotionally informed novel problem-solving. Evolution tuned humans for adaptability, not throughput.
This leads to consciousness.
Integrated Information Theory attempts to quantify consciousness through integrated causal structure. Human cortical systems exhibit high levels of integrated information. Current AI architectures exhibit near-zero intrinsic integration in this sense. They compute inputs and outputs. They do not experience them.
AI can simulate empathy through pattern recognition. It can generate language that resembles understanding. But simulation is not subjective experience. Emotions in humans are embodied states tied to survival, physiology, and memory. AI outputs are statistical predictions.
So can artificial intelligence become natural intelligence?
Even if a future system achieved self-awareness, it would remain artificial in origin. It would not emerge through biological evolution. It would not share the developmental, hormonal, or embodied grounding of natural organisms. Artificial refers to genesis. Natural refers to evolutionary emergence.
The deeper tension is not whether AI can outperform humans. It already does in narrow domains. The deeper tension is whether intelligence is reducible to computation alone, or whether its evolutionary and embodied origins confer irreducible properties such as subjective awareness and unified selfhood.
If intelligence is defined as goal realization, AI qualifies.
If intelligence includes embodied, evolutionarily grounded subjective experience, AI does not.
So the debate is not simply about capability. It is about ontology. Ontology is the study of what actually exists and what kind of thing something is at its most fundamental level.
It does not focus on what something does or how well it performs, but on what it is in its nature of being. For example, if a machine speaks like a human, ontology asks whether it truly understands or whether it is simply processing patterns. It is not a question about performance or intelligence scores; it is a question about the nature of existence. When we ask whether artificial intelligence is the same as natural intelligence, we are asking an ontological question about what kind of entity each one truly is.
In the end, are we measuring intelligence as performance? Or are we defining intelligence as a mode of being? Until that distinction is resolved, artificial and natural intelligence will remain superficially comparable but fundamentally different.[15]
[3] Moravec’s Paradox (1988), Pfeifer/Versuchel's Embodied AI (2006): human learning rewires synapses biochemically (Hebbian plasticity)
[8] Cybenko, "Approximation by superpositions of a sigmoidal function" (1989): Single hidden layer + sigmoid = dense in continuous functions
[9] Cybenko's proof: Sigmoids as smoothed step functions
[10] https://john-s-butler-dit.github.io/NM_ML_DE_source/Chapter%2009%20-%20UAT/902_Universal_Approximation_Theorem.html
[11] Vapnik's Statistical Learning Theory (1998)
[12] Goodfellow et al., Deep Learning (2016), Ch. 6: Exact mathematical formulation taught worldwide
[13] Neural Networks for Financial Time Series Forecasting surveys; Renaissance Technologies production MLPs
[14] AUC 0.85-0.92: PIMA diabetes, Framingham heart risk (Kaggle benchmarks)
[15] JEH Korteling et al., "Human- versus Artificial Intelligence", Frontiers in Artificial Intelligence (2021)