Soroush Abbasi Koohpayegani
I named this one "Entropy". It resembles a tree in nature, and its structure is similar to a UNet in neural networks.
UNets have been utilized in various models such as auto-encoders, where the objective is to represent any distribution in a compressed, lower-dimensional distribution
(latent space). In recent diffusion models, UNets have been employed to progressively transfer the distribution of noise to that of natural images.
We can observe that distribution transfer occurs in nature as well. For instance, trees transport elements from the soil to form
complex structures like leaves and fruits.
Nature consistently shifts distributions from higher entropy to lower entropy states.
This gradual change is similar to diffusion models, where each iteration only partially denoises the image. In our world,
the distribution of specific states is intricately linked to the previous distribution. For instance, the presence of life depends
on the existence of oxygen in the preceding distribution. Consequently, the probability of life increases with the presence of oxygen in the previous state.
In the beginning, entropy is at its highest—pure random noise, with nothing known.
What does the picture look like? Just boring noise. At the end, entropy is at its lowest, where everything is known,
presenting a clear picture devoid of any room for imagination. It will remain the same boring picture from that point onwards.
Somewhere in between, where entropy is balanced, lies the middle ground of the dance between high entropy and low entropy—an exciting distribution to live in.
It's predictable enough to maintain stability, yet unpredictable enough to keep things exciting.
This represents a balance between chaos and order. These two states mirror each other, both being boring but in two different distributions.
Some additional details: I draw inspiration from the beauty of Turquoise, and this amazing video by
Derek Alexander Muller.
Download the high-resolution version.
For me, machine learning involves the transfer of data distributions. Sometimes, our interest lies in sampling from a distribution,
while at other times, we aim to transfer one distribution to another with desired properties.
For instance, in discriminative tasks like image classification, our goal is to transfer the distribution of natural images to a compact distribution where images
can be discriminated based on specific criteria. Each layer of a transformer network can represent a step that transforms the input
distribution into another distribution until reaching the final layer, where the output is from the target distribution.
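To make this concrete, here is a minimal sketch, with plain linear layers standing in for transformer blocks; the input size, hidden width, and number of classes are arbitrary assumptions. Each stacked layer nudges the incoming distribution one step closer to a compact distribution over class logits.

```python
import torch
import torch.nn as nn

class DistributionTransferClassifier(nn.Module):
    """Each layer is one step of distribution transfer; the final layer
    outputs the compact, discriminative distribution (class logits)."""
    def __init__(self, in_dim=3 * 32 * 32, hidden=256, num_classes=10, depth=4):
        super().__init__()
        layers = [nn.Flatten(), nn.Linear(in_dim, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]  # an intermediate distribution
        layers += [nn.Linear(hidden, num_classes)]            # the target, discriminative distribution
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

images = torch.randn(8, 3, 32, 32)            # a batch standing in for natural images
logits = DistributionTransferClassifier()(images)
print(logits.shape)                           # torch.Size([8, 10])
```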
In generative models, our interest lies in sampling from a specific distribution.
The difficulty lies in the fact that not all distributions are equally easy to sample from, and we need to sample from those distributions to train generative models.
Structures with high complexity (low entropy) are typically rare because their existence is conditioned on previous distributions. As a result, sampling them presents a challenge.
Conversely, structures with low complexity (high entropy) are usually common and have a high probability of existence, thus making them easy to sample.
Most of the distributions that we are
interested in have low entropy and are rare. Examples include the distribution of natural images, the weights of a neural network
that correspond to a minimum of a loss function, or the solutions to a puzzle (e.g., a jigsaw or a Rubik's Cube).
There is an asymmetry in difficulty between constructing a complex structure and destroying it.
Transitioning from a distribution with low entropy to one with high
entropy is straightforward. A random walk on states can increase the entropy and remove information, and since high entropy distributions are
more likely to occur, we eventually end up in those distributions after enough steps.
For example, if we randomly change the state of a Rubik's Cube or add random noise to an image, eventually we end up with a state where the entropy is high and the appearance resembles noise.
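As a small illustrative sketch of that destruction process, a random walk in pixel space is enough; the step count and noise scale below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def destroy(image, num_steps=1000, noise_scale=0.05):
    """A random walk on pixel values: each step adds a little Gaussian noise,
    raising entropy until the image is indistinguishable from noise."""
    x = image.copy()
    for _ in range(num_steps):
        x = x + noise_scale * rng.standard_normal(x.shape)
    return x

image = np.zeros((64, 64))
image[16:48, 16:48] = 1.0       # a simple square: structured, low entropy
noisy = destroy(image)
print(noisy.std())              # after enough steps the square is buried under noise
```

With enough steps, the accumulated noise dominates and the original square is no longer visible.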
If we could reverse time for the destruction process, we could sample from a known distribution with high entropy (e.g., Gaussian noise) and then reverse time to return to the original distribution.
However, since we don't have access to the function that performs the destruction, we cannot reverse time using the inverse of that function.
Nevertheless, we might be able to learn the reverse process by parameterizing that function with a deep neural network.
This entails learning to reverse time.
To learn the reverse process, we need data. We can sample from the target distribution and elevate the entropy of those samples through
destructive steps. With carefully designed destructive steps, after a few iterations, this process transfers the target distribution to a known distribution.
Subsequently, we aim to learn to reverse this process—a function that, when given a known distribution, transfers it to the target distribution.
The critical aspect here is the ability to sample from the target distribution.
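One way to picture what this learning looks like in practice is a denoising-diffusion-style sketch, assuming a linear noise schedule, a toy two-dimensional "target distribution" (points on a ring), and a small MLP standing in for a UNet; all of these choices are placeholder assumptions rather than anything prescribed above.

```python
import torch
import torch.nn as nn

T = 100                                              # number of destruction steps
betas = torch.linspace(1e-4, 0.02, T)                # assumed noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# A small MLP (a stand-in for a UNet) that predicts the noise that was added.
model = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def sample_target(n):
    # Placeholder target distribution: points on a ring (structured, low entropy).
    theta = 2 * torch.pi * torch.rand(n)
    return torch.stack([theta.cos(), theta.sin()], dim=1)

for step in range(1000):
    x0 = sample_target(256)                          # sample from the target distribution
    t = torch.randint(0, T, (x0.shape[0],))          # pick a random destruction step
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise      # destroyed sample at step t
    pred = model(torch.cat([xt, t.unsqueeze(1) / T], dim=1))
    loss = ((pred - noise) ** 2).mean()              # learn to undo the destruction
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At sampling time, one would start from Gaussian noise (the known, high-entropy distribution) and repeatedly apply the learned model to step back toward the target distribution.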
In general, I'm deeply intrigued by discovering all of the complex structures with the lowest entropy in the universe and modeling their distribution.
To appreciate the remarkable nature of low-entropy states, consider the probability of you, as a human,
existing and reading this webpage on your computer, given the initial state of the universe—a universe with maximum
entropy and pure randomness of basis elements.
The hourglass, similar to a UNet, serves as a destruction function. The sand in the top part can originate from any distribution.
Once a sample from the target distribution is placed on top and the function is executed, it transforms any distribution into
a normal-like distribution in the bottom part, a known distribution. I wish I could sample from the universe, place it atop the hourglass, and learn the reverse function.