A Mathematical Explanation of Shannon Entropy

Author: Dmitri Eulerov

Introduction

In information theory, the entropy of a random variable represents the average level of "uncertainty" inherent in the random variable's possible outcomes. First introduced in Claude Shannon's 1948 paper "A Mathematical Theory of Communication", entropy is widely used in machine learning and statistics, and it is a crucial concept for machine learning practitioners to understand.

In this article, we'll be exploring the intuition and formalization behind entropy. Let's approach it with a beginner's mind, and try to understand it from first principles.

Thought Experiments

Before diving into any technical details, let's first attempt to explain the phenomenon we are formalizing.

Weather Patterns in Three Cities

Imagine a scenario with three cities, each with a different weather pattern. A digital transmitter is placed in each city and sends us messages reporting the weather.

Below is the probability distribution for each city's weather pattern:

  • City A: 100% sunny
  • City B: 50% sunny, 50% rainy
  • City C: 10% sunny, 10% rainy, 20% cloudy, 30% stormy, 30% snowy

On average, how much information do we receive in a message from each city?

This is exactly the question that Shannon entropy helps us answer.

Using common sense, we can conclude that City A's signal is not useful, since it's always sunny. City B's signal tells us more than A's, though if we guessed blindly we'd still be right half the time, since there's a 50/50 chance of sun or rain. City C's weather is the most unpredictable, so each message reporting its weather carries the most information.

Ranking by average information per message: City A < City B < City C

Lottery vs. Weather Prediction

Let's entertain another thought experiment. Imagine we receive two messages: the first tells us whether it will rain tomorrow, and the second tells us the exact 12-digit sequence that will win tomorrow's $500M USD lottery. Which message contains more information? (Hopefully, you are thinking "obviously the 12-digit sequence!")

The lottery message contains more information because the probability of guessing the correct 12-digit sequence is much lower than guessing whether it will rain tomorrow. Rare events pack a bigger informational punch!
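
To preview the math we'll formalize below: if we assume the rain forecast is roughly a 50/50 call and that every 12-digit sequence is equally likely (both assumptions are purely for illustration), the rain message carries -\log_2(0.5) = 1 bit of information, while the lottery message carries -\log_2(10^{-12}) \approx 39.9 bits.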

The Relationship Between Information & Probability

From the simple thought experiments above, we can observe a couple of important properties:

  1. For any given probability distribution, the more uncertainty it contains, the more information we receive on average.
  2. For any given event, the less likely the event is to occur, the more information it contains.

Point 1 is clear from the weather example above, and Point 2 drives home the inverse relationship between probability and information content.

With these two observations in mind, let's now explore the mathematical formalization.

Mathematical Formalization

The Shannon entropy, H(X), is defined as:

H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)

where:

  • H(X): The entropy of the random variable X, which quantifies the amount of uncertainty or randomness in X.
  • \sum_{i=1}^{n}: A summation over all n possible outcomes of the random variable X.
  • P(x_i): The probability of the i-th outcome of the random variable X.
  • \log_b: The logarithm to the base b. Commonly, b is 2 (binary logarithm), e (natural logarithm), or 10 (common logarithm), depending on the context or field of study. In information theory, base 2 is commonly used, as entropy is often measured in bits (see the note just after this list).
  • x_i: The i-th outcome of the random variable X.
  • n: The total number of possible outcomes of the random variable X.
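
A quick note on the base: since \log_2 x = \ln x / \ln 2, changing the base only rescales the result. Entropy measured in nats (base e) converts to bits by dividing by \ln 2 \approx 0.693; the choice of base changes the unit, not what entropy measures.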

Let's break down the two main parts here.

Information In A Single Event: -\log_b P(x_i)

As mentioned earlier, for any given event, the less likely it is to occur, the more information it should contain. Taking the negative logarithm of the probability expresses this directly: lower-probability events contain more information, and higher-probability events contain less. Since probabilities lie between 0 and 1, their logarithms are non-positive, so the negative sign ensures the information content is always non-negative.
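
Here is a minimal Python sketch of this quantity (the helper name self_information is just for illustration):

```python
import math

def self_information(p: float, base: float = 2) -> float:
    """Information content of a single event with probability p (in bits when base=2)."""
    return -math.log(p, base)

print(self_information(0.99))  # ~0.014 bits: an almost-certain event tells us very little
print(self_information(0.5))   # 1.0 bit: a fair coin flip
print(self_information(0.1))   # ~3.32 bits: a rarer event is far more informative
```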

Weight Of Each Event (the probability of that event occurring): P(x_i)

Since we want to measure the average amount of information in a message that describes an entire distribution, we should weight each event by its probability of occurring.
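
Putting the two pieces together, entropy is simply the expected value of the information content of a single event:

H(X) = \mathbb{E}\left[ -\log_b P(X) \right] = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)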

Example Calculation

Let's calculate the entropy for the weather distribution in City B:

  • Sunny: P(x_1) = 0.5
  • Rainy: P(x_2) = 0.5

Using the formula with base 2 logarithm:

\begin{align*}
H(X) &= -\sum_{i=1}^{2} P(x_i) \log_2 P(x_i) \\
&= -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) \\
&= -(0.5 \times -1 + 0.5 \times -1) \\
&= 1
\end{align*}

The entropy of City B's weather distribution is 1 bit, meaning that on average, each message from City B contains 1 bit of information.
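
As a quick check, here is a minimal Python sketch that computes the entropy of all three cities from the first thought experiment (the entropy helper and city lists are just for illustration):

```python
import math

def entropy(probs, base: float = 2) -> float:
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

city_a = [1.0]                      # always sunny
city_b = [0.5, 0.5]                 # sunny / rainy
city_c = [0.1, 0.1, 0.2, 0.3, 0.3]  # sunny / rainy / cloudy / stormy / snowy

print(entropy(city_a))  # 0.0 bits   -> the message never tells us anything new
print(entropy(city_b))  # 1.0 bit    -> matches the hand calculation above
print(entropy(city_c))  # ~2.17 bits -> the most unpredictable city
```

This matches the intuitive ranking City A < City B < City C from earlier.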

Finally, let's put what this equation does in English.

The Shannon entropy sums the negative log-probability of each event, weighted by that event's probability, effectively measuring the average unpredictability, or information content, of a set of events.

Why the Logarithm?

Logarithms have some nifty properties that make them perfect for quantifying information:

  1. Additive Properties:

    • Logarithms are additive, meaning logb(x×y)=logb(x)+logb(y)\log_b(x \times y) = \log_b(x) + \log_b(y).
    • This lets information add up: the total information of independent events occurring together is simply the sum of their individual information contents (see the short sketch after this list).
  2. Intuitive Numbers:

    • Logarithms give us intuitive numbers to work with, even for extremely tiny probabilities.
    • For example, the negative log base 2 of 0.0000000005 is about 30.9, which is way easier to wrap our heads around!
  3. Encoding Schemes:

    • In different encoding schemes (binary, DNA, ASCII), each extra position scales the total possibilities by the number of unique elements in the set.
    • This scaling is conveniently represented by the base of a logarithm.
    • In computing, we commonly use bits, so it's natural to use log2\log_2. For other encoding systems, we just change the base of the logarithm to match.
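
Here is a short sketch of the additive property in action, using two hypothetical independent events (a 50% chance of rain and a fair die landing on 6):

```python
import math

p_rain, p_six = 0.5, 1 / 6  # two independent events

# Because log(p * q) = log(p) + log(q), the information of the joint event
# equals the sum of the individual information contents.
joint = -math.log2(p_rain * p_six)
separate = -math.log2(p_rain) + -math.log2(p_six)

print(joint, separate)  # both ~3.585 bits
```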

As Claude Shannon himself put it:

The logarithmic measure is more convenient for various reasons:

  1. It is practically more useful. [...]
  2. It is nearer to our intuitive feeling as to the proper measure. [...]
  3. It is mathematically more suitable. [...]

— Claude Shannon, "A Mathematical Theory of Communication," 1948

Applications

Entropy has a ton of applications across different fields:

  • Machine Learning: Entropy is used in decision trees to measure the impurity of a split and in the construction and selection of features for model training (a quick sketch of the decision-tree case follows this list).
  • Data Compression: Entropy provides a theoretical lower bound for the average number of bits needed to encode a message, which is used in compression algorithms.
  • Communication Systems: Entropy is used to quantify the capacity of communication channels and to design efficient coding schemes.
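
As a taste of the decision-tree use case, here is a hedged sketch of information gain; the split and class counts are made up purely for illustration and aren't tied to any particular library:

```python
import math

def entropy(probs, base: float = 2) -> float:
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical split of 10 labelled samples (6 positive, 4 negative) into two branches of 5.
parent = entropy([0.6, 0.4])  # impurity before the split (~0.971 bits)
left = entropy([1.0])         # left branch: 5 positive, 0 negative -> 0 bits
right = entropy([0.2, 0.8])   # right branch: 1 positive, 4 negative (~0.722 bits)

# Information gain = parent entropy minus the size-weighted entropy of the children.
gain = parent - (0.5 * left + 0.5 * right)
print(round(gain, 3))  # ~0.61 bits: the split removes a good chunk of the uncertainty
```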

Final Thoughts

A solid understanding of entropy is crucial for anyone interested in machine learning. In building decision trees, we reduce the entropy of the dataset through recursive splitting. In deep learning and neural nets, cross-entropy is a common loss function for classification problems trained with gradient descent. Finally, it's simply a beautiful concept, and grasping it helps us better understand the world around us.

I hope this article has helped you gain a deeper understanding of Shannon entropy and its significance in the field of machine learning. Remember, entropy is a measure of uncertainty or randomness in a probability distribution, and it helps us quantify the average amount of information contained in a message or event.

Thanks for reading!