So we saw in the last lesson how Entropy is defined to quantify the apparent disorder of a room full of particles, and how it provides statistical information about the average behaviour of those particles in the form of a description of their statistical properties. In information theory, particles are replaced by symbols forming objects such as sequences or messages, and the question becomes how ordered or disordered a sequence of symbols looks, rather than a room full of particles. In the context of messages written in some human language, the equivalent of the room would be the space of all possible well-formed sentences in that language.

For example, a sequence of bits that is just a repetition of the same symbol, say a long run of 1s, looks highly ordered: there is very little surprise in scanning such a sequence and finding that more and more 1s keep coming. A sequence whose bits appear scattered at random, however, looks much less ordered. It is also said to look more typical, because it is the kind of sequence one would expect by chance, yet every bit comes as a surprise, since the first segment of the sequence gives no information about what is to come next. This is what one would expect from, for example, the result of tossing a coin, with 0 corresponding to heads and 1 to tails, each toss independent of the others and telling us nothing about the next result.

Notice that, just as with the room in the previous lesson, this kind of order and disorder is of a statistical nature, that is, it concerns how symbols appear to be distributed along the sequence of bits. Two sequences will have the same Entropy if they have the same statistical properties, and they will have the same statistical properties if each symbol occurs in the same proportion relative to the occurrences of all the other possible symbols. In the case of a binary sequence, for example, if it has about the same number of 1s as 0s, then the sequence will be assigned high Entropy. Take a binary string with four 0s and six 1s, and then a sample of 10 permutations of the bits of that same string. All these strings will have the same Shannon Entropy within the set of all possible binary strings of the same length, because they all have exactly the same number of 0s and 1s as the original.

The formula to calculate the Shannon Entropy, denoted by the letter H, of a string or sequence s is traditionally written as follows:

H(s) = -Σ_i P(s_i) log P(s_i)

that is, minus the sum, over all n possible symbols, of the probability of occurrence of the symbol s_i in s multiplied by the logarithm of that same probability.

We can illustrate its application with the following example. Let's take as our sequence something written in English, such as 'this is not a sentence', and look at its distribution of letters. We can count how many times each symbol, in this case each letter, occurs in the sentence, effectively producing a discrete distribution of letters from that message. The letter 'a' occurs once, just like 'c', 'h' and 'o', followed by the letter 'i' appearing twice, then 'e', 'n', 's' and 't' appearing three times each, and the space 4 times in the sequence. The distribution looks like this, with no particular order on the x axis.
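Here is a minimal sketch of how one might reproduce this count and the bar chart in the Wolfram Language; the variable names and chart options are my own choices for illustration:

    s = "this is not a sentence";
    tally = Sort[Counts[Characters[s]]]
    (* counts sorted in ascending order: "a", "c", "h", "o" -> 1; "i" -> 2; "e", "n", "s", "t" -> 3; " " -> 4 *)
    BarChart[Values[tally], ChartLabels -> Keys[tally]]  (* the distribution of letters *)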
It does not look very special, but it also does not look completely uniform, which means that it may be conveying some information. Then, for every symbol s_i, in this case every letter, we can calculate its probability of occurrence in the sentence, multiply it by the logarithm of that probability, and finally take the sum over all the elements. We have thus seen how to arrive at an Entropy function that we can write in the Wolfram Language (a minimal version is sketched a little further below), and with it we can directly get a value for the entropy of our example sequence. In this case, our sequence of letters has Shannon Entropy of about 3.14 bits (taking the logarithm in base 2).

So, if the distribution is not typical in the sense of looking random, which would mean seeing every letter with about the same frequency, what do the distribution and this Entropy value tell us? Actually the sentence in English is telling us a few things, despite being so short. It is already suggesting that among the most frequent letters in English are, for example, 's' and 'e'. So you can see how a non-uniform distribution conveys some information about the underlying rules that made that sequence a sentence written in English. We will see more examples of this later.

The reason the logarithm is used in the formula for Entropy is that, besides having useful mathematical properties, such as being the inverse of exponentiation, being the result of integrating 1/x, and turning multiplication into addition (the logarithm of a product is the sum of the logarithms), the logarithm also helps in dealing with large numbers: it acts as a sort of natural mathematical compression method that works by changing the base of the alphabet used to encode information.

Let me explain with the following example. Imagine that you were asked to guess a number x and that the only information you had is that the number lies in some interval 1 <= x <= N. In this situation, you could ask one by one whether it is the number 1, the number 2, the number 3 and so on, and, if the number was picked at random, you would need about N/2 questions on average to guess the correct number (about 50 questions for N = 100). However, the optimal strategy for guessing a number is the binary search algorithm, which finds the number x in about log_2(N) questions, also written in O notation as O(log_2 N), where O stands for 'order', as in the order of some quantity. So let's say I tell you the number is between 1 and 100; the log_2 of 100 is about 6.64, so rounding up gives 7. It means that with about 7 yes-or-no answers (hence binary) you should be able to guess the number. So, let's say the number is 17; then you could ask the following questions. Is the number:

(1) between 1 and 50? In this case, yes.
(2) between 1 and 25? Yes.
(3) between 1 and 13? No.
(4) between 14 and 20? Yes.
(5) between 14 and 17? Yes.
(6) between 16 and 17? Yes.
(7) is it 16? No.

It is therefore 17, and it could have taken only 6. This strategy always works; try it for yourself by picking some other number. So the log_2 of an integer gives the number of binary (yes-or-no) questions needed to encode any number given in decimal representation. Can you think of other binary questions that lead to the correct answer in the same number of guesses? Do all binary questions lead to the same number of guesses?

The Wolfram Language has its own Entropy function that you can use directly. The first input parameter of the Entropy function is the base, which changes the base of the logarithm in the formula.
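Going back to the guessing game, here is a small sketch of such a binary search in the Wolfram Language; the function name guessNumber and its exact form are my own, meant only to illustrate the counting of yes-or-no questions:

    guessNumber[target_, max_] := Module[{low = 1, high = max, mid, questions = 0},
      While[low < high,
        mid = Floor[(low + high)/2];
        questions++;                       (* one yes-or-no question per halving *)
        If[target <= mid, high = mid, low = mid + 1]];
      {low, questions}]

    guessNumber[17, 100]  (* {17, 7}: the number is found after 7 questions *)
    N[Log2[100]]          (* 6.64386, which rounds up to 7 *)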
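And here is one way such a hand-written Entropy function could look next to the built-in one; the name entropyH and the default base of 2 are my own choices for this sketch:

    entropyH[str_String, b_: 2] := With[{p = N[Values[Counts[Characters[str]]]/StringLength[str]]},
      -Total[p Log[b, p]]]   (* H = minus the sum of p_i log_b(p_i) over the distinct symbols in str *)

    entropyH["this is not a sentence"]       (* about 3.14 bits *)
    N[Entropy[2, "this is not a sentence"]]  (* the built-in Entropy, with the base 2 as first argument *)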
We will usually take base 2, as we are interested in bits in this course unless otherwise stated; for sentences in human languages we can also take the natural logarithm. You can verify that my function gives the same values as the function implemented in the Wolfram Language:
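For instance, using the entropyH sketch from above and an illustrative binary string with four 0s and six 1s (not necessarily the one shown earlier), the check could look like this:

    s = "this is not a sentence";
    Chop[entropyH[s] - N[Entropy[2, s]]]    (* 0: both give the same value, about 3.14 bits *)

    bits = {0, 1, 1, 0, 1, 1, 0, 1, 1, 0};  (* four 0s and six 1s *)
    perms = Table[RandomSample[bits], 10];  (* 10 shuffled copies of the same bits *)
    Union[N[Entropy[2, #]] & /@ perms]      (* a single value, about 0.971: every permutation has the same Entropy *)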