Shannon's original motivation was to come up with a measure useful for quantifying the channel capacity needed to send a binary message through telephone lines. It is often said that Shannon Entropy does not quantify meaning, which is, to some degree, correct in the context of communication theory, because Shannon Entropy seems not to care about the meaning of a message or its actual content. In other words, Shannon Entropy does not care whether you are trying to send a message such as "I'll meet you for lunch on Tuesday 2pm" or a possibly random-looking number such as "84592646". However, there is another angle of Shannon Entropy that suggests exactly the contrary, and it is both what makes this measure interesting and what makes it limited. Meaning in Shannon Entropy is deeply encoded in the form of the underlying working or assumed distributions, in other words, in the context in which a question about the Entropy of some specific feature of an element of a distribution is asked. For example, if the number "84592646" in our example were a telephone number, the underlying distribution would not be the distribution of all alphanumerical strings of any length but, for that example, the set of valid telephone numbers. So Shannon Entropy does distinguish between a sentence such as "I'll meet you for lunch on Tuesday 2pm" and a telephone number, provided the former is known to be a sentence written in English and the latter a valid telephone number. So if you have any knowledge of the ensemble behind your distribution, then Shannon Entropy is all about meaning. For example, the sentence "I'll meet you for lunch on Tuesday 2pm" only has meaning if you know that it is a sentence written in English and you know English; by knowing that this sentence belongs to the subset of well-formed English sentences, its Entropy becomes significantly lower than if we assume the string lives in the space of all possible letter and word arrangements, for which the Entropy would be larger.

The problem with Entropy is not that it is unable to convey or capture meaning but that it is ambiguous and fragile for exactly the same reason, the one related to probability distributions: Shannon Entropy by itself does not provide any means to estimate those distributions, so in practice it relies on traditional statistics or on the observer's beliefs or lack of knowledge. In general, one ends up assuming a uniform distribution, which reduces Entropy to a trivial symbol-counting function. Indeed, if the uniform distribution is assumed, as it is in most cases, what Shannon Entropy measures is the multiplicity of the different symbols used in a sequence, just as its counterpart measure of Entropy in physics counts the number of possible micro-states, such as particles or molecules in a given volume of space.

Leaving aside those arguments about meaning and the limitations of Entropy, there are interesting properties of Entropy worth mentioning and studying. For example, one general property of Shannon Entropy is that redundancy does not add new information, as one would theoretically expect: once the number of symbols or letters is fixed, the greater the redundancy, the lower the Entropy. For example, repeating the letter e at the end of some words provides no new information beyond the original sentence, and, as a function of the sequence length, the Entropy per symbol drops, as can be seen.
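A minimal sketch of this effect in the Wolfram Language, using the built-in Entropy function over the empirical character distribution; the sentence, the helper name h, and the range of repetitions are arbitrary choices made here for illustration:

    (* per-character Shannon Entropy (base 2) of a string *)
    h[s_String] := N[Entropy[2, Characters[s]]]

    sentence = "I'll meet you for lunch on Tuesday 2pm";

    (* Entropy per character after appending k redundant copies of "e" *)
    ListLinePlot[
      Table[{k, h[sentence <> StringRepeat["e", k]]}, {k, 1, 100}],
      AxesLabel -> {"repetitions of e", "bits per character"}]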
In this plot we can see that a growing sequence with an increasing number of trivial repetitions decreases the information content per digit or letter. In practice, with no means to calculate or make a well-educated guess of the underlying probability distributions, Entropy is indeed blind to meaning and is a measure of combinatorial diversity, so a sentence may have the same Entropy as some other, scrambled sentence as long as it uses the same letters. For example, a sentence and a scrambled arrangement of the very same letters have the same Entropy when considered against the ensemble of all possible sequences of the same length over the letters of the Latin alphabet.

Now, a generalization of the concept of a letter in a message is the concept of a micro-state. A micro-state can be almost anything; in this case it is a binary digit. Again, as long as there are the same number of 1s as 0s, and assuming equal probability for all sequences of the same length, a pseudo-random sequence of two thousand 0s and 1s produced by the function RandomInteger[] in the Wolfram Language has almost the same Shannon Entropy as the highly structured sequence obtained by repeating 01 a thousand times (a sketch follows below). A hundred experiments, each producing a pseudo-random sequence of 100 binary digits, show that all of them come very close to 1 bit. This is because the intended behaviour of a pseudo-random generating function like RandomInteger[], if it is a good pseudo-random number generator, is to produce about the same number of 1s and 0s in a disarranged fashion. So the Shannon Entropy of these sequences will be close to log 2, that is, about 1 bit in base two, with a very small standard deviation among the trials, that is, among many different pseudo-random arrangements.

Consider now the Entropy of a sequence of results of tossing a coin, represented by the letters h and t for heads and tails. The Entropy of such a sequence is maximized when the coin is fair, producing about the same number of h's and t's, that is, when both outcomes have equal probability 1/2, or 50:50. This is the situation of maximum uncertainty, as it is most difficult to predict the outcome of the next toss in the sequence; each toss of the coin is said to deliver one full bit of information when you have no idea what will come next. However, if we know the coin is not fair but comes up heads more often than tails, then there is less uncertainty, because you have some information about the outcome: you know that it will come up h or t with greater or lesser probability, and every time it is tossed, one side is more likely to come up than the other. This reduced uncertainty is quantified by Entropy. If the coin is completely biased and always comes up heads, the Entropy will be zero, because each toss of the coin delivers no new information, as the outcome of each toss is completely determined. This is what the typical textbook example of the Shannon Entropy of a random variable looks like, and it is important that you remember it, because when we introduce new tools in this course we will make comparisons to Entropy. So maximum fairness is reached when the coin produces the same number of heads and tails, that is, a 50% chance of being h or t, where uncertainty is maximal. Note that the maximum value of the curve depends on the logarithmic base and the distribution, which are the two parameters of Shannon Entropy. Here, the Entropy is at most 1, the value reached at the uniform (fair) distribution, and we use base two, for which the result is said to be in bits (also sketched below).
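Here is a minimal sketch in the Wolfram Language of both observations, using the built-in Entropy and RandomInteger functions; the variable names and the choice of a hundred trials of length 100 simply follow the description above, and the exact numbers vary slightly from run to run:

    (* the highly structured sequence: 01 repeated a thousand times *)
    ordered = Flatten[ConstantArray[{0, 1}, 1000]];

    (* a pseudo-random sequence of two thousand 0s and 1s *)
    random = RandomInteger[1, 2000];

    (* taking single bits as micro-states, both give (almost) 1 bit per digit *)
    N[{Entropy[2, ordered], Entropy[2, random]}]

    (* a hundred experiments, each a pseudo-random sequence of 100 binary digits *)
    trials = Table[N[Entropy[2, RandomInteger[1, 100]]], {100}];
    {Mean[trials], StandardDeviation[trials]}

The structured sequence gives exactly 1 bit, the pseudo-random one a value extremely close to it, and the hundred trials cluster tightly around 1 bit with a very small standard deviation.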
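And here is a sketch of the coin-toss picture, written directly from the formula H(p) = -p log2(p) - (1 - p) log2(1 - p), the Entropy in bits of a coin that lands heads with probability p; the name hCoin and the example bias of 9/10 are illustrative choices only:

    (* Shannon Entropy, in bits, of a coin that lands heads with probability p *)
    hCoin[p_] := If[p == 0 || p == 1, 0, -p Log[2, p] - (1 - p) Log[2, 1 - p]]

    (* fair coin: 1 bit; biased coin (9/10 heads): less than 1 bit; fully biased: 0 *)
    {hCoin[1/2], N[hCoin[9/10]], hCoin[1]}

    (* the familiar curve: maximal at p = 1/2, zero at p = 0 and p = 1 *)
    Plot[hCoin[p], {p, 0, 1}, AxesLabel -> {"p(heads)", "bits"}]

The maximum value of 1 in this curve comes from choosing base two; a different base or a larger alphabet would change it.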
From that plot, you can see how Entropy is a function of symbol density or symbol count. One can construct examples of binary and ternary sequences in which both the number of distinct symbols and the number of repetitions determine their Shannon Entropy, whether computed with natural or with binary logarithms. You can see that, in this particular case, when taking every bit as the micro-state for Entropy, the arrangement of the 0s and 1s is irrelevant as long as the number of 0s and 1s remains the same. Entropy values also vary significantly as a function of the number of potential symbols available. One way to tell apart cases that have basically the same Entropy, such as 0 and 1 repeated in strict alternation versus randomly arranged, is to take different micro-state lengths, that is, to coarse-grain the sequence. For example, taking units of 2 bits as the micro-state for the measurement of Entropy, the ordered sequence gets an Entropy of 0, but if we do the same with the random-looking sequence we now get a value that diverges from the ordered case (a sketch is given at the end of this section). We will later see how this idea leads to what is called the Entropy rate, the best version of Entropy.

But it is important to see how the assumptions about the underlying probability distributions drive Entropy. For example, if the Entropy values of a message and of its scrambled version are the same, it is only because we may lack information about the source of the message and the language in which it was written, and thus about the meaning it may carry for speakers of that language. So Entropy is not only about the internal syntactic structure of a message but also about how much we know about the underlying ensembles and assumed distributions; it is therefore, in a way, a highly epistemological measure, although in practice it is reduced to a syntactic one in the face of a complete lack of information.
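To make the coarse-graining step concrete, here is a minimal sketch in the Wolfram Language, redefining the ordered and pseudo-random sequences so the block is self-contained; Partition groups the bits into 2-bit micro-states:

    ordered = Flatten[ConstantArray[{0, 1}, 1000]];  (* 01 repeated a thousand times *)
    random = RandomInteger[1, 2000];                 (* pseudo-random bits *)

    (* single-bit micro-states: both sequences sit at about 1 bit *)
    N[{Entropy[2, ordered], Entropy[2, random]}]

    (* 2-bit micro-states: the ordered sequence contains only one block type,
       {0, 1}, so its Entropy collapses to 0, while the pseudo-random one uses
       about four block types in roughly equal proportion *)
    N[{Entropy[2, Partition[ordered, 2]], Entropy[2, Partition[random, 2]]}]

Under this coarse-graining the ordered sequence drops to zero Entropy while the pseudo-random one comes out close to 2 bits, which is the divergence mentioned above.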