It is important to understand how entropy works in the face of multiple variables, because one of the main applications of classical information theory is to try to find common or shared statistical properties across different objects. And, while we will use some basic concepts from entropy, we will also be replacing it in some applications with measures of an algorithmic nature, as opposed to measures of a statistical nature such as entropy. Others have already done a great job of describing the many versions and types of applications of entropy, so this is only a brief overview of the ways in which entropy approaches and statistical information can be helpful for our own discussion when moving to algorithmic complexity.

In traditional information theory, one of the things that can be done is to ask what kind of similar statistical properties two or more objects may share. Joint entropy is a measure associated with a set of random variables. A random variable can be a binary sequence, and the events of a sequence are its bits. Joint entropy is simply the entropy of two variables taken together, and, in general, it can be defined for any number of variables. For example, you may want to know the joint entropy of two random variables related to weather. Let's say that we are interested in whether today will be a sunny or a rainy day, and whether it will be hot or cool, and that we have empirically calculated these probabilities based on previous days' and previous years' records. Then the joint entropy of the variables X and Y follows from the joint entropy formula. The resulting value indicates that these variables are not independent, because the joint entropy is smaller than the sum of the entropies of the single variables alone. This is because there is a greater chance of a hot day if it is sunny than if it is rainy; and, if it is rainy, it is also more likely to be cooler, according to the empirical distributions that we assumed and calculated in this example. Something important to notice is that to apply entropy one needs to be able to properly calculate those joint distributions.

Now, if you want to know how much of the statistical properties of one sequence, X, you can guess from the statistical properties of another sequence, Y, then we are talking about what is known as "conditional entropy." Conditional entropy quantifies the amount of information needed to describe the outcome of a random variable, Y, given the value of another random variable, X. For example, let's say we have two sequences with very similar statistical properties. Their conditional entropy should be very low, because knowing one of them leaves very little uncertainty about the other. You can write a function for conditional entropy in the Wolfram Language yourself; a sketch is given below. There is also an undocumented function in some versions of the Wolfram Language to calculate conditional entropies directly, but you should not tell Wolfram I told you about it; it may not be available on all platforms and versions of Mathematica, but you can try it. As extreme cases, the conditional entropy, denoted by H of Y given X, is equal to zero if and only if the value of Y is completely determined by the value of X. Conversely, the entropy of Y given X is equal to the entropy of Y alone if and only if Y and X are independent random variables, which means that X does not provide any information about Y, and thus asking for the entropy of Y given X is simply equal to asking for the entropy of Y directly, with no access to X.
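As a minimal sketch of how such functions might look in the Wolfram Language, assuming the sequences are given as two equal-length lists of symbols and that probabilities are estimated from empirical frequencies: the helper names jointEntropy and conditionalEntropy here are my own, and only Entropy is a built-in function.

    (* Empirical joint entropy in bits: each pair of aligned values is treated as one joint symbol *)
    jointEntropy[x_List, y_List] := N[Entropy[2, Transpose[{x, y}]]]

    (* Conditional entropy via the chain rule: H(Y|X) = H(X,Y) - H(X) *)
    conditionalEntropy[y_List, x_List] := jointEntropy[x, y] - N[Entropy[2, x]]

    (* Two toy sequences with very similar statistical properties *)
    x = {0, 1, 0, 1, 0, 1, 0, 1};
    y = {1, 0, 1, 0, 1, 0, 1, 0};
    conditionalEntropy[y, x]  (* 0 here: each bit of y is completely determined by the corresponding bit of x *)

For these toy sequences the conditional entropy is exactly zero, the first extreme case mentioned above; with noisier pairs of sequences the value would lie somewhere between zero and the entropy of Y alone.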
The mutual information of two random variables or random sequences, as defined in terms of Shannon entropy and denoted by the letter I, is a measure of the variables' common statistical properties; that is, the amount of statistical information that one variable may tell us about the other. In terms of joint entropy, I can be defined as the joint entropy of X and Y, minus the conditional entropy of X given Y, minus the conditional entropy of Y given X; that is, all the information that can be known about each variable or sequence from the other. Mutual information is "commutative," meaning that the order of the sequences does not matter: whatever you can guess of X from Y, you can also guess of Y from X. This is a problem that entropy carries with it. The fact that it is a symmetric function means that it does not allow you to determine a causal direction when dealing with causality, but only a correlation. There are ways to fix this by introducing time, but this only partially fixes the problem, because entropy will miss anything that is not of a statistical nature, as we will later see, such as algorithmic connections between the variables.

Conditional entropy and mutual information can be thought of as complementary operations. The conditional entropy of Y given X is a measure of what X does not say about Y, or the amount of information remaining about Y after knowing X. And the mutual information between X and Y is a measure of what Y says about X, or how much information X shares with Y. You will find a very useful Venn diagram explaining each of these measures in most textbooks about information theory.

In science in general, we are interested in causal relationships. That is, what is the cause of events, such as: what is the cause of lung cancer? Do people who smoke have greater chances of developing lung cancer? As you can see, this sounds very much like mutual information and conditioning. For example, you can think of whether knowing that someone smokes two packs of cigarettes every day will tell you anything about that person getting cancer, and one can calculate all the joint probabilities and apply information theory to get an answer. Imagine that you perform an experiment where you find out that people who are more stressed also have greater chances of developing cancer. But what happens in reality is that people who smoke are more likely to smoke more when they are stressed. So, at best, the stress is only indirectly causing cancer; it is not a direct cause, because there may be stressed people who do not relieve their stress by smoking and who thus do not increase their chances of developing lung cancer. It is often the case that one cannot isolate these causes and cases in some studies because of a lack of control over the variables. But, using conditional mutual information, one may sometimes be able to tell direct causes from indirect causes. For example, whether it is statistically more informative to know that someone is stressed or that someone is a smoker, with respect to the risk of developing lung cancer, is something we can explore with classical information theory and Shannon entropy. We basically want to disentangle the following causal scenario, in which we believe that smoking is the direct cause, but we want to test whether being stressed is equally informative or only an indirect cause, by way of stressed people who decide to smoke.
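Before setting up that test, it is worth writing mutual information itself down in code. This is again only a sketch, reusing the hypothetical jointEntropy and conditionalEntropy helpers from above; it also makes the symmetry just mentioned concrete.

    (* Mutual information in bits: I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X),
       equivalently I(X;Y) = H(X) + H(Y) - H(X,Y) *)
    mutualInformation[x_List, y_List] :=
      jointEntropy[x, y] - conditionalEntropy[x, y] - conditionalEntropy[y, x]

    (* Symmetry: whatever X tells us about Y, Y tells us about X *)
    a = {0, 0, 1, 1, 0, 1, 0, 1};
    b = {0, 1, 1, 1, 0, 1, 0, 0};
    {mutualInformation[a, b], mutualInformation[b, a]}  (* identical, up to numerical round-off *)

This symmetry is exactly why mutual information by itself reports a correlation rather than a causal direction, which is what the smoking example below tries to tease apart.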
The first thing to notice is that we have three random variables, which in this case may be binary, though they could also be given weights. The variables are: being a smoker, which can have the answer yes or no; being stressed; and having lung cancer. So, if we want to test whether being stressed or being a smoker carries the greater risk of developing lung cancer, we can write both statements as conditional entropies and compare them; a sketch is given at the end of this section. If the two turn out to be equal, this specific test may fail to tell us the direct cause, and we would need to come up with a more sophisticated case. One question would be whether such an equality tells us something about the entropy of being a smoker given being stressed, or the other way around: clearly, if the entropy of being a smoker given being stressed is equal to the entropy of being stressed given being a smoker, then both would be equally informative. Would it imply that virtually every smoker is stressed, and the other way around? Maybe you can think about it.

We may find that the uncertainty, or degree of surprise, of having cancer given that one is a smoker is much less than the uncertainty of having cancer given that one is stressed, which would mean that it is more likely to develop cancer by virtue of being a smoker than by being stressed. We know, however, that stressed people and smokers are not disjoint sets of people, because many stressed people decide to smoke. So, think about whether it makes sense to ask about the mutual information of cases such as the mutual information between cancer and being a smoker given being stressed, versus only the mutual information between cancer and being a smoker, and what values of each would tell us something about the relationship among these three variables. We can see how classical information theory and Shannon entropy provide a nice language for dealing with probability distributions and framing these kinds of questions in a statistical framework, and one can also see that how good the answer is depends entirely on how much we know about, or how well we can approximate, all the empirical distributions involved.
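As before, here is only a hypothetical sketch in the Wolfram Language of how those comparisons might be set up, reusing the helpers from above. The 0/1 lists are placeholder data, one entry per person (smoker?, stressed?, lung cancer?), and jointEntropy3 and conditionalMutualInformation are my own helper names, not built-ins.

    (* Placeholder 0/1 records, purely for illustration *)
    smoker   = {1, 1, 0, 1, 0, 0, 1, 0, 1, 1};
    stressed = {1, 0, 0, 1, 1, 0, 1, 0, 1, 1};
    cancer   = {1, 1, 0, 1, 0, 0, 1, 0, 0, 1};

    (* The two competing statements: is the uncertainty about cancer smaller
       given that a person is a smoker, or given that a person is stressed? *)
    conditionalEntropy[cancer, smoker] < conditionalEntropy[cancer, stressed]

    (* Conditional mutual information: I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) *)
    jointEntropy3[x_List, y_List, z_List] := N[Entropy[2, Transpose[{x, y, z}]]]
    conditionalMutualInformation[x_List, y_List, z_List] :=
      jointEntropy[x, z] + jointEntropy[y, z] - jointEntropy3[x, y, z] - N[Entropy[2, z]]

    (* Does stress still inform us about cancer once smoking is accounted for, and vice versa? *)
    conditionalMutualInformation[cancer, stressed, smoker]
    conditionalMutualInformation[cancer, smoker, stressed]

If the first of these quantities drops close to zero while the second stays clearly positive, that is the kind of evidence suggesting that smoking, rather than stress, is the direct statistical contributor; but, as noted above, how much one can trust such a conclusion depends entirely on how well the empirical distributions have been estimated.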