In the previous section I walked you through a whole set of mathematical derivations designed to produce a maximum entropy model of an essentially toy problem: how to model the distribution of arrival times of New York City taxicabs based on a small set of data. That is a problem that might well be of interest to you personally, but it is certainly not of profound scientific interest. In this next section, what I am going to do is present a reasonably interesting scientific problem and show how the maximum entropy approach can illuminate some interesting features of that system. The particular system we have in mind is what people have tended to call the open source ecosystem. It is the large community of people devoted to writing code in such a way that it is open, accessible, and debuggable, and in general produced not by an individual, and not by a corporation with copyright protections and control over the source, but rather by a community of people sharing and editing each other's code. The open source ecosystem dominates a large fraction of the computer code run for us today, including not only Mac OS X but of course Linux. It is a great success story, and we would like to study it scientifically. I used the word ecosystem advisedly, in part because a lot of what I am going to tell you now is a set of tools and derivations that I learned from John Harte and the people who have worked with him on the maximum entropy approach not to social systems but to biological systems, and in particular ecosystems. John Harte's book "Maximum Entropy and Ecology" is a source of a lot more information on the kinds of tools I am going to show you now, and I recommend it to you. My goal here is really to show you that even simple arguments based on maximum entropy can provide some really deep scientific insight.
The source of data I am going to take, because I am now going to study the empirical world, is drawn from SourceForge. SourceForge is no longer the most popular repository of open source software - GitHub has perhaps eclipsed it by now - but for a long period, from roughly 1999 until 2011, when we gathered this data, it accumulated an enormous archive of projects ranging from computer games to text editors to business and mathematical software; some of the code I have used in my own research is up on SourceForge. It is a great place to study, in particular, the use of computer languages. Here, what I have plotted is the distribution of languages used in the open source community and found on SourceForge. On the x-axis is the log of the number of projects written in a particular language. You can see that at log zero, that is one project, there are about twelve languages in the database that have only of order one project each. These languages are extremely rare, in other words, in the open source movement. Conversely, at the other end of this logarithmic scale, at four - ten to the four, that is ten thousand - we see a small number of extremely popular languages. These are the most common languages you will find on SourceForge; they have a runaway popularity. And if you know anything about computer programming it will not surprise you to learn that these are mostly languages derived from C, such as C itself, C++, and Java. Somewhere in the middle, between the extreme rare birds and the incredibly common ones - the bacteria of the open source movement, if you like - you have a larger number of moderately popular languages. So this distribution of languages is what we are going to try to explain using maximum entropy methods: a small number of rare languages, a larger number of moderately popular ones, and then again a very small number of wildly popular languages.
So I plotted that as a probability distribution, P(n), where n is the number of projects in the open source community that use your language, so that P(n) is the probability that your language has n projects. What we would like to do is build a maximum entropy model of this distribution. Here I have represented the same data in a slightly different way, the way people tend to represent it: as a rank abundance distribution, what ecologists call a species abundance distribution. The top-ranked language, rank one, is the language with the largest number of projects, and it will not surprise you to learn that it turns out to be Java; there are about twenty thousand projects written in Java. You can see the second-ranked language, C++, then C, then PHP, and the far rarer languages down here have much lower ranks - higher numbers mean lower rank, as in third place, fourth place, hundredth place. Down here in hundredth place is the very unfortunate language called Turing, which, as far as I am aware, has only two projects written in it in the entire archive. And you can see some of my favorite languages, like Ruby, somewhere here in this moderate popularity zone. So this is the same data represented differently: what I have plotted here is log abundance on this axis against language rank on a linear axis, as opposed to before, where I showed you the log. That was a log-log plot; this is now a log-linear plot. So this is the actual data.
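As an aside, converting raw per-language project counts into a rank abundance curve like this one is simple enough to be worth seeing. Here is a minimal sketch in Python, using made-up counts purely for illustration, not the actual SourceForge figures:

```python
import math

# Hypothetical project counts per language -- illustrative only,
# not the real SourceForge numbers.
counts = {"Java": 20000, "C++": 18000, "C": 15000, "PHP": 12000,
          "Ruby": 900, "Turing": 2}

# Rank abundance: sort abundances in descending order; rank 1 is
# the most popular language.  Plotting log10(abundance) against
# rank gives the log-linear picture described above.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for rank, (lang, n) in enumerate(ranked, start=1):
    print(rank, lang, round(math.log10(n), 2))
```

The same sorted list serves for both plots; only the choice of axes (log-log versus log-linear) changes.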
So the first thing we will try is a maximum entropy distribution for language abundance - in other words, for the probability of finding a language with n projects - and we are going to constrain only one thing: the average popularity of a language. That is going to be our one constraint for the maximum entropy problem. We are going to pick the probability distribution p(n) that maximizes the entropy, negative the sum from n equals zero to infinity of p(n) log p(n), subject to that constraint on the average, and of course always subject to the normalization constraint, that the sum over n of p(n) is equal to unity. And we know how to do this problem already; we know the functional form. It is exactly the same problem as the one you learned to do when you modeled the waiting time for a New York City taxicab. There we constrained only the average waiting time; here we constrain only the average popularity of a language, where popularity means the number of projects written in that language in the archive. So we know what the functional form will look like: it is e to the negative lambda n, all over Z. Then all we have to do is fit lambda and Z so that we reproduce the correct average abundance that we see in the data. So this is the maximum entropy distribution; it is, of course, an exponential model - it has an exponential form. And if we actually find the lambda and Z that best reproduce the data - in other words, that satisfy this constraint and are otherwise maximum entropy - here is what we find. This red band here shows the one and two sigma contours for the rank abundance distribution, and all I want you to see on this graph is that it is an incredibly bad fit.
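Because the support here is the non-negative integers, this maximum entropy distribution is a geometric distribution in disguise, and fitting lambda and Z to a mean-popularity constraint can actually be done in closed form. Here is a minimal sketch, where the target mean is chosen arbitrarily for illustration rather than taken from the SourceForge data:

```python
import math

def maxent_exponential(mean_n):
    """Closed-form fit of p(n) = exp(-lam * n) / Z on n = 0, 1, 2, ...
    subject to <n> = mean_n.  This is a geometric distribution with
    ratio q = exp(-lam) = mean_n / (1 + mean_n)."""
    q = mean_n / (1.0 + mean_n)
    lam = -math.log(q)
    Z = 1.0 / (1.0 - q)  # geometric series: sum of exp(-lam * n), n >= 0
    return lam, Z

# Hypothetical average popularity, for illustration only.
lam, Z = maxent_exponential(250.0)
p = lambda n: math.exp(-lam * n) / Z

# Numerical check of both constraints over a long truncation.
total = sum(p(n) for n in range(20000))
mean_check = sum(n * p(n) for n in range(20000))
```

The check confirms that the distribution normalizes to one and hits the imposed mean, which is all the single constraint demands; whether the resulting shape matches the observed rank abundance curve is a separate question, answered by the plot.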
The maximum entropy distribution with this constraint does not reproduce the data; it is not capable of explaining, or modeling, the data in any reasonable way. It radically underpredicts the extremely popular languages - it is unable, in other words, to reproduce the fact that there are languages like C and Python that are extremely popular. It overpredicts the mesoscopic regime, the moderately popular languages, and it also overpredicts those really rare birds, the really low-rank languages with very few examples in the archive. So this, in short, is a science fail.