In the previous unit, I showed you a way to rescue the MaxEnt description of the language abundance distribution P(n) by reference to a hidden variable, epsilon, which we referred to as "programmer time". We assumed that the system was constrained in two ways: we fixed not only the average number of projects per language, the average popularity, but also the average programmer time devoted to projects in a particular language. That gives a joint distribution over n and epsilon, and when we integrate out epsilon, we get a different prediction for the language distribution, one that I argue looks reasonably good. It certainly looks better than the exponential distribution. But I feel honor-bound to tell you about the controversy that arises when we try to make these models, and in particular about a very different mechanistic model that makes quite similar predictions. The MaxEnt result here is the Fisher log series, and the argument behind the Fisher log series involves the idea of a "hidden" additional constraint. In the open-source question, I've described that additional constraint as "programmer time", just because it seems like it might be a constraint in the system: the average programmer time for languages is fixed, not just the average number of projects. That means languages can vary not only in their popularity but also in their efficiency. In the ecological modelling where this distribution originates, languages are species, n is the abundance of a species, meaning the number of individual instances of the species in the wild, and epsilon is the metabolic rate: how much energy a particular individual of that species consumes.
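As a sketch of that integration step (the multiplier symbols lambda_1 and lambda_2 are my own labels, and I am assuming, as in the ecological version of the argument, that a language with n projects consumes a total of n times epsilon units of programmer time):

```latex
% MaxEnt with two constraints: fixed mean abundance <n> and
% fixed mean total programmer time <n * epsilon>.
P(n, \epsilon) \;\propto\; e^{-\lambda_1 n \,-\, \lambda_2 n \epsilon}

% Integrating out the hidden variable epsilon over [0, infinity):
P(n) \;\propto\; \int_0^{\infty} e^{-\lambda_1 n - \lambda_2 n \epsilon}\,\mathrm{d}\epsilon
     \;=\; \frac{e^{-\lambda_1 n}}{\lambda_2\, n}
     \;\propto\; \frac{x^{n}}{n}, \qquad x \equiv e^{-\lambda_1}
```

That extra factor of 1/n, relative to the pure exponential x^n, is what distinguishes the Fisher log series from the single-constraint MaxEnt prediction.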
And so in the ecological case, the system is constrained to a certain average species abundance and a certain average species energy consumption; here, languages are constrained to a certain average abundance and a certain average consumption of programmer energy. That's the analogy. Now, we can also build a mechanistic model of programming-language popularity. Previously, when we studied the taxi-cab problem, we were able to find a very simple mechanistic model that made the same predictions as the MaxEnt model. Here, by contrast, the mechanistic model will produce similar behavior, but its functional form will actually be slightly distinct. So, here's the mechanistic model. We imagine that languages all start out with a baseline popularity: whoever invents the language has to write at least one project, so there is at least one programmer, at the moment of a language's invention, who knows how to program in that language, somewhat by definition. There are then two ways that popularity can grow. It can grow, for example, linearly: on day 1 there's 1 programmer; on day 2 that programmer is joined by another; on day 3 those two are joined by a third; and so over time the number of programmers grows linearly in time. But perhaps a more plausible model for how languages accrue popularity is multiplicative. At time 1 there's 1 programmer, and he has some efficiency at converting other programmers to his cause. Maybe he's able to double the number of programmers, because his language is particularly good, or perhaps because people who like to program in that language happen to be particularly persuasive.
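The contrast between the two growth rules can be sketched in a few lines (a minimal illustration of my own, not code from the lecture):

```python
# Two toy growth rules for language popularity, starting from the
# single founding programmer on day 0.

def linear_growth(days, recruits_per_day=1):
    """One new programmer joins each day: n(t) = 1 + recruits_per_day * t."""
    n = 1
    for _ in range(days):
        n += recruits_per_day
    return n

def multiplicative_growth(days, factor=2):
    """Each programmer recruits, multiplying the total: n(t) = factor ** t."""
    n = 1
    for _ in range(days):
        n *= factor
    return n

print(linear_growth(3))          # day 0 -> 3: 1, 2, 3, 4
print(multiplicative_growth(3))  # day 0 -> 3: 1, 2, 4, 8
```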
And so on the second day, those two programmers each go out and convert two more people, because they are just as effective as the original programmer and the language itself is just as convincing as it was before. So each of those two programmers gathers two more, and we go to 4; by the same argument we go to 8. This is the exponential growth model, where the number of programmers as a function of time increases multiplicatively, as opposed to additively. Now let's make this model a little more realistic. In particular, let's allow the multiplicative factor, which in this case we set to 2, to vary: we're going to draw a multiplicative factor, alpha, from some distribution. It doesn't really matter what that distribution is, as long as alpha is always strictly greater than 0, so it's not possible for all programmers to suddenly disappear, and bounded above at some point, so it's impossible for a language to become infinitely popular after a finite number of steps. Each day we draw a number, alpha, from this distribution. After one day, there are alpha(1) programmers (this is the draw on the first day); after two days, there are alpha(2) times alpha(1) programmers; after three days, alpha(3) times alpha(2) times alpha(1); and so on. This is growth through a random multiplicative process. It's similar to growth through a random additive process, except that instead of adding a random number of programmers each day, you multiply the total number of programmers each day by some factor alpha drawn from this distribution. And you can always convert this multiplicative process into an additive process by a very simple trick: taking the logarithm.
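Here is a small sketch of that random multiplicative process, including the logarithm trick. The choice of a uniform distribution on [0.5, 2.0] for alpha is my own; per the argument above, any bounded distribution with alpha strictly greater than zero would do:

```python
import math
import random

def simulate_popularity(days, rng):
    """Popularity after `days` random multiplicative kicks, starting from 1."""
    n = 1.0
    for _ in range(days):
        n *= rng.uniform(0.5, 2.0)   # alpha(t): the random daily boost
    return n

rng = random.Random(0)
n = simulate_popularity(100, rng)

# The logarithm trick: multiplying the alphas is the same as adding their logs.
rng2 = random.Random(0)              # same seed, so the same alpha draws
log_n = sum(math.log(rng2.uniform(0.5, 2.0)) for _ in range(100))

print(abs(math.log(n) - log_n) < 1e-9)  # identical trajectory, additive in log space
```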
Over time, if we count programmer numbers, we're multiplying; but if we work in log space, we're just adding, adding a random number, log alpha, each day, and as long as alpha is strictly greater than zero these logarithms are always well defined. Now, all of a sudden, it looks like the additive model in log space. And what we know from the central limit theorem is that if you add together lots of random numbers, the distribution of the sum tends towards a Gaussian, with some particular mean mu and variance sigma squared. Let's not worry about what mu and sigma are in particular, but rather note that the growth happens in log space: the distribution of these sums over long time scales ends up looking like a Gaussian, so the accumulated boost to a language looks, in log space, Gaussian. What that means is that the exponential growth model with random multiplicative kicks actually produces a Gaussian in log space, or what we call a log-normal in actual number space. If, instead of the logarithm of a language's popularity, you look at the popularity n itself, the density looks like the exponential of minus (log n minus mu) squared, over two sigma squared, and then you just have to be careful to normalize things properly (a factor of 1/n appears when you change variables from log n back to n). So this is the log-normal distribution. And it comes from a mechanistic model where language growth happens multiplicatively: where a language gains new adherents in proportion to the number of adherents it already has, and new projects in proportion to the number of projects it already has, at a rate that depends on the environment. That's where the multiplicative randomness comes from: alpha is a random number, not a constant. It's not 2; the language doesn't always necessarily double.
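We can see both faces of that claim in simulation. The parameters below (uniform alphas on [0.5, 2.0], 50 days, 10,000 simulated languages) are my own choices for illustration: in log space the popularities come out roughly symmetric, as the central limit theorem says, while in raw number space the distribution is heavily right-skewed, as a log-normal should be:

```python
import math
import random
import statistics

rng = random.Random(42)

def grow(days):
    """One language grown through `days` random multiplicative kicks."""
    return math.prod(rng.uniform(0.5, 2.0) for _ in range(days))

pops = [grow(50) for _ in range(10_000)]
logs = [math.log(p) for p in pops]

# In log space the distribution is roughly Gaussian, hence symmetric:
# the mean and median of log-popularity nearly coincide.
print(statistics.mean(logs) - statistics.median(logs))

# In raw number space it is log-normal, hence strongly right-skewed:
# a few very popular languages drag the mean far above the median.
print(statistics.mean(pops) / statistics.median(pops))
```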
But the fact that a language grows through a multiplicative random process, as opposed to an additive one, means that you get log-normal growth. And so now you can say, "OK, let's imagine that languages grow through this log-normal process, and let's find the best-fit parameters mu and sigma." If you do that, you find that the mechanistic log-normal model looks pretty good as well. We were impressed by how well the blue line, the Fisher log series, fit this distribution compared to the red exponential model, the MaxEnt model constraining only n. Unfortunately for the MaxEnt story, the mechanistic model, of which I've given you a short account here, adding together lots of small multiplicative random kicks, also works. I will tell you which fits better: if you do a statistical analysis, and both of these models have two parameters, the Fisher log series actually fits better. In particular, it's better able to explain the really high-popularity languages. The deviations up here may look larger than the deviations down here, but you have to remember that this is a log scale, so the Fisher log series gets much closer at the top than the log-normal does. So the mechanistic model, at least visually, looks extremely competitive with the Fisher log series model derived from the MaxEnt argument; statistically speaking, if you compare the two, the log-normal is actually slightly dispreferred. But like many people, what you want is some ironclad evidence for one versus the other. And I think the best way to look for that kind of evidence is to figure out what, if anything, this epsilon really is in the real world. If we were able to build a solid theory of what epsilon is, and how to measure it in the data, then we could see whether this joint distribution is well reproduced.
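The remark about the log scale is worth making concrete (the numbers below are my own illustration, not the lecture's data): the same visual gap on a logarithmic axis corresponds to very different absolute errors at the low- and high-popularity ends of the plot.

```python
import math

# A model that is 25% off at both ends of the distribution.
low_pred, low_obs = 8, 10          # low-popularity end of the plot
high_pred, high_obs = 800, 1000    # high-popularity end of the plot

# The gap is identical in log space (equal visual distance on a log axis) ...
print(math.log10(low_obs) - math.log10(low_pred))
print(math.log10(high_obs) - math.log10(high_pred))

# ... but the absolute error differs by a factor of 100.
print(low_obs - low_pred)    # 2
print(high_obs - high_pred)  # 200
```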
We could look for evidence, for example, that these two quantities co-vary: that there is a term that boosts the probability of a language configuration as it becomes more efficient, so that if epsilon goes down, n can go up and the language can still have the same probability of being found with those properties. And of course, the problem is that we don't know how to measure this somewhat mysterious programmer time, or programmer efficiency. The ecologists have a much easier time with this, because they know what their epsilon is: metabolic energy intake, that is, how much a particular individual of a species consumes in energy over the course of a day, or over the course of its lifetime. They're able to measure that, and in fact they're able to measure the joint distribution. When we come to study the open-source ecosystem, so far we don't really have a way to measure epsilon, so we're unable to measure the joint distribution. And so we're left with two stories: on one side, a mechanistic popularity-accrual model, and on the other, a model that posits two constraints on the system, the average number of projects and the average programmer time.