So now, instead of considering the masses of books, we’re gonna consider something much more interesting and subtle, and I think surprising: the frequency of words inside of books. So here is a book, and different words appear in it: “were” and “move” and “positions” and so on. Some of these words are very common and appear quite frequently, and other words are much rarer. So what would the distribution of these frequencies look like? I wanna illustrate this by thinking not about an entire book, but just a short quotation first. So here is a passage from the writing of Henri Poincaré, a mathematician who helped found the study of chaos and dynamics. And I should mention that I have a PDF of this slide and the next one, in case this is not readable or you wanna print it out and work through this on your own. So here’s the passage:
“The scientist does not study nature because it is useful; he studies it because he delights in it, and he delights in it because it is beautiful. If nature were not beautiful it would not be worth knowing, and if nature were not worth knowing, life would not be worth living.”
So let’s think about how often certain words appear here. The word “it” appears 6 times in this passage. Let’s see if I can find them all: 1, 2, 3, 4, 5, 6. So there are 6 occurrences of the word “it”. The word “not” appears 5 times. And then there are 4 words that appear 3 times. For example, “nature” appears 3 times. Let’s see: 1, 2, 3; there are the 3 instances of “nature”. There are a bunch of words, 10 of them, that appear twice. And then there are 8 words that appear once. So I’ve got a bunch of different words with different frequencies. What I’m gonna do now is consider not the words themselves but just the frequencies. So there is one 6, one 5, four 3s, ten 2s, and eight 1s. Those are the word frequencies. And I could take these data and turn them into a histogram.
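The word counting described above can be reproduced in a few lines. This is only a minimal sketch, not how the lecture's counts were produced: the name `count_words` and the rule that a "word" is any run of letters (ignoring case and punctuation) are assumptions made for illustration.

```python
import re
from collections import Counter

# The Poincaré passage quoted in the lecture.
PASSAGE = (
    "The scientist does not study nature because it is useful; "
    "he studies it because he delights in it, and he delights in it "
    "because it is beautiful. If nature were not beautiful it would not "
    "be worth knowing, and if nature were not worth knowing, life would "
    "not be worth living."
)


def count_words(text):
    """Count how often each word occurs, ignoring case and punctuation."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


word_counts = count_words(PASSAGE)

# List the words from most frequent to least frequent.
for word, count in word_counts.most_common():
    print(f"{word:10s} {count}")
```

Under this definition of a word, the counts agree with the ones read off the slide: “it” 6 times, “not” 5 times, four words 3 times, ten words twice, and eight words once.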
To make the histogram, I just count how often each of these different numbers occurs. So let’s see what happens if I do that. There is one occurrence of 6, so the histogram bar for 6 is gonna be 1. There is one occurrence of 5. There are zero 4s, so that bar has height 0. There are four 3s: 1, 2, 3, 4. Then we had ten 2s and eight 1s. So this histogram is derived from these data, which are derived from counting word frequencies. What this means is that in this passage there are 8 words that appear once, 10 words that appear twice, 4 words that appear 3 times, 0 words that appear 4 times, one word that appears 5 times, and one word that appears 6 times.
So what we’ll do next is consider word frequencies not for a couple of sentences but for a much, much longer piece of text: the novel Moby Dick, by Herman Melville. I have to confess I’ve never read it, and to be honest I probably never will. But I have made plots of its word frequencies; maybe that’s a little bit more fun. Anyway, the total number of words in Moby Dick is about 210,000, and the number of different words is about 18,800. So let’s think about what we might expect. Are we gonna see that maybe most words are used 3 or 4 times? Or are most words gonna be used only once or twice? Well, let’s see. So here’s the result of making a histogram of word frequencies for Moby Dick. I’ll put information about the source for these data here, and also put a link on the page for this video. So here’s what this says. This is the frequency, and this is the number of occurrences. So this says there are, let’s see, something like more than 8,000 words that appear only once. There are around 3,000 words that appear twice. There are fewer words still that appear 3 times and 4 times and 5 times. So the most common frequency for words in this novel is 1, and then it drops off very quickly.
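The histogram step, counting how many words share each frequency, can be sketched as follows. The function name `frequency_histogram` is hypothetical, and the input is the list of frequencies from the short Poincaré passage as stated in the lecture, not the Moby Dick data.

```python
from collections import Counter


def frequency_histogram(frequencies):
    """Map each frequency value to the number of words that have it."""
    tally = Counter(frequencies)
    # Include empty bins (such as 4 in the passage) so gaps stay visible.
    return {f: tally.get(f, 0) for f in range(1, max(frequencies) + 1)}


# Frequencies from the short passage: one 6, one 5, four 3s, ten 2s, eight 1s.
passage_frequencies = [6, 5] + [3] * 4 + [2] * 10 + [1] * 8

histogram = frequency_histogram(passage_frequencies)

# Print a simple text histogram, one row per frequency.
for frequency, n_words in histogram.items():
    print(f"{frequency}: {'#' * n_words} ({n_words})")
```

For a whole novel, the same function would presumably be applied to the list of per-word counts from the full text; only the input changes.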
So when you make a histogram you can count the number of occurrences, but you can also think in terms of probability. So here’s the same data, and all I’ve done is plot the y-axis in terms of probability. Here’s what this says: of the 18,000 or so unique words that appear in Moby Dick, almost half appear only once. I think that’s surprising and really interesting. And then about 18 percent of all these words appear only twice, and so on. So we’re seeing something that has a very big peak at the small value 1 and drops off very quickly.
What does this look like if I go out further still? Well, let’s see. So here I’m plotting not out to 30 but out to 300. So how many words appear 300 times in the novel, or 250 times, or 260 times? Well, clearly not very many. Maybe it’s 0; it’s hard to tell, because at the scale of this plot things get scrunched down a lot. So what we’ll do is look at a log-log plot of this. I’m just gonna take the logarithm of the frequency and the logarithm of the probability and plot those. So here’s what happens if I do that. This is the log of the frequency, and this is the log of the probability. And we can see that there is a straight line here, to a pretty good approximation. We saw straight lines like this a lot in the last unit, when we were doing box counting, so we would expect this to be described by a formula similar to that box-counting formula. We’ll take a look at that in a second. But first let me just point out a few more things about this plot. Again, down here, this says that there are a lot of words that appear only once. But what about this word? This is the most common word; it has the largest frequency. For Moby Dick it’s the word “the”, and it appears 14,086 times.
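The two transformations described above, rescaling counts to probabilities and then taking logarithms of both axes, can be sketched like this. Everything here is an assumption for illustration: the function names are hypothetical, the least-squares slope estimate is not something the lecture performs, and the input is synthetic data built to follow an exact inverse-square rule, not the Moby Dick counts.

```python
import math


def to_probabilities(histogram):
    """Divide each count by the total so the values sum to 1."""
    total = sum(histogram.values())
    return {f: n / total for f, n in histogram.items()}


def log_log_points(probabilities):
    """Return (log10 frequency, log10 probability) pairs, skipping empty bins."""
    return [(math.log10(f), math.log10(p))
            for f, p in sorted(probabilities.items()) if p > 0]


def fit_slope(points):
    """Ordinary least-squares slope of y against x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic stand-in data (NOT the Moby Dick counts): the number of words
# with frequency f is proportional to f**-2, an exponent chosen only
# for illustration.
synthetic_histogram = {f: 10000 / f**2 for f in range(1, 301)}

points = log_log_points(to_probabilities(synthetic_histogram))
slope = fit_slope(points)
print(f"fitted slope on the log-log plot: {slope:.2f}")
```

Because the synthetic counts follow an exact inverse-square rule, the points fall on a straight line of slope -2 in log-log coordinates. Real word-frequency data would only be approximately straight, and the slope would have to be estimated from the data file.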
So we have a lot of words that appear exactly once, and then we have one word, “the”, that appears 14,000-some-odd times. That’s a really big range of frequencies, a lot bigger than the range of book masses that I had before. Let’s see. The second most common word, just in case you’re curious, is “of”, which appears 6,414 times. Then we’ve got “and” and “a”. And you could look at the data file and get the rest of these. So there are a few words in Moby Dick, and in English in general, that appear very, very often; this is a general feature of languages. And then there are lots and lots and lots and lots of words that appear very rarely, in this particular novel only once.