A post from Shady Characters

Logarithmical: Zipf’s Law and the mathematics of emoji

This is the most recent in a series of fifteen posts on Emoji (😂). Start at PART 1 or view ALL POSTS in the series.


Back in the mists of time, I wrote about a peculiar property of words called Zipf’s Law. The idea is quite simple: the frequency at which different words occur in any large body of text, ordered from the most common to the least common, follows a predictable pattern. This is true across languages, and even in texts we haven’t yet deciphered, such as the Voynich Manuscript.1

In formal terms, Zipf’s Law states that the frequency of any particular word — call it the nth word — is, to a decent approximation, inversely proportional to n. More specifically, the probability of encountering a word with rank n is given by 0.1/n. This means we would expect the first, most common word to account for around a tenth of all words in our corpus. The second word (where n is 2) should be around half as frequent; the third word (where n is 3) about a third as frequent; the fourth word (where n is 4) roughly one quarter as frequent; and so on.
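If you prefer to see that as code, here’s a minimal Python sketch of my own (for illustration only; it isn’t how the graphs below were made) that prints the share of a corpus that Zipf’s Law predicts for the first few ranks:

```python
# Zipf's Law: the word of rank n should account for about 0.1/n of all words.
def zipf_share(n: int) -> float:
    """Predicted share of the corpus taken up by the nth most common word."""
    return 0.1 / n

for rank in range(1, 6):
    print(f"word #{rank}: {zipf_share(rank):.1%} of the corpus")

# word #1: 10.0% of the corpus
# word #2: 5.0% of the corpus
# word #3: 3.3% of the corpus
# word #4: 2.5% of the corpus
# word #5: 2.0% of the corpus
```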

Does it hold in reality? Well, here’s a graph generated from the hundred most common words† in the Corpus of Contemporary American English, or COCA,2 to find out:

Linear graph of normalised word frequencies in the Corpus of Contemporary American English, showing a rapid fall-off after the first few words which soon flattens out
Normalised word frequencies plotted on a linear graph. The Zipf’s Law prediction is also shown. (Graph by the author.)

COCA is a collection of texts spanning film, television, fiction, nonfiction, magazines, journals and more from the years 1990 to 2019,3 so it should provide a representative input set for our experiment. The frequency data we get from it is a little lumpy for the first few entries, but the graph soon settles down into a relatively smooth curve. The biggest declines in frequency happen right at the start and lessen in magnitude as we move from more common to less common words (which Zipf’s Law predicts will happen), and by the time we’re looking at the tenth or twentieth word, say, the changes in frequency are very predictable indeed.


In mathematical terms, Zipf’s Law is called an inverse power law, and inverse power laws have a neat property: they show up as straight lines on log-log graphs.* (Take logarithms of both sides of f = 0.1/n and you get log f = log 0.1 − log n, the equation of a straight line with a slope of −1.) That, in turn, makes it easy to see how well a given distribution matches or does not match the predicted values. Let’s try it with our word frequency distribution, moving from linear to logarithmic axes:

Logarithmic graph of normalised word frequencies, showing that word frequencies closely mirror the expected Zipf distribution.
Normalised word frequencies plotted on a log-log graph. The Zipf’s Law prediction is also shown. (Graph by the author.)

There isn’t a perfect agreement between predicted and observed frequencies, but the two lines show substantially the same trend, at least for the hundred words shown on this graph. As expected, then, the billion or so words in COCA2 do indeed appear to obey Zipf’s Law.
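If you’d like to try this at home, a log-log plot along these lines takes only a few lines of Python with matplotlib. Consider this a minimal sketch rather than the code behind the graphs above: the “observed” values here are noisy placeholders, and you would substitute real normalised frequencies from COCA or another corpus.

```python
import numpy as np
import matplotlib.pyplot as plt

ranks = np.arange(1, 101)
predicted = 0.1 / ranks  # Zipf's Law: 0.1/n

# Placeholder "observed" data: Zipfian values with a little noise.
# Substitute real normalised corpus frequencies here.
rng = np.random.default_rng(0)
observed = predicted * rng.uniform(0.8, 1.2, size=ranks.size)

plt.loglog(ranks, observed, ".", label="Observed frequencies")
plt.loglog(ranks, predicted, label="Zipf's Law (0.1/n)")
plt.xlabel("Word rank (log scale)")
plt.ylabel("Normalised frequency (log scale)")
plt.legend()
plt.show()
```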

It’s worth noting that there isn’t a neat, dinner-party–anecdote reason why natural language should behave like this. It’s not true to say that Zipf’s Law has no known explanations — George Kingsley Zipf himself had some ideas, and many authors since have added their own — but none of them reliably explains why Zipf’s Law should be so uncannily predictive.4 In the end, Zipf’s Law is an empirical observation: it was discovered, not invented, and we still don’t know entirely why it works.

All this is to say that when I saw Zipf’s Law in action last time round, I wondered if punctuation might follow either the same distribution or another inverse power law like it. (Spoiler alert: punctuation does follow an inverse power law, although Zipf’s multiplicative factor isn’t quite right. After some experimentation I found that 0.3/n gave a reasonable approximation to the frequency distribution of punctuation.)
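For the curious, here’s one way such a constant could be estimated, sketched in Python under my own assumptions rather than as a record of the experiment above. For a law of the form f(n) = C/n, the quantity log f + log n should be constant and equal to log C, so averaging it gives a least-squares fit for C:

```python
import numpy as np

def fit_zipf_constant(frequencies):
    """Estimate C in f(n) = C/n, given frequencies ordered by rank.

    In log space the model is log f = log C - log n, so with the
    slope fixed at -1 the best-fit log C is the mean of log(f * n).
    """
    f = np.asarray(frequencies, dtype=float)
    n = np.arange(1, f.size + 1)
    return float(np.exp(np.mean(np.log(f * n))))

# With real punctuation data (hypothetical variable), something like:
# c = fit_zipf_constant(punctuation_frequencies)  # ~0.3, per the text
```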

Now, though, I have a new question. What about emoji? I’ve been wondering for a while whether Zipf’s Law applies to emoji, and I’ve finally had some time to put the idea to the test.


My first job was to find some frequency data for emoji. Reliable sources on this subject are hard to come by,‡ so I fell back on a set of statistics§ made available in 2022 by the Unicode Consortium, which manages the emoji lexicon.5

Here are the relative frequencies of the top fifty emoji of the fifteen hundred or so in Unicode’s stats:

Linear graph of normalised emoji frequencies, showing a rapid fall-off after the first few emoji which soon flattens out
Normalised emoji frequencies plotted on a linear graph. The Zipf’s Law prediction is also shown. (Graph by the author.)

You’ll notice a Zipf’s Law curve there, and you’ll also notice that it seems to correspond pretty well to the underlying emoji data. If we again replot our data using log-log axes, we get the familiar straight line of Zipf’s Law and a surprisingly consistent set of emoji data. If anything, Zipf’s Law fits emoji better than it fits the words of the COCA corpus:

Logarithmic graph of normalised emoji frequencies, showing that emoji frequencies closely mirror the expected Zipf distribution.
Normalised emoji frequencies plotted on a log-log graph. The Zipf’s Law prediction is also shown. (Graph by the author.)
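Incidentally, the normalisation behind all of these graphs is nothing more exotic than dividing each raw count by the grand total. A minimal sketch, with made-up counts standing in for the real Unicode figures:

```python
# Hypothetical raw counts; substitute the Unicode Consortium's figures.
emoji_counts = {"😂": 1_000_000, "❤️": 700_000, "🤣": 550_000}

total = sum(emoji_counts.values())
normalised = {emoji: count / total for emoji, count in emoji_counts.items()}
# normalised["😂"] is then the share of all emoji uses that are 😂
```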

The odd thing is that I plotted the Zipf curve on these emoji graphs without even thinking about it. I used exactly the same equation, with no tweaks, as on the word graphs above. Punctuation may not be entirely Zipfian, but emoji are startlingly so — according to my crudely unscientific experiments here, at least.

In Face with Tears of Joy, I argue that emoji are not a language in their own right. Too many things are missing: parts of speech; syntax; even agreed meanings for individual emoji. Conforming to Zipf’s Law doesn’t change any of that. Yet at the same time, emoji are starting to display certain language-like traits. They follow a recognisable word order in some limited contexts.6 They can be used as metaphors, as exclamations, as sounds, as pictures, or as symbols.7 We can string them together to convey more information than they can on their own. That they seem to follow Zipf’s Law on top of all this tells us that there may just be something interesting going on with emoji.

1.
Landini, Gabriel. “Evidence of Linguistic Structure in the Voynich Manuscript Using Spectral Analysis”. Cryptologia 25, no. 4 (October 1, 2001): 275-295. https://doi.org/10.1080/0161-110191889932.

 

2.
wordfrequency.info. Accessed July 18, 2025.

 

3.
wordfrequency.info. “Corpus of Contemporary American English”. Accessed July 17, 2025.

 

4.
Piantadosi, Steven T. “Zipf’s Word Frequency Law in Natural Language: A Critical Review and Future Directions”. Psychonomic Bulletin & Review 21, no. 5 (October 2014): 1112-1130. https://doi.org/10.3758/s13423-014-0585-6.

 

5.

 

6.
Herring, Susan C., and Jing Ge. “Do Emoji Sequences Have a Preferred Word Order?”. In Proceedings of the Twelfth International AAAI Conference on Web and Social Media (ICWSM 2018), 2018.

 

7.
Schnoebelen, Tyler. “Cher Is the Queen of Emoji Even If She Isn’t”. Medium (blog).

 

*
Which you know all about because you have read and internalised chapter 3 of Empire of the Sum, right? 
†
I chose to limit the graphs here to a hundred words each so that the initial decline in frequencies is more easily seen. You’ll have to trust me when I tell you that the hundred and first word, and all those which come after it, also agree pretty well with Zipf’s Law.
‡
A couple of recent developments offer some hope: the Emoji Stats for Bluesky site looks promising, as does Emojipedia’s resuscitated emojitracker.com.
§
I used the same set of Unicode statistics for the Shady Characters Periodic Table of Emoji.

2 comments on “Logarithmical: Zipf’s Law and the mathematics of emoji”

  1. Comment posted by Barb on

    I didn’t get too far in your article before I had a thought/question. As I picked up my kindle to read, I wondered, would you consider a kindle to be a codex or a scroll? Or, as Monty Python would say, something completely different? Hmmmm…

    Thank you.

  2. Comment posted by Keith Houston on

    Hi Barb – that’s a good question! Perhaps both or perhaps neither, although Kindles clearly want to ape the codex.
