A Comedy of Errors: Shakespeare's Neologisms and their varying usage between 1800 and 2000

by Elle Cawtheray

Introduction

William Shakespeare: dramatic virtuoso, cultural icon, and all-round witty dude. Though people have tried, it's hard to overstate the impact of his work on the tropes, attitudes and even words that we possess today. Often, the number of words he invented is cited as a barometer of his unparalleled originality and linguistic skill. All sorts of numbers are thrown around, and while we'll see such claims are very often exaggerated, he unequivocally contributed hundreds of terms to the vocabulary of English speakers.

In this notebook, we investigate the word usage data of Shakespeare's neologisms between the years 1800 and 2000. That is, the data on how frequently published works utilised the words that Shakespeare invented. This time period was chosen as a compromise between temporal (and thus cultural) breadth and quality of data - 200 years is enough to find significant cultural changes, while not reaching so far into the past that the records on published works become sparse and unreliable.

The main sources used in this analysis are listed below. Their usage is signposted appropriately throughout.

  1. https://www.shakespearescoinages.com/coinages (which itself uses the OED and https://proquest.libguides.com/eebopqp)
  2. https://www.shakespeareswords.com/
  3. https://www.oed.com
  4. https://www.etymonline.com/
  5. https://books.google.com/ngrams/ (via https://ngrams.dev/)

There are a number of difficulties, biases and observations that we must acknowledge before we proceed.

  1. Shakespeare's creativity, influence and linguistic legacy are so much more than just his neologisms. This project is not a (positive or negative) commentary on Shakespeare's work, and ultimately his influence is best seen in the stories we tell and the attitudes we hold, rather than his neologisms. Even his influence on the English language is better analysed when considering the terms he popularised rather than invented, as well as the grammatical and narrative structures that he pioneered.
  2. Determining which words Shakespeare actually invented is immensely difficult. There is a wealth of conflicting claims out there on the issue, and wading through it is a thankless task. Mercifully, there is plenty of good data and literature on the issue, of which we make full use.
  3. The unreliability and biased nature of the Google Books Ngrams Dataset pervades our analysis. Our data on word usage over time is sourced from this dataset, which is well-known as a tool for linguistic and social analysis; its limitations are equally well-known. For instance, scientific literature makes up a disproportionate amount of its database. Additionally, it is suspected that there are somewhat significant errors in the data due to the imperfect character recognition software with which it was built.
  4. Some groups of people were/are published far less than others. For instance, in 2020, Richard Jean So and Gus Wezerek collated a list of all 8004 books published between 1950 and 2018 that were held by at least 10 libraries, available digitally, and published by one of Simon & Schuster, Penguin Random House, Doubleday, HarperCollins and Macmillan. They found that a staggering 95% of the authors were white ('Just How White Is the Book Industry?', The New York Times). The data we are using ultimately only speaks to the vocabularies of a small section of broader society.
  5. Words are imprecise, unpredictable things. For instance, it could be that a word Shakespeare invented completely dies out from use, and then 200 years later someone else, entirely independent of Shakespeare, jolts up from bed in the middle of the night to exclaim "Besmirch!"... A sincere example is the word 'biddy', whose first recorded use is in Shakespeare's Twelfth Night. There, it is used to mean 'chicken'. Over 200 years later, the term was used to refer to young Irish women who immigrated to the US and worked as domestic servants (Wimmin, Wimps & Wallflowers: An Encyclopaedic Dictionary of Gender and Sexual Orientation Bias in the United States by Philip Herbst). In this case, it's believed there is a causal link, and that this is another instance of a word referring to an animal mutating into dehumanising slang for women. Generally, however, it will be difficult to determine whether such a link exists. For an example where there is clearly no link, 'ear-piercing' first appears in written English in Shakespeare's Othello to describe the sound of a fife, clearly distinct from the modern meaning of the term. I will refer to such words as repurposed. When it comes to the data analysis, I will remove the repurposed words for which I can see no link to their modern meaning.
  6. I am a beginner. This work is a passion project, whose main purpose is to allow me to practice my research and data analytical skills. I am doing this unsupervised, and will no doubt make mistakes and miss biases.

As a result of the above, the conclusions found here should not be taken particularly seriously, and at the very least should be viewed with an immensely critical eye and a strong awareness of their limitations.

Historically, Shakespeare has been awarded immense credit for his lexical originality. The contemporary academic view is that this has been massively overstated (see, for instance, the influential publication Shakespeare's Vocabulary: Myth and Reality by Hugh Craig, or the work of the Encyclopedia of Shakespeare's Language project at Lancaster University). In this analysis, we are not interested so much in the volume of words Shakespeare invented as in how usage of these words has varied over time. To that end, in choosing the collection of words whose use we analyse, our approach will be to omit any word for which we find even the slightest doubt over its origin.

Identifying Shakespeare's Neologisms

We begin with this 582-term list, compiled in 2018 by H. M. Sénéchal, a doctoral researcher at the University of Birmingham's Shakespeare Institute. It was created "from cross-checking the 1475 words listed in the Oxford English Dictionary (OED) as first recorded in Shakespeare against the corpus of texts available in Early English Books Online". Immediately, there are two obvious ways in which 'imposter' neologisms could feature on this list. Firstly, words that the OED over-excitedly and erroneously credited to Shakespeare. Secondly, words whose earlier uses are simply missing from the Early English Books Online corpus, that corpus being incomplete. There are also a multitude of entries in the list where Shakespeare was merely the first to use an existing word as, say, a verb. One fun example is 'corslet', which since the 14th century had meant a piece of armour that wrapped around the torso. In The Two Noble Kinsmen (published 1634), Shakespeare (or possibly John Fletcher) had a streak of creativity and used 'corslet' to mean embracing somebody. To avoid the immense task of discriminating between uses of a word in its noun form and uses in its verb form, such cases are discounted. However, it is worth acknowledging that Shakespeare demonstrated great rhetorical imagination in many of these cases. This last example also demonstrates another rule we'll be following: the 'new word' appears in a play that was co-written by Shakespeare. Surely, the ghost of John Fletcher has had more than enough of being overshadowed by Shakespeare, so we will be omitting any entries that originate in joint works. Multiple-word phrases will also be excluded, to keep things simple.

Thus, our precise method is as follows. Iterate through the 582 words, using the excellent Shakespearean corpus search engine Shakespeare's Words to identify the play/poem and year in which Shakespeare first used each word. Then, look for any suggestion whatsoever that the word may have been used earlier, using, for example, the Online Etymology Dictionary (note that the goal here is not to confirm beyond all doubt that a word was used earlier, but to narrow down our list to only those words with minimal doubt over their origins).

The end result is below. As far as I can tell, there is no similar published list where all the words have been manually checked, which would make this the definitive list of words first used by Shakespeare for which there is seemingly no evidence of any prior use.

In [1]:
with open('shakespeare-words.txt') as file:
    words = file.read().splitlines()
words
Out[1]:
['abrook',
 'acture',
 'adoptedly',
 'adoptious',
 'agued',
 'airless',
 'allottery',
 'annexment',
 'anthropophaginian',
 'apathaton',
 'appertainment',
 'arm-gaunt',
 'aroint',
 'askance',
 'assubjugate',
 'attask',
 'attributive',
 'back-swordman',
 'barful',
 'baseless',
 'batler',
 'bawcock',
 'beached',
 'be-all',
 'becomed',
 'behowl',
 'bemock',
 'bescreen',
 'besmirch',
 'besort',
 'betrim',
 'betumbled',
 'biddy',
 'birthdom',
 'blastment',
 'boggler',
 'bold-beating',
 'bragless',
 'brisky',
 'bubukle',
 'buck-washing',
 'budger',
 'cannibally',
 'carlot',
 'casted',
 'caudie',
 'cerement',
 'chaffless',
 'chaliced',
 'characterless',
 'childness',
 'chirurgeonly',
 'cital',
 'cloistress',
 'cloud-capped',
 'cloyless',
 'cloyment',
 'cockled',
 'codding',
 'combless',
 'compromised',
 'compunctious',
 'concernancy',
 'concupy',
 'confineless',
 'congree',
 'congreet',
 'conspectuity',
 'copataine',
 'correctioner',
 'corresponsive',
 'crack-hemp',
 'crestless',
 'cuckoo-bud',
 'cyme',
 'definement',
 'defunctive',
 'demi-puppet',
 'denotement',
 'deracinate',
 'derogately',
 'disbench',
 'discandy',
 'discase',
 'disedge',
 'disliken',
 'dislimn',
 'disorb',
 'disproperty',
 'disvouch',
 'dobbin',
 'dotant',
 'down-gyved',
 'droplet',
 'ear-piercing',
 'earthbound',
 'emballing',
 'embrasure',
 'empiricutic',
 'enacture',
 'encompassment',
 'end-all',
 'enmesh',
 'enridged',
 'enschedule',
 'enswathe',
 'entame',
 'eventful',
 'exposture',
 'exsufflicate',
 'extincture',
 'fangless',
 'fat-witted',
 'fedarie',
 'festinately',
 'fishify',
 'fitful',
 'fixture',
 'fleckled',
 'fleshment',
 'footfall',
 'forgetive',
 'full-hearted',
 'fustilarian',
 'gnarled',
 'gratulate',
 'gravel-blind',
 'half-cheek',
 'headshake',
 'hedge-pig',
 'herblet',
 'hewgh',
 'hillo',
 'hodge-pudding',
 'immediacy',
 'immoment',
 'impaint',
 'imperceiverant',
 'impleach',
 'importless',
 'inclip',
 'incorpsed',
 'infamonize',
 'inhoop',
 'inscroll',
 'inshell',
 'insisture',
 'insultment',
 'insuppressive',
 'intertissued',
 'intrenchant',
 'intrince',
 'inventorially',
 'invised',
 'iterance',
 'jaunce',
 'jointress',
 'kecksy',
 'keech',
 'kickie-wickie',
 'lacklustre',
 'land-damn',
 'languageless',
 'laughable',
 'lewdster',
 'lifelings',
 'malicho',
 'mansionry',
 'mappery',
 'marcantant',
 'mid-season',
 'militarist',
 'milk-livered',
 'millioned',
 'minimus',
 'misadventured',
 'misgraffed',
 'mistership',
 'mobled',
 'mockable',
 'moorship',
 'moraller',
 'moulten',
 'nayward',
 'near-legged',
 'noncome',
 'non-regardance',
 'nook-shotten',
 'old-faced',
 'omittance',
 'oneyer',
 'opposeless',
 'ouphe',
 'out-crafty',
 'outdwell',
 'out-Herod',
 'outlustre',
 'outsweeten',
 'out-villain',
 'overglance',
 'overgreen',
 'overperch',
 'overpost',
 'over-red',
 'overripened',
 'overscutched',
 'oversnow',
 'overstink',
 'overteeming',
 'overweathered',
 'pajock',
 'pebbled',
 'pensived',
 'persistency',
 'persistive',
 'phantasim',
 'philippan',
 'phraseless',
 'pilcher',
 'pioned',
 'plighter',
 'plumpy',
 'pole-clipped',
 'posied',
 'preceptial',
 'precurrer',
 'preformed',
 'preyful',
 'primogenitive',
 'primy',
 'pugging',
 'pupil-like',
 'quatch',
 'questant',
 'questrist',
 'razorable',
 'reclusive',
 'recountment',
 'rejoindure',
 'relier',
 'relume',
 'remediate',
 'reprobance',
 'reputeless',
 'restem',
 'resurvey',
 'reverb',
 'reword',
 'ribaudred',
 'rondure',
 'rooky',
 'rootedly',
 'rose-lipped',
 'routed',
 'rubious',
 'rug-headed',
 'rumourer',
 'runnion',
 'scaffoldage',
 'scamel',
 'scrimer',
 'scrippage',
 'sea-wing',
 'sedged',
 'self-abuse',
 'self-assumption',
 'self-harming',
 'self-reproving',
 'semblative',
 'sessa',
 'shard-borne',
 'sheeted',
 'sistering',
 'skyish',
 'sledded',
 'sleided',
 'slickly',
 'slish',
 'small-knowing',
 'snail-slow',
 'sneaping',
 'so-forth',
 'sortance',
 'spectatorship',
 'sprag',
 'sternage',
 'still-stand',
 'stitchery',
 'strewment',
 'successantly',
 'sumless',
 'superscript',
 'superserviceable',
 'suraddition',
 'tanling',
 'thumb-ring',
 'tirrits',
 'toged',
 'torcher',
 'tranect',
 'triumviry',
 'unaching',
 'unaneled',
 'unauspicious',
 'unbonneted',
 'unbookish',
 'unbreeched',
 'unchary',
 'unclaimed',
 'uncolted',
 'uncomprehensive',
 'unconfinable',
 'uncuckolded',
 'uncurbable',
 'undercrest',
 'under-hangman',
 'under-honest',
 'underpeep',
 'under-skinker',
 'uneducated',
 'unexperient',
 'unfilial',
 'unforfeited',
 'ungenitured',
 'ungravely',
 'unimproved',
 'unmeritable',
 'unmitigated',
 'unowed',
 'unpinked',
 'unplausive',
 'unpolicied',
 'unpregnant',
 'unprofited',
 'unqualitied',
 'unrecuring',
 'unseminared',
 'unsex',
 'unshout',
 'unshrubbed',
 'unsisting',
 'unslipping',
 'unsmirched',
 'unstooping',
 'unswayed',
 'untented',
 'untimbered',
 'untreasured',
 'unwedgeable',
 'unweighing',
 'unwrung',
 'up-pricked',
 'uproused',
 'upswarmed',
 'versal',
 'villagery',
 'vizament',
 'wafture',
 'war-proof',
 'war-worn',
 'watch-case',
 'water-rug',
 'wealsman',
 'weather-fend',
 'well-sailing',
 'wenchless',
 'whereuntil',
 'white-handed',
 'widowmaker',
 'windring',
 'winter-ground',
 'yellowing',
 'young-eyed',
 'yravish']

By my count, Shakespeare invented (approximately, and probably at most) 374 words. Sorry, Shakespeare's Birthplace Trust...

In [2]:
len(words)
Out[2]:
374

Now, we remove the words that were repurposed (see 'Words are imprecise, unpredictable things' in the introduction) where there is no reasonable link between Shakespeare's usage and contemporary usage.

In [3]:
for word in ['embrasure', 'ear-piercing', 'footfall', 'keech', 'militarist', 'reverb', 'thumb-ring', 'unrecuring']:
    words.remove(word)
In [4]:
len(words)
Out[4]:
366

We now have our finalised list of 366 words that we will take as having been invented by Shakespeare, among them classics such as "reclusive", creative insults such as "milk-livered", and absurdities like "fishify" (yes, that's to turn into a fish).

Obtaining Data

Our first source of word usage data is the Google Books Ngram Dataset, which counts the number of appearances of every word across the entirety of the enormous Google Books database. While it's certainly possible to scrape the Google Ngrams website directly, this approach runs into problems when issuing multiple GET requests. Instead, we turn to the fantastic Google Ngrams search engine NGRAMS, and utilise its API.

The function below accesses this API, and returns the relative usage of the input word for each of the years 1800 - 2000 (i.e. its usage as a proportion of all the words in the Google Books Ngram Dataset - for example, 'the' made up ~4.45% of all written words published in 2000, so its relative usage would be 0.0445). The function takes into account case-sensitivity issues by totalling the usage counts for the relevant cases - for example, ngrams_data('apple')[1800] would give the combined relative usage in the year 1800 of the 1-grams 'apple', 'Apple', and 'APPLE'.

In [5]:
import requests

years = list(range(1800, 2001))

def ngrams_data(word):
    # Fetch up to three case variants of the word (e.g. 'apple', 'Apple', 'APPLE').
    url = 'https://api.ngrams.dev/eng/search?query=' + word + '&limit=3'
    cases_json = requests.get(url).json()
    word_usage = {year: 0 for year in years}
    # Sum the relative match counts across the case variants, year by year.
    for case in cases_json['ngrams']:
        ngram_id = case['id']    # avoid shadowing the built-in id()
        url = 'https://api.ngrams.dev/eng/' + ngram_id
        usage_json = requests.get(url).json()
        for entry in usage_json['stats']:
            year = entry['year']
            if year in word_usage:
                word_usage[year] += entry['relMatchCount']
    return word_usage
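As a quick illustration of how the function is called (a sketch only - the commented values are placeholders, since the real numbers depend on the live API response):

sample = ngrams_data('apple')
print(sample[1800])                 # combined relative usage of 'apple'/'Apple'/'APPLE' in 1800
print(max(sample, key=sample.get))  # the year in which usage of 'apple' peaked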

Now to use our function to gather the data for our list of Shakespearean words from the API, with respectful 1-second gaps between calls to ngrams_data to spread out our HTTP requests.

In [7]:
import time

all_words_usage = dict()

for word in words:
    all_words_usage[word] = ngrams_data(word)
    time.sleep(1)

The code below saves the data we've scraped as a JSON file, so that we can access the data locally.

In [8]:
import json

with open('shakespeare-usage-data.json', 'w') as file:
    json_string = json.dumps(all_words_usage, indent=4)
    file.write(json_string)
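(As an aside, json.dump can serialise straight to the file handle, skipping the intermediate string - an equivalent one-step version of the cell above:)

with open('shakespeare-usage-data.json', 'w') as file:
    json.dump(all_words_usage, file, indent=4)  # write directly to the file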

The reader may access the JSON file here. To load it into Python, save the file into your working directory and run the code in the cell below.

In [9]:
import json

with open('shakespeare-usage-data.json', 'r') as file:
    all_words_usage = json.load(file)

We want to work with our data in pandas. First, we convert dashes to underscores so that column names are valid Python identifiers (enabling attribute access such as df.be_all later on), and then we compute the biggest and smallest values in the DataFrame so that we can decide whether to scale the data by some factor.

In [10]:
import pandas as pd

all_words_usage2 = { word.replace('-', '_'): data for word, data in all_words_usage.items() }

df = pd.DataFrame(all_words_usage2)

minimum_val = df[df > 0].min(numeric_only=True).min()
maximum_val = df.max().max()
'Minimum value is ' + str(minimum_val) + ', maximum value is ' + str(maximum_val)
Out[10]:
'Minimum value is 4.026767325012956e-11, maximum value is 6.7087908087797715e-06'

Based on these values, scaling by $10^6$ seems logical, especially since we can then label our axes with 'usage per million words'.

In [11]:
df = df.multiply(10**6)
df
Out[11]:
abrook acture adoptedly adoptious agued airless allottery annexment anthropophaginian apathaton ... well_sailing wenchless whereuntil white_handed widowmaker windring winter_ground yellowing young_eyed yravish
1800 0.011905 0.023811 0.003968 0.003968 0.007937 0.071432 0.003968 0.000000 0.007937 0.000000 ... 0.003968 0.003968 0.007937 0.031748 0.000000 0.003968 0.000000 0.055559 0.019842 0
1801 0.000000 0.008101 0.000000 0.000000 0.008101 0.016202 0.004051 0.004051 0.000000 0.000000 ... 0.000000 0.000000 0.008101 0.032405 0.000000 0.000000 0.008101 0.000000 0.028354 0
1802 0.004205 0.029438 0.004205 0.000000 0.008411 0.042054 0.004205 0.012616 0.012616 0.000000 ... 0.000000 0.000000 0.008411 0.004205 0.000000 0.000000 0.004205 0.042054 0.021027 0
1803 0.017290 0.040343 0.011527 0.023053 0.008645 0.031698 0.020172 0.020172 0.005763 0.000000 ... 0.005763 0.002882 0.031698 0.083568 0.000000 0.005763 0.034580 0.020172 0.028816 0
1804 0.026003 0.019502 0.000000 0.000000 0.022753 0.039005 0.000000 0.009751 0.003250 0.000000 ... 0.009751 0.009751 0.003250 0.019502 0.000000 0.000000 0.013002 0.019502 0.022753 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1996 0.001804 0.011498 0.000271 0.000361 0.001308 0.220984 0.000361 0.001308 0.000271 0.000000 ... 0.000180 0.000271 0.000225 0.008657 0.006854 0.001082 0.000316 0.506355 0.001353 0
1997 0.001457 0.011289 0.000182 0.000319 0.001502 0.231798 0.000273 0.000956 0.000228 0.000000 ... 0.000091 0.000319 0.000228 0.007010 0.011563 0.000091 0.000091 0.509663 0.001730 0
1998 0.001610 0.009166 0.000402 0.000447 0.001654 0.233833 0.000671 0.001073 0.000268 0.000089 ... 0.000268 0.000447 0.000626 0.007377 0.027273 0.000760 0.000849 0.511303 0.001252 0
1999 0.001853 0.009396 0.000221 0.000221 0.002867 0.222601 0.000353 0.001191 0.000221 0.000000 ... 0.000132 0.000265 0.000485 0.011205 0.008823 0.000397 0.000221 0.485434 0.001720 0
2000 0.001450 0.007007 0.000725 0.000483 0.002255 0.240398 0.000483 0.001289 0.000765 0.000040 ... 0.000161 0.000322 0.000483 0.010913 0.009664 0.000725 0.000322 0.545708 0.002013 0

201 rows × 366 columns

As a fun first glimpse, as well as a test of the accuracy of our data, here is a graph of the relative usage of 'be-all' vs the relative usage of 'end-all'.

In [12]:
import matplotlib.pyplot as plt

x = years
y1 = df.be_all
y2 = df.end_all

plt.figure(figsize=(12, 5))
plt.plot(x, y1, label='be-all')
plt.plot(x, y2, label='end-all')
plt.xlabel('Year')
plt.ylabel('Usage per million words')
plt.legend()
plt.title('Fig. 1')
plt.show()
[Fig. 1: relative usage of 'be-all' and 'end-all', 1800-2000]

The data is probably worth smoothing for graph readability. For this purpose, we utilise scipy's uniform_filter1d.

In [13]:
from scipy.ndimage import uniform_filter1d

x = years
y1 = uniform_filter1d(df.be_all, size=10)
y2 = uniform_filter1d(df.end_all, size=10)

plt.figure(figsize=(12, 5))
plt.plot(x, y1, label='be-all (smoothed)')
plt.plot(x, y2, label='end-all (smoothed)')
plt.xlabel('Year')
plt.ylabel('Usage per million words')
plt.legend()
plt.title('Fig. 2')
plt.show()
[Fig. 2: smoothed relative usage of 'be-all' and 'end-all', 1800-2000]

As expected, the usage of the two words is well-matched! Curiously enough, be-all edges it at the beginning of our time period. By the end, though, end-all wins out...

Analysis

With everything set up, we can examine the dataset as a whole. Let's look at the total usage of all the words combined, across the entire timespan.

In [14]:
x = years
y1 = df.sum(axis=1)
y2 = uniform_filter1d(y1, size=10)

plt.figure(figsize=(12, 5))
plt.plot(x, y1, label='Total usage')
plt.plot(x, y2, label='Total usage (smoothed)')
plt.xlabel('Year')
plt.ylabel('Usage per million words')
plt.legend()
plt.title('Fig. 3')
plt.show()
[Fig. 3: total usage of all 366 words, raw and smoothed, 1800-2000]

Note that my preference is to plot both the raw data and a smoothed version, so that trends can be identified outside of anomalies, while at the same time the oft-crucial information within anomalies is still visible.
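Since this raw-plus-smoothed pattern recurs in every remaining figure, a small helper along the following lines could reduce the duplication (a sketch only - the cells below keep the explicit version for transparency):

def plot_raw_and_smoothed(x, y, label, title, ylabel='Usage per million words'):
    # Plot the raw series alongside a 10-year moving-average version of it.
    plt.figure(figsize=(12, 5))
    plt.plot(x, y, label=label)
    plt.plot(x, uniform_filter1d(y, size=10), label=label + ' (smoothed)')
    plt.xlabel('Year')
    plt.ylabel(ylabel)
    plt.legend()
    plt.title(title)
    plt.show()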

In Fig. 3, we see usage sharply rise between 1815 and 1850, where it temporarily plateaus. From 1875 to 1975, it gradually falls, though never quite to 1800 levels. Finally, usage sharply rises between 1975 and 2000. Shakespeare's neologisms appear to have serious lasting power, having not just hung around in our vocabulary, but actually having seriously increased in popularity over the last two decades of the period.

There is a noticeable peak around 1850, which deserves further investigation.

In [15]:
'The year with the highest relative usage is ' + str(df.sum(axis=1).idxmax()) + ', when the usage was ' + str(round(df.sum(axis=1).max(), 2)) + ' per million words.'
Out[15]:
'The year with the highest relative usage is 1848, when the usage was 31.36 per million words.'

Who are the biggest contributors to this anomalously enormous year? A sensible course of action would be to plot a bar graph of the usage of each word in 1848, except that it would have 366 bars. To narrow this down, we identify the word with the largest usage in 1848, and then filter out every word whose usage was less than 10% of that maximum.

In [16]:
highest_word = df.loc['1848'].idxmax()
highest_word_usage = df.loc['1848'].max()

"The word with the highest relative usage in 1848 is '" + highest_word + "', whose usage was " + str(round(highest_word_usage, 2)) + ' per million words.'
Out[16]:
"The word with the highest relative usage in 1848 is 'dobbin', whose usage was 4.62 per million words."
In [17]:
df.loc['1848'][df.loc['1848'] >= 0.462]
Out[17]:
baseless       0.825959
biddy          0.877815
compromised    2.478811
dobbin         4.618923
eventful       4.218090
fitful         1.735074
fixture        0.574154
gnarled        0.600315
laughable      0.984797
routed         4.045704
unclaimed      1.123547
uneducated     2.059291
unimproved     1.181476
unmitigated    1.359469
Name: 1848, dtype: float64

An important fact jumps out at once. In 1848, it is really only the usage of 14 words that contributes to the total. In general, when looking at the total usage of all Shakespeare's neologisms, those few exceptional words that are most used will dominate the data, and the more obscure words will be entirely obscured. Is this what we want? On the one hand, the most-used words are certainly in some sense the most important or relevant. On the other hand, the variation in usage of the more obscure words will tell its own story, which I for one would like to hear.

Currently, this story is being drowned out by the roar of 'dobbin' and 'eventful', which we circumvent by normalising the data. By this, I mean scaling the data for each individual word so that they are all on the same scale, and thus each have an equal impact on the summed data. For instance, suppose we have a word that is used just once in year 1, and another word that is used 1000 times. The following year, the first word is used twice and the second word is used 500 times. In the normalised data, the effects of the two words will cancel out, since one has doubled and one has halved. Instead of looking at the absolute change of usage for each word, we are considering the proportional change of usage. Thinking geometrically, we are essentially combining the shapes of all the individual word usage graphs, and ignoring their individual size. We do this by dividing the data for each word by (an estimation of) the integral of the word's graph (which is essentially that word's total usage over all years).
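Before applying this to the real data, here is a quick sanity check of the toy example above, with made-up numbers (the two series are purely illustrative, not from our dataset):

from scipy import integrate

def normalise(series):
    # Divide each value by the series' trapezoidal integral.
    total = integrate.trapezoid(series)
    return [value / total for value in series]

rare, common = [1, 2], [1000, 500]  # one word doubles, the other halves
totals = [a + b for a, b in zip(normalise(rare), normalise(common))]
print(totals)  # [2.0, 2.0] - the normalised total is flat, as claimed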

In [18]:
from scipy import integrate

norm_words_usage = dict()

for word in all_words_usage:
    word_usage = dict()
    integral = integrate.trapezoid(list(all_words_usage[word].values()))
    if integral == 0:
        # Skip words with no recorded usage at all; this is why the normalised
        # DataFrame has fewer columns (360) than the original (366).
        continue
    for year in years:
        word_usage[year] = all_words_usage[word][str(year)] / integral
    norm_words_usage[word] = word_usage
    
norm_words_usage2 = { word.replace('-', '_'): data for word, data in norm_words_usage.items() }

ndf = pd.DataFrame(norm_words_usage2)
ndf
Out[18]:
abrook acture adoptedly adoptious agued airless allottery annexment anthropophaginian apathaton ... weather_fend well_sailing wenchless whereuntil white_handed widowmaker windring winter_ground yellowing young_eyed
1800 0.013545 0.003649 0.008795 0.007124 0.007094 0.002377 0.006344 0.000000 0.015998 0.000000 ... 0.000000 0.013022 0.014132 0.010146 0.008428 0.000000 0.016377 0.000000 0.000973 0.006026
1801 0.000000 0.001241 0.000000 0.000000 0.007241 0.000539 0.006475 0.005375 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.010356 0.008603 0.000000 0.000000 0.009110 0.000000 0.008611
1802 0.004785 0.004511 0.009320 0.000000 0.007518 0.001399 0.006723 0.016741 0.025429 0.000000 ... 0.000000 0.000000 0.000000 0.010751 0.001116 0.000000 0.000000 0.004729 0.000737 0.006386
1803 0.019671 0.006182 0.025545 0.041384 0.007727 0.001055 0.032247 0.026766 0.011616 0.000000 ... 0.000000 0.018912 0.010262 0.040520 0.022185 0.000000 0.023784 0.038887 0.000353 0.008751
1804 0.029584 0.002988 0.000000 0.000000 0.020337 0.001298 0.000000 0.012939 0.006551 0.000000 ... 0.000000 0.031997 0.034724 0.004155 0.005177 0.000000 0.000000 0.014621 0.000342 0.006910
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1996 0.002052 0.001762 0.000600 0.000648 0.001169 0.007354 0.000577 0.001735 0.000545 0.000000 ... 0.000000 0.000592 0.000963 0.000288 0.002298 0.030538 0.004466 0.000355 0.008872 0.000411
1997 0.001657 0.001730 0.000404 0.000572 0.001343 0.007714 0.000437 0.001268 0.000459 0.000000 ... 0.000228 0.000299 0.001135 0.000291 0.001861 0.051520 0.000376 0.000102 0.008930 0.000525
1998 0.001831 0.001404 0.000892 0.000803 0.001479 0.007781 0.001072 0.001424 0.000541 0.009079 ... 0.000449 0.000880 0.001592 0.000800 0.001958 0.121523 0.003137 0.000955 0.008959 0.000380
1999 0.002108 0.001440 0.000489 0.000396 0.002563 0.007408 0.000564 0.001580 0.000445 0.000000 ... 0.000000 0.000434 0.000943 0.000620 0.002975 0.039313 0.001638 0.000248 0.008506 0.000522
2000 0.001649 0.001074 0.001606 0.000867 0.002016 0.008000 0.000772 0.001710 0.001542 0.004088 ... 0.000404 0.000529 0.001147 0.000618 0.002897 0.043062 0.002991 0.000362 0.009562 0.000611

201 rows × 360 columns

In [19]:
x = years
y1 = ndf.sum(axis=1)
y2 = uniform_filter1d(y1, size=10)

plt.figure(figsize=(12, 5))
plt.plot(x, y1, label='Total normalised usage')
plt.plot(x, y2, label='Total normalised usage (smoothed)')
plt.xlabel('Year')
plt.ylabel('Normalised usage')
plt.legend()
plt.title('Fig. 4')
plt.show()
[Fig. 4: total normalised usage, raw and smoothed, 1800-2000]

Clear as day, we see that from 1925 onwards, this normalised usage is consistently minuscule. This suggests that most of Shakespeare's neologisms have all but entirely fallen out of usage. The relatively high post-1925 total (non-normalised) usage (Fig. 3) tells us that the words which have managed to persevere have become popular enough to compensate for the drop-off from the other words.

We now verify these observations by referring back to the original data, counting how many of the neologisms contribute at least 0.01% of the total usage of all the words in each year.

In [21]:
contribution_count = dict()

for year in years:
    contribution_count[year] = 0
    data = df.loc[str(year)]
    total = data.sum()
    # Count the words contributing at least 0.01% of this year's total usage.
    for word in df.columns:
        if data[word] / total >= 0.0001:
            contribution_count[year] += 1

x = years
y1 = [contribution_count[year] for year in years]
y2 = uniform_filter1d(y1, size=10)

plt.figure(figsize=(12, 5))
plt.plot(x, y1, label='Contributions over 0.01%')
plt.plot(x, y2, label='Contributions over 0.01% (smoothed)')
plt.xlabel('Year')
plt.ylabel('Number of words')
plt.legend()
plt.title('Fig. 5')
plt.show()
[Fig. 5: number of words contributing over 0.01% of total usage, raw and smoothed, 1800-2000]
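(For what it's worth, the counting loop above can be vectorised in pandas - a sketch that should produce the same counts:)

# Divide each year's row by that year's total, then count the words at or
# above the 0.01% threshold.
y1_alternative = (df.div(df.sum(axis=1), axis=0) >= 0.0001).sum(axis=1)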

Indeed, we see that by 1990, it seems that fewer than 1 in 3 of Shakespeare's neologisms remain even vaguely in use. We can't claim this definitively yet, as we have to accept the possibility that the collection of individual words contributing over 0.01% is changing violently year on year. This is something we will rule out later on (see Fig. 6).

We now turn our attention to the most popular words. Which of the words was used most over the entire period? Which words rose to the tips of people's tongues in each decade, and how often have the most popular words changed?

We begin the only place we can - with the top 10 of all time!

In [22]:
total_df = df.groupby(lambda year : 'Total').aggregate('sum')
most_popular_words = list(total_df.sort_values(by='Total', ascending=False, axis=1).iloc[0][:10].keys())
most_popular = {n+1: most_popular_words[n] for n in range(10)}
most_popular
Out[22]:
{1: 'routed',
 2: 'fixture',
 3: 'eventful',
 4: 'compromised',
 5: 'unimproved',
 6: 'uneducated',
 7: 'unclaimed',
 8: 'fitful',
 9: 'laughable',
 10: 'unmitigated'}
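(An equivalent, arguably simpler route to the same ranking, as a sketch: sum each word's column over all years, then take the 10 largest.)

df.sum().sort_values(ascending=False).head(10).index.tolist()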

Now we look at the top 10 for each decade, how regularly it changes, and which words have been the most ubiquitous. (For instance, has 'fixture' been a fixture, and has 'routed' remained rooted to the top?)

To start this analysis, we group our DataFrame by decade.

In [23]:
decades_df = df.groupby(lambda year : int(year) // 10 * 10).aggregate('sum')
decades_df = decades_df[:-1]  # drop the final row: the '2000s' here contain only the year 2000
decades_df
Out[23]:
abrook acture adoptedly adoptious agued airless allottery annexment anthropophaginian apathaton ... well_sailing wenchless whereuntil white_handed widowmaker windring winter_ground yellowing young_eyed yravish
1800 0.125836 0.229156 0.117106 0.099002 0.135512 0.447831 0.133111 0.136349 0.101356 0.000000 ... 0.066944 0.062834 0.189186 0.567419 0.002457 0.021887 0.277407 0.186077 0.201837 0
1810 0.069620 0.182915 0.038053 0.043639 0.102913 0.257031 0.059557 0.040443 0.047675 0.000000 ... 0.030753 0.023845 0.069874 0.318453 0.001352 0.008041 0.071665 0.123535 0.208261 0
1820 0.077230 0.265321 0.054080 0.104531 0.102090 0.335159 0.096987 0.083020 0.090400 0.000000 ... 0.054187 0.041252 0.120274 0.219643 0.003537 0.014375 0.119239 0.176530 0.316322 0
1830 0.056054 0.211788 0.022257 0.019566 0.066333 0.497672 0.026099 0.053116 0.024556 0.000000 ... 0.017332 0.009198 0.053611 0.182956 0.000586 0.000525 0.031962 0.201210 0.290415 0
1840 0.046676 0.255265 0.038959 0.037915 0.082556 0.402954 0.036315 0.034955 0.024947 0.000000 ... 0.021164 0.019044 0.046106 0.244592 0.000000 0.007480 0.067720 0.323006 0.320240 0
1850 0.054264 0.300287 0.039217 0.048202 0.096901 0.572581 0.043565 0.056476 0.050578 0.000000 ... 0.025577 0.026732 0.077952 0.298594 0.001531 0.023468 0.071834 0.469119 0.339685 0
1860 0.063534 0.314100 0.031506 0.044709 0.077260 0.847712 0.041609 0.062253 0.042818 0.000334 ... 0.021691 0.022936 0.056685 0.253320 0.000000 0.017072 0.044458 0.842497 0.296733 0
1870 0.047888 0.347075 0.019248 0.024753 0.069474 0.841320 0.023288 0.055481 0.018779 0.000000 ... 0.010149 0.012937 0.034044 0.269365 0.000000 0.024374 0.037698 1.080145 0.234111 0
1880 0.050335 0.388131 0.022988 0.032714 0.074478 1.083990 0.042682 0.048881 0.029995 0.001848 ... 0.016356 0.016743 0.037137 0.277957 0.000000 0.015142 0.053703 1.260023 0.251133 0
1890 0.037446 0.399386 0.016582 0.025774 0.045240 0.990050 0.033199 0.031005 0.013320 0.002326 ... 0.010276 0.010533 0.021907 0.238308 0.001607 0.023586 0.027304 1.951139 0.197259 0
1900 0.053606 0.440593 0.018555 0.032191 0.048018 1.211849 0.030625 0.029362 0.022641 0.002871 ... 0.011921 0.011574 0.027948 0.172610 0.000484 0.022224 0.028665 2.236586 0.171059 0
1910 0.062215 0.592588 0.004141 0.004439 0.020159 1.516580 0.010929 0.013731 0.004391 0.000609 ... 0.002945 0.002862 0.007588 0.116337 0.000762 0.008264 0.009672 3.113960 0.098865 0
1920 0.027583 0.461001 0.004921 0.009146 0.034772 2.245588 0.009173 0.013599 0.004220 0.000297 ... 0.003298 0.004097 0.007340 0.116549 0.002413 0.008051 0.008047 4.520962 0.104841 0
1930 0.024788 0.284238 0.004743 0.004155 0.027449 3.299806 0.009331 0.016381 0.003283 0.000177 ... 0.002116 0.003105 0.004758 0.078832 0.004555 0.008462 0.007334 6.224235 0.078357 0
1940 0.010760 0.367722 0.004611 0.004684 0.025031 2.134198 0.006840 0.013079 0.003841 0.000609 ... 0.002700 0.002877 0.007813 0.058318 0.014969 0.007819 0.006484 6.316262 0.049197 0
1950 0.017220 0.291674 0.004765 0.009273 0.024348 2.689593 0.008778 0.016524 0.003988 0.000118 ... 0.004270 0.004430 0.010587 0.059736 0.006977 0.014245 0.007110 6.534719 0.040380 0
1960 0.015439 0.313755 0.003928 0.006070 0.026537 3.575233 0.005707 0.018768 0.003499 0.000456 ... 0.001520 0.001875 0.005219 0.065084 0.015007 0.007511 0.007660 5.830249 0.045509 0
1970 0.015118 0.379616 0.002484 0.002717 0.024720 2.270413 0.003438 0.009954 0.003521 0.000000 ... 0.001073 0.001619 0.002461 0.097191 0.025974 0.003911 0.004394 5.037775 0.025779 0
1980 0.012333 0.353437 0.002035 0.002105 0.019832 2.441097 0.002673 0.009803 0.002687 0.000000 ... 0.001055 0.001545 0.002108 0.063114 0.041790 0.002526 0.003085 5.281333 0.015697 0
1990 0.016236 0.156210 0.002662 0.003205 0.017977 2.304817 0.003374 0.009802 0.003224 0.000183 ... 0.001325 0.002604 0.003417 0.078890 0.095592 0.004983 0.003633 5.118048 0.016051 0

20 rows × 366 columns

From this, we may extract the top 10 for each decade.

In [24]:
most_popular_by_decade = dict()

for i in range(20):
    decade = 1800 + i*10
    popular_words = list(decades_df.sort_values(by=decade, ascending=False, axis=1).iloc[i][:10].keys())
    most_popular_by_decade[str(decade) + 's'] = {n+1: popular_words[n] for n in range(10)}

pd.DataFrame(most_popular_by_decade)
Out[24]:
1800s 1810s 1820s 1830s 1840s 1850s 1860s 1870s 1880s 1890s 1900s 1910s 1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s
1 routed routed routed eventful eventful eventful eventful eventful eventful eventful fixture fixture fixture fixture fixture fixture fixture fixture fixture compromised
2 eventful eventful eventful routed routed routed routed routed routed routed unimproved unimproved unimproved unclaimed routed routed routed compromised compromised fixture
3 laughable compromised compromised compromised compromised compromised compromised uneducated compromised compromised compromised routed routed routed unclaimed unclaimed compromised droplet droplet droplet
4 unimproved unclaimed uneducated uneducated uneducated uneducated uneducated compromised unimproved unimproved routed compromised unclaimed unimproved unimproved compromised droplet routed routed routed
5 compromised unimproved laughable fitful fitful fitful fitful fitful uneducated fixture eventful unclaimed compromised compromised compromised unimproved uneducated immediacy immediacy immediacy
6 unclaimed laughable unclaimed laughable unmitigated unmitigated unmitigated unclaimed fitful uneducated unclaimed eventful uneducated eventful preformed droplet unclaimed unclaimed unclaimed unclaimed
7 biddy uneducated fitful unimproved unclaimed unimproved unimproved unimproved unclaimed unclaimed uneducated uneducated eventful uneducated eventful preformed unimproved uneducated superscript uneducated
8 uneducated biddy unimproved unmitigated unimproved unclaimed unclaimed unmitigated fixture fitful persistency persistency persistency carlot uneducated uneducated immediacy unimproved preformed superscript
9 baseless baseless baseless unclaimed laughable laughable baseless fixture persistency persistency fitful askance carlot yellowing carlot immediacy preformed preformed uneducated preformed
10 versal fitful unmitigated baseless dobbin baseless laughable persistency baseless baseless askance fitful askance persistency yellowing eventful eventful superscript unimproved unimproved

'Routed' does indeed do consistently well, if not exclusively top-of-the-pile. In contrast, 'fixture' fails to make an appearance until the 1870s!

As fun as such a tongue-in-cheek qualitative analysis can be, we of course wish to quantify our findings. We proceed by counting the number of times each word makes an appearance in a decade's top 10.

In [25]:
decades_pop_words = set()
for decade in most_popular_by_decade:
    for word in list(most_popular_by_decade[decade].values()):
        decades_pop_words.add(word)

pop_count = dict()
for word in decades_pop_words:
    pop_count[word] = 0
    for decade in most_popular_by_decade:
        for appearance in list(most_popular_by_decade[decade].values()):
            if word == appearance:
                pop_count[word] += 1

dict(sorted(pop_count.items(), key = lambda item: item[1], reverse = True))
Out[25]:
{'compromised': 20,
 'routed': 20,
 'unclaimed': 20,
 'unimproved': 20,
 'uneducated': 20,
 'eventful': 17,
 'fixture': 13,
 'fitful': 11,
 'baseless': 8,
 'laughable': 7,
 'persistency': 7,
 'unmitigated': 6,
 'preformed': 6,
 'immediacy': 5,
 'droplet': 5,
 'superscript': 3,
 'carlot': 3,
 'askance': 3,
 'biddy': 2,
 'yellowing': 2,
 'dobbin': 1,
 'versal': 1}

There are 5 words - compromised, routed, unclaimed, unimproved and uneducated - that have consistently been the most popular of Shakespeare's inventions. Personally, I'm glad that 'askance' and 'dobbin' had only a brief day in the sun.

This got me thinking: how often does the top 10 change? It's simple enough to look at the number of changes from one decade to the next:

In [26]:
def how_many_changes(list1, list2):
    count = 0
    for word in list1:
        if word not in list2:
            count += 1
    return count

lists = []
for decade in most_popular_by_decade:
    lists.append(list(most_popular_by_decade[decade].values()))

changes = []
for i in range(19):
    changes.append(how_many_changes(lists[i], lists[i+1]))

import statistics
'The mean number of changes is ' + str(round(statistics.mean(changes), 2)) + '. The number of changes between consecutive decades is ' + str(changes) + '.'
Out[26]:
'The mean number of changes is 0.74. The number of changes between consecutive decades is [1, 1, 0, 1, 1, 0, 2, 1, 0, 1, 0, 1, 1, 1, 2, 0, 1, 0, 0].'
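(Since consecutive top 10s have the same length and contain no duplicates, the number of departures equals the number of arrivals, and the helper could equally be written with sets - a sketch:)

def how_many_changes_set(list1, list2):
    # Words present in list1 but absent from list2.
    return len(set(list1) - set(list2))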

The top 10 is actually incredibly stable, reflecting the fact that our mainstream vocabulary changes very slowly (or, at least, it did prior to the information age).

This seems an interesting measure of how dynamic our data is, i.e. how much the dominant words change. In this pursuit, we scale up our method to look at the number of changes per year to the top 100 of Shakespeare's neologisms (our choice to consider the top 100 being motivated by Fig. 5).

In [27]:
most_popular_by_year = dict()

for i in range(201):
    year = 1800 + i
    popular_words = list(df.sort_values(by=str(year), ascending=False, axis=1).iloc[i][:100].keys())
    most_popular_by_year[year] = {n+1: popular_words[n] for n in range(100)}

lists = []
for year in most_popular_by_year:
    lists.append(list(most_popular_by_year[year].values()))

changes = []
for i in range(200):
    changes.append(how_many_changes(lists[i], lists[i+1]))

x = years[1:]
y1 = changes
y2 = uniform_filter1d(y1, size=10)

plt.figure(figsize=(12, 5))
plt.plot(x, y1, label='Changes to top 100')
plt.plot(x, y2, label='Changes to top 100 (smoothed)')
plt.xlabel('Year')
plt.ylabel('Number of words')
plt.legend()
plt.title('Fig. 6')
plt.show()
[Fig. 6: year-on-year changes to the top 100, raw and smoothed, 1801-2000]

In the early parts of the 19th century, the usage of Shakespeare's neologisms was highly dynamic, with over a third of the top 100 changing in a given year. As time goes on, things get more and more stable, and by the end, the words that are most popular are pretty firmly settled.

Note that this confirms our speculations after Fig. 5; by the latter stages of the 20th century, all but the top 100 or so words had pretty much died out of use.

Conclusions

  1. Shakespeare invented nothing like as many words as is so often claimed.
  2. The words he created have, as a whole, enjoyed consistent usage across the 19th and 20th centuries, and their overall popularity has in fact increased of late (i.e. in the latter parts of the 20th century).
  3. Even in the early parts of the 19th century, 200 years on from their invention, Shakespeare's neologisms were almost all still being featured in published works. At that point, there was a relatively high amount of change in which words were favoured. This decreased over time, and by the late 20th century only around 1 in 3 of Shakespeare's words remained alive in the English language.
  4. A handful of the neologisms (compromised, routed, unclaimed, unimproved, uneducated, eventful, and fixture) have had their usage dominate over the others, across the entire timespan.

Thanks for reading! 🧡