Counting the vocabulary in French novels

How much vocabulary do you need to read French novels? Do some novels contain much more vocabulary than others, making them harder to read? Let’s look into this by counting the distinct vocabulary words in 3 well-known French novels: ‘La Bête humaine’ by Zola, ‘Madame Bovary’ by Flaubert, and ‘Les liaisons dangereuses’ by de Laclos.

How to count the vocabulary in a novel

Counting vocabulary sounds simple: you just write a computer program to count all the different words in a novel, right? Actually, it is a bit more complicated than that. The same vocabulary word often appears in different forms. For example, the singular and plural forms of a noun (such as baguette and baguettes) should be counted as a single vocabulary word when both appear in a novel. Likewise, the different conjugated forms of the same verb should count as a single vocabulary word. Finally, French adjectives take different forms depending on whether the noun they describe is singular or plural, masculine or feminine. If several forms of the same adjective appear in a novel (such as joli, jolie, jolis, jolies), we want to count them as a single vocabulary word.

Standardizing words:

  • Nouns converted to singular form: libertés → liberté
  • Adjectives converted to the shortest form: jolies → joli
  • Verbs converted to base form: écoutaient → écouter
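The counting procedure described above can be sketched in a few lines of Python. This is only an illustration, not the program used for this article: the `LEMMAS` table is a tiny hand-written stand-in for a real French lemmatizer or morphological lexicon, and the names `tokenize` and `count_vocabulary` are hypothetical.

```python
import re
import unicodedata

# Hypothetical lookup table for illustration only. A real analysis would
# need a full French lemmatizer covering every noun, adjective and verb form.
LEMMAS = {
    "libertés": "liberté",
    "baguettes": "baguette",
    "jolie": "joli",
    "jolis": "joli",
    "jolies": "joli",
    "écoutaient": "écouter",
    "écoute": "écouter",
}

def tokenize(text):
    """Lowercase the text and return its runs of letters as words."""
    # NFC normalization keeps accented letters as single characters.
    normalized = unicodedata.normalize("NFC", text.lower())
    # [^\W\d_]+ matches one or more letters, accented letters included.
    return re.findall(r"[^\W\d_]+", normalized)

def count_vocabulary(text):
    """Return (total words, unique raw forms, unique standardized forms)."""
    words = tokenize(text)
    raw_forms = set(words)
    # Words missing from the table are kept unchanged.
    standardized = {LEMMAS.get(word, word) for word in words}
    return len(words), len(raw_forms), len(standardized)

sample = "Les jolies baguettes et la jolie baguette. Ils écoutaient, elle écoute."
total, raw, standardized = count_vocabulary(sample)
print(total, raw, standardized)
```

In this toy sentence every word form is different, so the raw count equals the total, while standardizing merges jolies/jolie, baguettes/baguette and écoutaient/écoute. The quality of the final count depends entirely on how complete the lemma table or lemmatizer is.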


Novel 1, La Bête humaine:
  • Novel length: 117491 words
  • Unique words (before standardizing): 8069
  • Unique words (after standardizing): 4686
Novel 2, Les liaisons dangereuses:
  • Novel length: 127232 words
  • Unique words (before standardizing): 7087
  • Unique words (after standardizing): 3639
Novel 3, Madame Bovary:
  • Novel length: 100442 words
  • Unique words (before standardizing): 9676
  • Unique words (after standardizing): 5999

We see that ‘Les liaisons dangereuses’ is the longest of the 3 novels by word count, yet it contains the smallest vocabulary. In contrast, ‘Madame Bovary’ is the shortest of the 3 novels by word count, yet it contains by far the largest vocabulary. This may mean that ‘Madame Bovary’ is better suited to very advanced learners of French, while the other two novels might be more approachable for intermediate learners. We all know that reading a novel in a foreign language becomes less fun when one constantly has to look up the definitions of words.
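To make the comparison between length and vocabulary concrete, here is a small Python sketch that relates the two using the counts reported above. The "distinct words per 1000 words" measure and the names `novels`, `per_thousand` and `density` are illustrative assumptions, not part of the original analysis. The measure is only a rough guide, since it is meaningful mainly for texts of similar length, as is the case here.

```python
# Figures copied from the counts reported for the three novels.
novels = {
    "La Bête humaine": {"length": 117491, "raw": 8069, "standardized": 4686},
    "Les liaisons dangereuses": {"length": 127232, "raw": 7087, "standardized": 3639},
    "Madame Bovary": {"length": 100442, "raw": 9676, "standardized": 5999},
}

def per_thousand(counts):
    """Standardized vocabulary words per 1000 words of text."""
    return 1000 * counts["standardized"] / counts["length"]

density = {title: per_thousand(counts) for title, counts in novels.items()}

# Print the novels from the least to the most vocabulary-dense.
for title, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{title}: {value:.1f} distinct words per 1000 words")
```

By this measure ‘Madame Bovary’ comes out at roughly twice the value of ‘Les liaisons dangereuses’, with ‘La Bête humaine’ in between.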

Vocabulary overlaps among the 3 novels

Let’s have a look at the overlaps in vocabulary between the three novels:

There are 1833 vocabulary words which are common to all 3 novels. Unsurprisingly, these include pronouns (je, tu, nous, vous), common nouns (victoire, mensonge, plaisanterie, lumière, préoccupation) and many common verbs, such as plaisanter, plaindre, souhaiter, réclamer, déshabiller, écrire, séduire, cacher, éclairer, entreprendre.

There are 1292 vocabulary words which are found in ‘La Bête humaine’ but not in the other two novels. These include: docilité, griffe, endiablé, effréné, wagon.

There are 908 vocabulary words which are found in ‘Les liaisons dangereuses’ but not in the other two novels. These include: apens, laurier, déshonneur, déplorable, indisposition.

Finally, there are a full 2443 vocabulary words which are found in ‘Madame Bovary’ but not in the other two novels. Some of these words are quite uncommon and would be difficult even for many native French speakers. For example:

  • lorgnette: small binoculars or a small telescope formerly used in theaters
  • soutane: long robe worn by clergymen
  • tison: a partly burnt piece of wood
  • candélabre: a large candlestick with multiple branches
  • calicot: a rough cotton fabric

But not all of these 2443 words are rare and difficult. Many are simple words that just happen not to appear in the other two novels. For example: écharpe, rhum, cathédrale, orgue, serviette.
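Overlaps of this kind can be computed with ordinary set operations once each novel has been reduced to a set of standardized words. The sketch below is a minimal illustration in Python: the function name `overlap_report` is hypothetical, and the three tiny sets are toy samples built from words quoted in this section, not the real vocabularies.

```python
def overlap_report(vocabularies):
    """Split vocabularies into shared and novel-specific words.

    vocabularies maps each title to its set of standardized words.
    Returns (words common to all titles, {title: words found only there}).
    """
    common = set.intersection(*vocabularies.values())
    exclusive = {}
    for title, vocab in vocabularies.items():
        # Union of the vocabularies of all the other novels.
        others = set().union(
            *(v for t, v in vocabularies.items() if t != title)
        )
        exclusive[title] = vocab - others
    return common, exclusive

# Toy samples for illustration only; the real sets hold thousands of words.
vocabularies = {
    "La Bête humaine": {"écrire", "lumière", "wagon", "griffe"},
    "Les liaisons dangereuses": {"écrire", "lumière", "laurier", "déshonneur"},
    "Madame Bovary": {"écrire", "lumière", "soutane", "tison"},
}

common, exclusive = overlap_report(vocabularies)
print(len(common), {title: len(words) for title, words in exclusive.items()})
```

Words shared by exactly two novels fall into neither group, which is why the common and novel-specific counts for a novel do not add up to its total vocabulary.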

In case you are wondering whether the publication dates have something to do with all this, here they are: ‘La Bête humaine’ and ‘Madame Bovary’ are both 19th-century novels, published in 1890 and 1856 respectively. The third novel, ‘Les liaisons dangereuses’, is an 18th-century novel published in 1782. It would appear that the publication dates do not play a role in the vocabulary differences we have observed.


Part of the difficulty in reading literature, particularly in a foreign language, comes from the vocabulary. Looking up words breaks the flow of reading and often reduces the enjoyment the reader gets from the novel. We have seen that the amount of vocabulary varies significantly between classic French novels. While it is always nice to learn new vocabulary, one should keep in mind that, for reading enjoyment, it is best to choose novels for which one does not have to look up too many word definitions.