Working with text


This tutorial is available for interactive use with IPython Notebook: Working with text.ipynb.

Creating a document-term matrix

Word (or n-gram) frequencies are typical units of analysis when working with text collections. It may come as a surprise that reducing a book to a list of word frequencies retains useful information, but practice has shown this to be the case. Treating texts as a list of word frequencies (a vector) also makes available a vast range of mathematical tools developed for studying and manipulating vectors.


Turning texts into unordered lists (or “bags”) of words is easy in Python. Python Programming for the Humanities includes a chapter entitled Text Processing that describes the steps in detail.

This tutorial assumes some prior exposure to text analysis so we will gather word frequencies (or term frequencies) associated with texts into a document-term matrix using the CountVectorizer class from the scikit-learn package. (For those familiar with R and the tm package, this function performs the same operation as DocumentTermMatrix and takes recognizably similar arguments.)

First we need to import the functions and classes we intend to use, along with the customary abbreviation for functions in the numpy package.

In [1]: import numpy as np  # a conventional alias

In [2]: from sklearn.feature_extraction.text import CountVectorizer

Now we use the CountVectorizer class to create a document-term matrix. CountVectorizer is customizable. For example, a list of “stop words” can be specified with the stop_words parameter. Other important parameters include:

  • lowercase (default True) convert all text to lowercase before tokenizing
  • min_df (default 1) remove terms from the vocabulary that occur in fewer than min_df documents (in a large corpus this may be set to 15 or higher to eliminate very rare words)
  • vocabulary ignore words that do not appear in the provided list of words
  • strip_accents remove accents
  • token_pattern (default u'(?u)\b\w\w+\b') regular expression identifying tokens–by default words that consist of a single character (e.g., ‘a’, ‘2’) are ignored, setting token_pattern to '(?u)\b\w+\b' will include these tokens
  • tokenizer (default unused) use a custom function for tokenizing

For this example we will use texts by Jane Austen and Charlotte Brontë. These texts are available in Datasets.

In [3]: filenames = ['data/austen-brontë/Austen_Emma.txt',
   ...:              'data/austen-brontë/Austen_Pride.txt',
   ...:              'data/austen-brontë/Austen_Sense.txt',
   ...:              'data/austen-brontë/CBronte_Jane.txt',
   ...:              'data/austen-brontë/CBronte_Professor.txt',
   ...:              'data/austen-brontë/CBronte_Villette.txt']

In [4]: vectorizer = CountVectorizer(input='filename')

In [5]: dtm = vectorizer.fit_transform(filenames)  # a sparse matrix

In [6]: vocab = vectorizer.get_feature_names()  # a list

Now we have a document-term matrix and a vocabulary list. Before we can query the matrix and find out, for example, how many times the word ‘house’ occurs in Emma (the first text in filenames), we need to convert this matrix from its current format, a sparse matrix, into a normal NumPy array. We will also convert the Python list storing our vocabulary, vocab, into a NumPy array, as an array supports a greater variety of operations than a list.

# for reference, note the current class of `dtm`
In [7]: type(dtm)
Out[7]: scipy.sparse.csr.csr_matrix

In [8]: dtm = dtm.toarray()  # convert to a regular array

In [9]: vocab = np.array(vocab)


A sparse matrix only records non-zero entries and is used to store matrices that contain a significant number of entries that are zero. To understand why this matters enough that CountVectorizer returns a sparse matrix by default, consider a 4000 by 50000 matrix of word frequencies that is 60% zeros. In Python an integer takes up four bytes, so using a sparse matrix saves almost 500M of memory, which is a considerable amount of computer memory in the 2010s. (Recall that Python objects such as arrays are stored in memory, not on disk). If you are working with a very large collection of texts, you may encounter memory errors after issuing the commands above. Provided your corpus is not truly massive, it may be advisable to locate a machine with a greater amount of memory. For example, these days it is possible to rent a machine with 64G of memory by the hour. Conducting experiments on a random subsample (small enough to fit into memory) is also recommended.

With this preparatory work behind us, querying the document-term matrix is simple. For example, the following demonstrate two ways finding how many times the word ‘house’ occurs in the first text, Emma:

# the first file, indexed by 0 in Python, is *Emma*
In [10]: filenames[0] == 'data/austen-brontë/Austen_Emma.txt'
Out[10]: True

# use the standard Python list method index(...)
# list(vocab) or vocab.tolist() will take vocab (an array) and return a list
In [11]: house_idx = list(vocab).index('house')

In [12]: dtm[0, house_idx]
Out[12]: 95

# using NumPy indexing will be more natural for many
# in R this would be essentially the same, dtm[1, vocab == 'house']
In [13]: dtm[0, vocab == 'house']
Out[13]: array([95], dtype=int64)

Although dtm is technically a NumPy array, I will keep referring to dtm as a matrix. Note that NumPy arrays do support matrix operations such as dot product. (If X and Y have compatible dimensions, is the matrix product \(XY\).)


NumPy does make available a matrix data structure which can be useful if you are doing lots of matrix operations such as matrix product, inverse, and so forth. In general, however, it is best to stick to NumPy arrays. In fact, if you are using Python 3.5 you can make use of the matrix multiplication operator @ and dispense with any need for the matrix type.

Just so we have a sense of what we have just created, here is a section of the document-term matrix for a handful of selected words:

and emma home house of the
Austen_Emma.txt 4896 865 130 95 4291 5201
Austen_Pride.txt 3584 0 66 107 3609 4330
Austen_Sense.txt 3491 0 69 161 3572 4105
CBronte_Jane.txt 6626 0 80 182 4364 7846
CBronte_Professor.txt 2936 0 37 93 2663 3836
CBronte_Villette.txt 6374 0 121 129 4845 8363

Comparing texts

Arranging our texts in a document-term matrix make available a range of exploratory procedures. For example, calculating a measure of similarity between texts becomes simple. Since each row of the document-term matrix is a sequence of a novel’s word frequencies, it is possible to put mathematical notions of similarity (or distance) between sequences of numbers in service of calculating the similarity (or distnace) between any two novels. One frequently used measure of distance between vectors (a measure easily converted into a measure of similarity) is Euclidean distance. The Euclidean distance between two vectors in the plane should be familiar from geometry, as it is the length of the hypotenuse that joins the two vectors. For instance, consider the Euclidean distance between the vectors \(\vec{x} = (1, 3)\) and \(\vec{y} = (4, 2)\). The distance between the two vectors is \(\sqrt{(1-4)^2 + (3-2)^2} = \sqrt{10}\).


Measures of distance can be converted into measures of similarity. If your measures of distance are all between zero and one, then a measure of similarity could be one minus the distance. (The inverse of the distance would also serve as a measure of similarity.)

\useasboundingbox (0,0) rectangle (5,5);
 \draw [<->,thick] (0,5) node (yaxis) [above] {} |- (5,0) node (xaxis) [right] {};
 \draw[step=1cm,gray,very thin] (0,0) grid (5,5);

 \draw [->, thick] (0,0) -- (1,3);
 \draw (1,3) node [above] {$(1,3) = \vec{x}$};

 \draw [->, thick] (0,0) -- (4,2);
 \draw (4,1.7) node [below] {$(4,2) =\vec{y}$};

 \draw [-, orange] (1,3) -- (4,2);
 \draw (3.3,2.5) node [above, orange] {$||\vec{x} - \vec{y}|| = \sqrt{10}$};

Distance between two vectors


More generally, given two vectors \(\vec{x}\) and \(\vec{y}\) in \(p\)-dimensional space, the Euclidean distance between the two vectors is given by

\(||\vec{x} - \vec{y}|| = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}\)

This concept of distance is not restricted to two dimensions. For example, it is not difficult to imagine the figure above translated into three dimensions. We can also persuade ourselves that the measure of distance extends to an arbitrary number of dimensions; for any two matched components in a pair of vectors (such as \(x_2\) and \(y_2\)), differences increase the distance.

Since two novels in our corpus now have an expression as vectors, we can calculate the Euclidean distance between them. We can do this by hand or we can avail ourselves of the scikit-learn function euclidean_distances.

# "by hand"
In [14]: n, _ = dtm.shape

In [15]: dist = np.zeros((n, n))

In [16]: for i in range(n):
   ....:     for j in range(n):
   ....:         x, y = dtm[i, :], dtm[j, :]
   ....:         dist[i, j] = np.sqrt(np.sum((x - y)**2))

In [17]: from sklearn.metrics.pairwise import euclidean_distances

In [18]: dist = euclidean_distances(dtm)

In [19]: np.round(dist, 1)
array([[    0. ,  3856.3,  4182.8,  5119.7,  7113.3,  5280.2],
       [ 3856.3,     0. ,  1922.6,  6313.1,  4126.2,  6381.2],
       [ 4182.8,  1922.6,     0. ,  6657.4,  4045.3,  6650.3],
       [ 5119.7,  6313.1,  6657.4,     0. ,  8363.8,  2591.5],
       [ 7113.3,  4126.2,  4045.3,  8363.8,     0. ,  8484.1],
       [ 5280.2,  6381.2,  6650.3,  2591.5,  8484.1,     0. ]])

# *Pride and Prejudice* is index 1 and *Jane Eyre* is index 3
In [20]: filenames[1] == 'data/austen-brontë/Austen_Pride.txt'
Out[20]: True

In [21]: filenames[3] == 'data/austen-brontë/CBronte_Jane.txt'
Out[21]: True

# the distance between *Pride and Prejudice* and *Jane Eyre*
In [22]: dist[1, 3]
Out[22]: 6313.0833987838305

# which is greater than the distance between *Jane Eyre* and *Villette* (index 5)
In [23]: dist[1, 3] > dist[3, 5]
Out[23]: True

And if we want to use a measure of distance that takes into consideration the length of the novels (an excellent idea), we can calculate the cosine similarity by importing sklearn.metrics.pairwise.cosine_similarity and use it in place of euclidean_distances.

Keep in mind that cosine similarity is a measure of similarity (rather than distance) that ranges between 0 and 1 (as it is the cosine of the angle between the two vectors). In order to get a measure of distance (or dissimilarity), we need to “flip” the measure so that a larger angle receives a larger value. The distance measure derived from cosine similarity is therefore one minus the cosine similarity between two vectors.

In [24]: from sklearn.metrics.pairwise import cosine_similarity

In [25]: dist = 1 - cosine_similarity(dtm)

In [26]: np.round(dist, 2)
array([[-0.  ,  0.02,  0.03,  0.05,  0.06,  0.05],
       [ 0.02,  0.  ,  0.02,  0.05,  0.04,  0.04],
       [ 0.03,  0.02,  0.  ,  0.06,  0.05,  0.05],
       [ 0.05,  0.05,  0.06,  0.  ,  0.02,  0.01],
       [ 0.06,  0.04,  0.05,  0.02, -0.  ,  0.01],
       [ 0.05,  0.04,  0.05,  0.01,  0.01, -0.  ]])

# the distance between *Pride and Prejudice* (index 1)
# and *Jane Eyre* (index 3) is
In [27]: dist[1, 3]
Out[27]: 0.047026234323162663

# which is greater than the distance between *Jane Eyre* and
# *Villette* (index 5)
In [28]: dist[1, 3] > dist[3, 5]
Out[28]: True

Those interested in doing the calculation for themselves can use the following steps:

In [29]: norms = np.sqrt(np.sum(dtm * dtm, axis=1, keepdims=True))  # multiplication between arrays is element-wise

In [30]: dtm_normed = dtm / norms

In [31]: similarities =, dtm_normed.T)

In [32]: np.round(similarities, 2)
array([[ 1.  ,  0.98,  0.97,  0.95,  0.94,  0.95],
       [ 0.98,  1.  ,  0.98,  0.95,  0.96,  0.96],
       [ 0.97,  0.98,  1.  ,  0.94,  0.95,  0.95],
       [ 0.95,  0.95,  0.94,  1.  ,  0.98,  0.99],
       [ 0.94,  0.96,  0.95,  0.98,  1.  ,  0.99],
       [ 0.95,  0.96,  0.95,  0.99,  0.99,  1.  ]])

# similarities between *Pride and Prejudice* and *Jane Eyre* is
In [33]: similarities[1, 3]
Out[33]: 0.95297376567683734
Austen_Emma Austen_Pride Austen_Sense CBronte_Jane CBronte_Professor CBronte_Villette
Austen_Emma -0.00 0.02 0.03 0.05 0.06 0.05
Austen_Pride 0.02 0.00 0.02 0.05 0.04 0.04
Austen_Sense 0.03 0.02 0.00 0.06 0.05 0.05
CBronte_Jane 0.05 0.05 0.06 0.00 0.02 0.01
CBronte_Professor 0.06 0.04 0.05 0.02 -0.00 0.01
CBronte_Villette 0.05 0.04 0.05 0.01 0.01 -0.00

Visualizing distances

It is often desirable to visualize the pairwise distances between our texts. A general approach to visualizing distances is to assign a point in a plane to each text, making sure that the distance between points is proportional to the pairwise distances we calculated. This kind of visualization is common enough that it has a name, “multidimensional scaling” (MDS) and family of functions in scikit-learn (and R too, see mdscale).

In [34]: import os  # for os.path.basename

In [35]: import matplotlib.pyplot as plt

In [36]: from sklearn.manifold import MDS

# two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
In [37]: mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

In [38]: pos = mds.fit_transform(dist)  # shape (n_components, n_samples)
In [39]: xs, ys = pos[:, 0], pos[:, 1]

# short versions of filenames:
# convert 'data/austen-brontë/Austen_Emma.txt' to 'Austen_Emma'
In [40]: names = [os.path.basename(fn).replace('.txt', '') for fn in filenames]

# color-blind-friendly palette
In [41]: for x, y, name in zip(xs, ys, names):
   ....:     color = 'orange' if "Austen" in name else 'skyblue'
   ....:     plt.scatter(x, y, c=color)
   ....:     plt.text(x, y, name)

In [42]:

We can also do MDS in three dimensions:

# après Jeremy M. Stober, Tim Vieira
In [43]: mds = MDS(n_components=3, dissimilarity="precomputed", random_state=1)

In [44]: pos = mds.fit_transform(dist)
In [45]: from mpl_toolkits.mplot3d import Axes3D

In [46]: fig = plt.figure()

In [47]: ax = fig.add_subplot(111, projection='3d')

In [48]: ax.scatter(pos[:, 0], pos[:, 1], pos[:, 2])
Out[48]: <mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x2b96c03c1470>

In [49]: for x, y, z, s in zip(pos[:, 0], pos[:, 1], pos[:, 2], names):
   ....:     ax.text(x, y, z, s)

In [50]:

Clustering texts based on distance

Clustering texts into discrete groups of similar texts is often a useful exploratory step. For example, a researcher may be wondering if certain textual features partition a collection of texts by author or by genre. Pairwise distances alone do not produce any kind of classification. To put a set of distance measurements to work in classification requires additional assumptions, such as a definition of a group or cluster.

The ideas underlying the transition from distances to clusters are, for the most part, common sense. Any clustering of texts should result in texts that are closer to each other (in the distance matrix) residing in the same cluster. There are many ways of satisfying this requirement; there no unique clustering based on distances that is the “best”. One strategy for clustering in circulation is called Ward’s method. Rather than producing a single clustering, Ward’s method produces a hierarchy of clusterings, as we will see in a moment. All that Ward’s method requires is a set of pairwise distance measurements–such as those we calculated a moment ago. Ward’s method produces a hierarchical clustering of texts via the following procedure:

  1. Start with each text in its own cluster
  2. Until only a single cluster remains,
    • Find the closest clusters and merge them. The distance between two clusters is the change in the sum of squared distances when they are merged.
  3. Return a tree containing a record of cluster-merges.

The function scipy.cluster.hierarchy.ward performs this algorithm and returns a tree of cluster-merges. The hierarchy of clusters can be visualized using scipy.cluster.hierarchy.dendrogram.

In [51]: from scipy.cluster.hierarchy import ward, dendrogram

In [52]: linkage_matrix = ward(dist)

# match dendrogram to that returned by R's hclust()
In [53]: dendrogram(linkage_matrix, orientation="right", labels=names)
{'color_list': ['g', 'g', 'r', 'r', 'b'],
 'dcoord': [[0.0, 0.016230837530893799, 0.016230837530893799, 0.0],
  [0.0, 0.025545848899443196, 0.025545848899443196, 0.016230837530893799],
  [0.0, 0.026664938673982768, 0.026664938673982768, 0.0],
  [0.0, 0.039973173157413638, 0.039973173157413638, 0.026664938673982768],
 'icoord': [[15.0, 15.0, 25.0, 25.0],
  [5.0, 5.0, 20.0, 20.0],
  [45.0, 45.0, 55.0, 55.0],
  [35.0, 35.0, 50.0, 50.0],
  [12.5, 12.5, 42.5, 42.5]],
 'ivl': ['CBronte_Jane',
 'leaves': [3, 4, 5, 0, 1, 2]}

In [54]: plt.tight_layout()  # fixes margins

In [55]:

For those familiar with R, the procedure is performed as follows:

labels = c('Austen_Emma', 'Austen_Pride', 'Austen_Sense', 'CBronte_Jane',
           'CBronte_Professor', 'CBronte_Villette')
dtm_normed = dtm / rowSums(dtm)
dist_matrix = dist(dtm_normed)
tree = hclust(dist_matrix, method="ward")
plot(tree, labels=labels)


  1. Find two different ways of determining the number of times the word ‘situation’ appears in Emma. (Make sure the methods produce the same result.)
  2. Working with the strings below as documents and using CountVectorizer with the input='content' parameter, create a document-term matrix. Apart from the input parameter, use the default settings.
In [56]: text1 = "Indeed, she had a rather kindly disposition."

In [57]: text2 = "The real evils, indeed, of Emma's situation were the power of having rather too much her own way, and a disposition to think a little too well of herself;"

In [58]: text3 = "The Jaccard distance is a way of measuring the distance from one set to another set."
  1. Using the document-term matrix just created, calculate the Euclidean distance, Jaccard distance, and cosine distance between each pair of documents. Make sure to calculate distance (rather than similarity). Are our intuitions about which texts are most similar reflected in the measurements of distance?

For solutions, view the source for this document.