# Working with text¶

Note

This tutorial is available for interactive use
with IPython Notebook: `Working with text.ipynb`

.

## Creating a document-term matrix¶

Word (or n-gram) frequencies are typical units of analysis when working with text collections. It may come as a surprise that reducing a book to a list of word frequencies retains useful information, but practice has shown this to be the case. Treating texts as a list of word frequencies (a vector) also makes available a vast range of mathematical tools developed for studying and manipulating vectors.

Note

Turning texts into unordered lists (or “bags”) of words is easy in Python. Python Programming for the Humanities includes a chapter entitled Text Processing that describes the steps in detail.

This tutorial assumes some prior exposure to text analysis so we will gather
word frequencies (or term frequencies) associated with texts into
a document-term matrix using the CountVectorizer
class from the scikit-learn package.
(For those familiar with R and the tm package, this function performs
the same operation as `DocumentTermMatrix`

and takes recognizably similar
arguments.)

First we need to import the functions and classes we intend to use, along with
the customary abbreviation for functions in the `numpy`

package.

```
In [1]: import numpy as np # a conventional alias
In [2]: from sklearn.feature_extraction.text import CountVectorizer
```

Now we use the CountVectorizer
class to create a document-term matrix. `CountVectorizer`

is customizable. For
example, a list of “stop words” can be specified with the `stop_words`

parameter. Other important parameters include:

`lowercase`

(default`True`

) convert all text to lowercase before tokenizing`min_df`

(default`1`

) remove terms from the vocabulary that occur in fewer than`min_df`

documents (in a large corpus this may be set to`15`

or higher to eliminate very rare words)`vocabulary`

ignore words that do not appear in the provided list of words`strip_accents`

remove accents`token_pattern`

(default`u'(?u)\b\w\w+\b'`

) regular expression identifying tokens–by default words that consist of a single character (e.g., ‘a’, ‘2’) are ignored, setting`token_pattern`

to`'(?u)\b\w+\b'`

will include these tokens`tokenizer`

(default unused) use a custom function for tokenizing

For this example we will use texts by Jane Austen and Charlotte Brontë. These texts are available in Datasets.

```
In [3]: filenames = ['data/austen-brontë/Austen_Emma.txt',
...: 'data/austen-brontë/Austen_Pride.txt',
...: 'data/austen-brontë/Austen_Sense.txt',
...: 'data/austen-brontë/CBronte_Jane.txt',
...: 'data/austen-brontë/CBronte_Professor.txt',
...: 'data/austen-brontë/CBronte_Villette.txt']
...:
In [4]: vectorizer = CountVectorizer(input='filename')
In [5]: dtm = vectorizer.fit_transform(filenames) # a sparse matrix
In [6]: vocab = vectorizer.get_feature_names() # a list
```

Now we have a document-term matrix and a vocabulary list. Before we can query
the matrix and find out, for example, how many times the word ‘house’ occurs in
*Emma* (the first text in `filenames`

), we need to convert this matrix from
its current format, a sparse matrix, into a normal NumPy
array. We will also convert the Python list storing our vocabulary, `vocab`

,
into a NumPy array, as an array supports a greater variety of operations than
a list.

```
# for reference, note the current class of `dtm`
In [7]: type(dtm)
Out[7]: scipy.sparse.csr.csr_matrix
In [8]: dtm = dtm.toarray() # convert to a regular array
In [9]: vocab = np.array(vocab)
```

Note

A sparse matrix only records non-zero entries and is used to store
matrices that contain a significant number of entries that are zero. To
understand why this matters enough that `CountVectorizer`

returns a sparse
matrix by default, consider a 4000 by 50000 matrix of word frequencies that
is 60% zeros. In Python an integer takes up four bytes, so using a sparse
matrix saves almost 500M of memory, which is a considerable amount of
computer memory in the 2010s. (Recall that Python objects such as arrays are stored in
memory, not on disk). If you are working with a very large collection
of texts, you may encounter memory errors after issuing the commands above.
Provided your corpus is not truly massive, it may be advisable to locate
a machine with a greater amount of memory. For example, these days it is possible to
rent a machine with 64G of memory by the hour. Conducting experiments
on a random subsample (small enough to fit into memory) is also recommended.

With this preparatory work behind us, querying the document-term matrix is
simple. For example, the following demonstrate two ways finding how many times
the word ‘house’ occurs in the first text, *Emma*:

```
# the first file, indexed by 0 in Python, is *Emma*
In [10]: filenames[0] == 'data/austen-brontë/Austen_Emma.txt'
Out[10]: True
# use the standard Python list method index(...)
# list(vocab) or vocab.tolist() will take vocab (an array) and return a list
In [11]: house_idx = list(vocab).index('house')
In [12]: dtm[0, house_idx]
Out[12]: 95
# using NumPy indexing will be more natural for many
# in R this would be essentially the same, dtm[1, vocab == 'house']
In [13]: dtm[0, vocab == 'house']
Out[13]: array([95], dtype=int64)
```

Although dtm is technically a NumPy array, I will keep referring to dtm as
a matrix. Note that NumPy arrays do support matrix operations such as dot
product. (If `X`

and `Y`

have compatible dimensions, `X.dot(Y)`

is the
matrix product \(XY\).)

Note

NumPy does make available a matrix
data structure which can be useful if you are doing lots of matrix
operations such as matrix product, inverse, and so forth. In general,
however, it is best to stick to NumPy arrays. In fact, if you are
using Python 3.5 you can make use of the matrix multiplication operator `@`

and dispense with any need for the `matrix`

type.

Just so we have a sense of what we have just created, here is a section of the document-term matrix for a handful of selected words:

and | emma | home | house | of | the | |
---|---|---|---|---|---|---|

Austen_Emma.txt | 4896 | 865 | 130 | 95 | 4291 | 5201 |

Austen_Pride.txt | 3584 | 0 | 66 | 107 | 3609 | 4330 |

Austen_Sense.txt | 3491 | 0 | 69 | 161 | 3572 | 4105 |

CBronte_Jane.txt | 6626 | 0 | 80 | 182 | 4364 | 7846 |

CBronte_Professor.txt | 2936 | 0 | 37 | 93 | 2663 | 3836 |

CBronte_Villette.txt | 6374 | 0 | 121 | 129 | 4845 | 8363 |

## Comparing texts¶

Arranging our texts in a document-term matrix make available a range of exploratory procedures. For example, calculating a measure of similarity between texts becomes simple. Since each row of the document-term matrix is a sequence of a novel’s word frequencies, it is possible to put mathematical notions of similarity (or distance) between sequences of numbers in service of calculating the similarity (or distnace) between any two novels. One frequently used measure of distance between vectors (a measure easily converted into a measure of similarity) is Euclidean distance. The Euclidean distance between two vectors in the plane should be familiar from geometry, as it is the length of the hypotenuse that joins the two vectors. For instance, consider the Euclidean distance between the vectors \(\vec{x} = (1, 3)\) and \(\vec{y} = (4, 2)\). The distance between the two vectors is \(\sqrt{(1-4)^2 + (3-2)^2} = \sqrt{10}\).

Note

Measures of distance can be converted into measures of similarity. If your measures of distance are all between zero and one, then a measure of similarity could be one minus the distance. (The inverse of the distance would also serve as a measure of similarity.)

Note

More generally, given two vectors \(\vec{x}\) and \(\vec{y}\) in \(p\)-dimensional space, the Euclidean distance between the two vectors is given by

\(||\vec{x} - \vec{y}|| = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}\)

This concept of distance is not restricted to two dimensions. For example, it is not difficult to imagine the figure above translated into three dimensions. We can also persuade ourselves that the measure of distance extends to an arbitrary number of dimensions; for any two matched components in a pair of vectors (such as \(x_2\) and \(y_2\)), differences increase the distance.

Since two novels in our corpus now have an expression as vectors, we can
calculate the Euclidean distance between them. We can do this by hand or we can
avail ourselves of the `scikit-learn`

function `euclidean_distances`

.

```
# "by hand"
In [14]: n, _ = dtm.shape
In [15]: dist = np.zeros((n, n))
In [16]: for i in range(n):
....: for j in range(n):
....: x, y = dtm[i, :], dtm[j, :]
....: dist[i, j] = np.sqrt(np.sum((x - y)**2))
....:
In [17]: from sklearn.metrics.pairwise import euclidean_distances
In [18]: dist = euclidean_distances(dtm)
In [19]: np.round(dist, 1)
Out[19]:
array([[ 0. , 3856.3, 4182.8, 5119.7, 7113.3, 5280.2],
[ 3856.3, 0. , 1922.6, 6313.1, 4126.2, 6381.2],
[ 4182.8, 1922.6, 0. , 6657.4, 4045.3, 6650.3],
[ 5119.7, 6313.1, 6657.4, 0. , 8363.8, 2591.5],
[ 7113.3, 4126.2, 4045.3, 8363.8, 0. , 8484.1],
[ 5280.2, 6381.2, 6650.3, 2591.5, 8484.1, 0. ]])
# *Pride and Prejudice* is index 1 and *Jane Eyre* is index 3
In [20]: filenames[1] == 'data/austen-brontë/Austen_Pride.txt'
Out[20]: True
In [21]: filenames[3] == 'data/austen-brontë/CBronte_Jane.txt'
Out[21]: True
# the distance between *Pride and Prejudice* and *Jane Eyre*
In [22]: dist[1, 3]
Out[22]: 6313.0833987838305
# which is greater than the distance between *Jane Eyre* and *Villette* (index 5)
In [23]: dist[1, 3] > dist[3, 5]
Out[23]: True
```

And if we want to use a measure of distance that takes into consideration the
length of the novels (an excellent idea), we can calculate the cosine
similarity
by importing `sklearn.metrics.pairwise.cosine_similarity`

and use it in place
of euclidean_distances.

Keep in mind that cosine similarity is a measure of similarity (rather than distance) that ranges between 0 and 1 (as it is the cosine of the angle between the two vectors). In order to get a measure of distance (or dissimilarity), we need to “flip” the measure so that a larger angle receives a larger value. The distance measure derived from cosine similarity is therefore one minus the cosine similarity between two vectors.

```
In [24]: from sklearn.metrics.pairwise import cosine_similarity
In [25]: dist = 1 - cosine_similarity(dtm)
In [26]: np.round(dist, 2)
Out[26]:
array([[-0. , 0.02, 0.03, 0.05, 0.06, 0.05],
[ 0.02, 0. , 0.02, 0.05, 0.04, 0.04],
[ 0.03, 0.02, 0. , 0.06, 0.05, 0.05],
[ 0.05, 0.05, 0.06, 0. , 0.02, 0.01],
[ 0.06, 0.04, 0.05, 0.02, -0. , 0.01],
[ 0.05, 0.04, 0.05, 0.01, 0.01, -0. ]])
# the distance between *Pride and Prejudice* (index 1)
# and *Jane Eyre* (index 3) is
In [27]: dist[1, 3]
Out[27]: 0.047026234323162663
# which is greater than the distance between *Jane Eyre* and
# *Villette* (index 5)
In [28]: dist[1, 3] > dist[3, 5]
Out[28]: True
```

Those interested in doing the calculation for themselves can use the following steps:

```
In [29]: norms = np.sqrt(np.sum(dtm * dtm, axis=1, keepdims=True)) # multiplication between arrays is element-wise
In [30]: dtm_normed = dtm / norms
In [31]: similarities = np.dot(dtm_normed, dtm_normed.T)
In [32]: np.round(similarities, 2)
Out[32]:
array([[ 1. , 0.98, 0.97, 0.95, 0.94, 0.95],
[ 0.98, 1. , 0.98, 0.95, 0.96, 0.96],
[ 0.97, 0.98, 1. , 0.94, 0.95, 0.95],
[ 0.95, 0.95, 0.94, 1. , 0.98, 0.99],
[ 0.94, 0.96, 0.95, 0.98, 1. , 0.99],
[ 0.95, 0.96, 0.95, 0.99, 0.99, 1. ]])
# similarities between *Pride and Prejudice* and *Jane Eyre* is
In [33]: similarities[1, 3]
Out[33]: 0.95297376567683734
```

Austen_Emma | Austen_Pride | Austen_Sense | CBronte_Jane | CBronte_Professor | CBronte_Villette | |
---|---|---|---|---|---|---|

Austen_Emma | -0.00 | 0.02 | 0.03 | 0.05 | 0.06 | 0.05 |

Austen_Pride | 0.02 | 0.00 | 0.02 | 0.05 | 0.04 | 0.04 |

Austen_Sense | 0.03 | 0.02 | 0.00 | 0.06 | 0.05 | 0.05 |

CBronte_Jane | 0.05 | 0.05 | 0.06 | 0.00 | 0.02 | 0.01 |

CBronte_Professor | 0.06 | 0.04 | 0.05 | 0.02 | -0.00 | 0.01 |

CBronte_Villette | 0.05 | 0.04 | 0.05 | 0.01 | 0.01 | -0.00 |

## Visualizing distances¶

It is often desirable to visualize the pairwise distances between our texts.
A general approach to visualizing distances is to assign a point in a plane to
each text, making sure that the distance between points is proportional to the
pairwise distances we calculated. This kind of visualization is common enough
that it has a name, “multidimensional scaling” (MDS) and family of
functions in `scikit-learn`

(and R too, see `mdscale`

).

```
In [34]: import os # for os.path.basename
In [35]: import matplotlib.pyplot as plt
In [36]: from sklearn.manifold import MDS
# two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
In [37]: mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
In [38]: pos = mds.fit_transform(dist) # shape (n_components, n_samples)
```

```
In [39]: xs, ys = pos[:, 0], pos[:, 1]
# short versions of filenames:
# convert 'data/austen-brontë/Austen_Emma.txt' to 'Austen_Emma'
In [40]: names = [os.path.basename(fn).replace('.txt', '') for fn in filenames]
# color-blind-friendly palette
In [41]: for x, y, name in zip(xs, ys, names):
....: color = 'orange' if "Austen" in name else 'skyblue'
....: plt.scatter(x, y, c=color)
....: plt.text(x, y, name)
....:
In [42]: plt.show()
```

We can also do MDS in three dimensions:

```
# après Jeremy M. Stober, Tim Vieira
# https://github.com/timvieira/viz/blob/master/mds.py
In [43]: mds = MDS(n_components=3, dissimilarity="precomputed", random_state=1)
In [44]: pos = mds.fit_transform(dist)
```

```
In [45]: from mpl_toolkits.mplot3d import Axes3D
In [46]: fig = plt.figure()
In [47]: ax = fig.add_subplot(111, projection='3d')
In [48]: ax.scatter(pos[:, 0], pos[:, 1], pos[:, 2])
Out[48]: <mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x2b96c03c1470>
In [49]: for x, y, z, s in zip(pos[:, 0], pos[:, 1], pos[:, 2], names):
....: ax.text(x, y, z, s)
....:
In [50]: plt.show()
```

## Clustering texts based on distance¶

Clustering texts into discrete groups of similar texts is often a useful exploratory step. For example, a researcher may be wondering if certain textual features partition a collection of texts by author or by genre. Pairwise distances alone do not produce any kind of classification. To put a set of distance measurements to work in classification requires additional assumptions, such as a definition of a group or cluster.

The ideas underlying the transition from distances to clusters are, for the most part, common sense. Any clustering of texts should result in texts that are closer to each other (in the distance matrix) residing in the same cluster. There are many ways of satisfying this requirement; there no unique clustering based on distances that is the “best”. One strategy for clustering in circulation is called Ward’s method. Rather than producing a single clustering, Ward’s method produces a hierarchy of clusterings, as we will see in a moment. All that Ward’s method requires is a set of pairwise distance measurements–such as those we calculated a moment ago. Ward’s method produces a hierarchical clustering of texts via the following procedure:

- Start with each text in its own cluster
- Until only a single cluster remains,
- Find the closest clusters and merge them. The distance between two clusters is the change in the sum of squared distances when they are merged.

- Return a tree containing a record of cluster-merges.

The function scipy.cluster.hierarchy.ward performs
this algorithm and returns a tree of cluster-merges. The hierarchy of clusters
can be visualized using `scipy.cluster.hierarchy.dendrogram`

.

```
In [51]: from scipy.cluster.hierarchy import ward, dendrogram
In [52]: linkage_matrix = ward(dist)
# match dendrogram to that returned by R's hclust()
In [53]: dendrogram(linkage_matrix, orientation="right", labels=names)
Out[53]:
{'color_list': ['g', 'g', 'r', 'r', 'b'],
'dcoord': [[0.0, 0.016230837530893799, 0.016230837530893799, 0.0],
[0.0, 0.025545848899443196, 0.025545848899443196, 0.016230837530893799],
[0.0, 0.026664938673982768, 0.026664938673982768, 0.0],
[0.0, 0.039973173157413638, 0.039973173157413638, 0.026664938673982768],
[0.025545848899443196,
0.16535482370281634,
0.16535482370281634,
0.039973173157413638]],
'icoord': [[15.0, 15.0, 25.0, 25.0],
[5.0, 5.0, 20.0, 20.0],
[45.0, 45.0, 55.0, 55.0],
[35.0, 35.0, 50.0, 50.0],
[12.5, 12.5, 42.5, 42.5]],
'ivl': ['CBronte_Jane',
'CBronte_Professor',
'CBronte_Villette',
'Austen_Emma',
'Austen_Pride',
'Austen_Sense'],
'leaves': [3, 4, 5, 0, 1, 2]}
In [54]: plt.tight_layout() # fixes margins
In [55]: plt.show()
```

For those familiar with R, the procedure is performed as follows:

```
labels = c('Austen_Emma', 'Austen_Pride', 'Austen_Sense', 'CBronte_Jane',
'CBronte_Professor', 'CBronte_Villette')
dtm_normed = dtm / rowSums(dtm)
dist_matrix = dist(dtm_normed)
tree = hclust(dist_matrix, method="ward")
plot(tree, labels=labels)
```

## Exercises¶

- Find two different ways of determining the number of times the word
‘situation’ appears in
*Emma*. (Make sure the methods produce the same result.) - Working with the strings below as documents and using
`CountVectorizer`

with the`input='content'`

parameter, create a document-term matrix. Apart from the`input`

parameter, use the default settings.

```
In [56]: text1 = "Indeed, she had a rather kindly disposition."
In [57]: text2 = "The real evils, indeed, of Emma's situation were the power of having rather too much her own way, and a disposition to think a little too well of herself;"
In [58]: text3 = "The Jaccard distance is a way of measuring the distance from one set to another set."
```

- Using the document-term matrix just created, calculate the Euclidean distance, Jaccard distance, and cosine distance between each pair of documents. Make sure to calculate distance (rather than similarity). Are our intuitions about which texts are most similar reflected in the measurements of distance?

*For solutions, view the source for this document.*