Classification, Machine Learning, and Logistic Regression

Previous tutorials have illustrated how probabilistic topic models can be used to navigate a large corpus. Topic models and non-negative matrix factorization are useful as general-purpose tools for identifying recurrent themes in a corpus, and they perform better than methods previously used for similar purposes, such as principal component analysis (PCA) and latent semantic analysis (LSA). For tasks such as classifying texts into a known set of categories, however, there exist methods better suited to the problem. One family of such methods goes under the heading of neural networks (or, more recently, “deep learning”). An essential conceptual and practical building block for these methods is logistic regression, which we will review briefly in this tutorial.


Discussion of the role of logistic regression in neural networks may be found in section 5.1 of Bishop (2007) [Bis07].

Predicting genre classifications

The bag-of-words model is a horrible model of a text. Its failure to distinguish word order (‘the cat ate the fish’ from ‘the fish ate the cat’) is the least of its failings. In most cases, knowing the frequency with which a word occurs in a text tells us very little. Without additional context it is difficult to know how to interpret a word’s frequency. For example, the word ‘heart’ might occur in a discussion of courtly love, of physical exercise, or in a cookbook (e.g., “heart of palm”). And even when a word seems to have a single interpretation, its meaning may depend on words occurring around it.
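The order-blindness is easy to make concrete with a minimal sketch using Python's standard library: the two sentences above reduce to exactly the same bag of words.

```python
from collections import Counter

# the two example sentences differ only in word order
bow_1 = Counter('the cat ate the fish'.split())
bow_2 = Counter('the fish ate the cat'.split())

# a bag-of-words representation cannot tell them apart
print(bow_1 == bow_2)  # True
```

Everything that distinguishes predator from prey here lives in the word order, which the representation discards.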

Nevertheless, sometimes the frequency of words appears to be correlated with useful information, such as pre-existing classifications (or classifications in which we happen to believe). Consider the word “ennemis” (“enemies”) in the context of a corpus of French classical theatre. This corpus includes only plays classified as tragedy or comedy. The word “ennemis” is not, at first glance, particularly troubled by problems of polysemy. Considered as an indicator of whether a play is a tragedy or a comedy, the frequency of “ennemis” seems to be a reliable guide; the word tends to occur more often in tragedies.

The first way we can verify this is simply to calculate the percentage of plays classified as tragedy in which the word ‘ennemis’ occurs and compare that percentage with the corresponding percentage for comedies. As usual, in order to have a better sense of the variability of language in French classical theatre, we have split the plays into approximately 1,000-word sections.

In [1]: import os

In [2]: import numpy as np

In [3]: from sklearn.feature_extraction.text import CountVectorizer

In [4]: data_dir = 'data/french-tragedies-and-comedies-split/'

In [5]: filenames = np.array(os.listdir(data_dir))

In [6]: filenames_with_path = [os.path.join(data_dir, fn) for fn in filenames]

# tragedies and comedies are coded with 'TR' or 'CO',
# e.g., PCorneille_TR-V-1647-Heraclius0001.txt
In [7]: genre = []

In [8]: for fn in filenames:
   ...:     genre.append('tragedy' if '_TR-' in fn else 'comedy')

In [9]: genre = np.array(genre)

# .strip() removes the trailing newline '\n' from each line in the file
In [10]: french_stopwords = [l.strip() for l in open('data/stopwords/french.txt')]

In [11]: vectorizer = CountVectorizer(input='filename', min_df=15, max_df=.95, stop_words=french_stopwords, max_features=3000)

In [12]: dtm = vectorizer.fit_transform(filenames_with_path)

In [13]: dtm = dtm.toarray()

In [14]: vocab = np.array(vectorizer.get_feature_names())

# texts are split into documents of approximately equal length, so we will
# skip the normalization step and deal directly with counts

Having assembled the corpus, it is easy to calculate the proportion of play sections in which ‘ennemis’ occurs.

In [15]: word = "ennemis"

In [16]: tragedy_counts = dtm[genre == 'tragedy', vocab == word]

In [17]: comedy_counts = dtm[genre == 'comedy', vocab == word]

# tragedy percentage
In [18]: np.count_nonzero(tragedy_counts) / len(tragedy_counts)
Out[18]: 0.3329577464788732

# comedy percentage
In [19]: np.count_nonzero(comedy_counts) / len(comedy_counts)
Out[19]: 0.058645707376058044

# overall percentage
In [20]: np.count_nonzero(dtm[:, vocab == word]) / len(dtm)
Out[20]: 0.2006415864683581

# text in which "ennemis" appears the most
In [21]: filenames[np.argmax(dtm[:, vocab == word])], np.max(dtm[:, vocab == word])
Out[21]: ('Racine_TR-V-1664-Thebaide0007.txt', 6)

In our sample, if a play section is from a tragedy it features the word ‘ennemis’ about a third of the time. Among comedy sections, the word appears in only about six percent. (Recall, however, that in the majority of play sections the word does not appear at all.) While this gives us a rough sense of the relationship between the word ‘ennemis’ and genre, we may want to describe the relationship more precisely. First, we would like to consider the relationship between the word’s frequency (rather than just its presence or absence) and a text’s classification. Second, we want to predict the classification of a section of a play for which we do not have a classification ready at hand. Logistic regression accomplishes both of these tasks.

Like linear regression, logistic regression will happily make predictions based on aleatory patterns in our data. It is therefore important to make sure we have some additional basis for believing there might be a correlation between the frequency of the word ‘ennemis’ and a genre classification. Our intuition tells us that the word (particularly in its plural form) does not belong in a comedy (or at least not in any great frequency), whereas we can imagine a variety of sentences using the word appearing in a tragedy. Consider, for example, the section of Racine’s Thebaide which features the six occurrences of the word (and plenty of ‘ennemi’ as well):

Plus qu'à mes ennemis la guerre m'est mortelle,
Et le courroux du ciel me la rend trop cruelle ;
Il s'arme contre moi de mon propre dessein,
Il se sert de mon bras pour me percer le sein.
La guerre s'allumait, lorsque pour mon supplice,
Hémon m'abandonna pour servir Polynice ;
Les deux frères par moi devinrent ennemis,
Et je devins, Attale, ennemi de mon fils.

In quantitative text analysis, a common way to represent a classification is as a binary outcome, e.g., 0 for comedy or 1 for tragedy. Whereas linear regression relates some quantity x to another quantity y, logistic regression relates a quantity x to the probability of something being a member of one of two groups, that is, the probability of y having a value of 1.

For reasons covered in greater detail at the end of this section, the probability of classification is expressed not in terms of probability (from 0 to 1) but in log odds. This is not a mysterious transformation. Indeed, in certain countries (and among individuals involved in gambling) expressing the likelihood of an event in terms of odds is common. Moving between probability, odds, and log odds is somewhat tedious but not difficult. For example, an event that occurs with probability 0.75 has odds of 3 (often expressed 3:1) and log odds of roughly 1.1. Logistic regression delivers, for any value of x (here the frequency of the word ‘ennemis’), the log odds of a play section being from a tragedy. Typically we immediately convert the log odds into probability, as the latter is more familiar.


For very rare or very probable events using odds (and even log odds) can be preferable to using probabilities. Consider the Intergovernmental Panel on Climate Change’s guidance on addressing uncertainties.

Terminology               Likelihood              Odds        Log odds
Virtually certain         > 99% probability       > 99        > 4.6
Very likely               > 90% probability       > 9         > 2.2
Likely                    > 66% probability       > 2         > 0.7
About as likely as not    33 to 66% probability   0.5 to 2    -0.7 to 0.7
Unlikely                  < 33% probability       < 0.5       < -0.7
Very unlikely             < 10% probability       < 0.1       < -2.2
Exceptionally unlikely    < 1% probability        < 0.01      < -4.6

Note that whereas moving from a likelihood of 33% to 66% corresponds to moving from 0.5 to 2 on the odds scale, moving from 90% to 99% entails moving from 9 to 99. The odds scale better expresses the difference between an event that happens 9 times out of 10 and one that happens 99 times out of 100.
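The conversions used in the table are straightforward to write down. The following sketch (the function names are our own) moves between the three scales:

```python
import math

def odds(p):
    """Probability -> odds, e.g. 0.75 -> 3.0 (i.e., 3:1)."""
    return p / (1 - p)

def log_odds(p):
    """Probability -> log odds (the logit function)."""
    return math.log(odds(p))

def inv_logit(z):
    """Log odds -> probability (the inverse logit)."""
    return 1 / (1 + math.exp(-z))
```

For example, odds(0.99) is 99 and log_odds(0.99) is about 4.6, matching the ‘virtually certain’ row of the table.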

First we will fit the logistic regression model using the statsmodels package and then, converting from log odds to the more familiar scale of probability, we will plot this estimated relationship.

In [22]: import statsmodels.api as sm

In [23]: wordfreq = dtm[:, vocab == "ennemis"]

# we need to add an intercept (whose coefficient is related to the
# probability that a play section will be classified a tragedy when
# the predictor is zero).
# This is done automatically in R and by sklearn's LogisticRegression
In [24]: X = sm.add_constant(wordfreq)

In [25]: model = sm.GLM(genre == 'tragedy', X, family=sm.families.Binomial())

In [26]: fit = model.fit()

In [27]: fit.params
Out[27]: array([-0.24896858,  1.53947171])

For those accustomed to fitting regression models in R, the following code produces precisely the same results:

data = data.frame(wordfreq = wordfreq, genre = genre == 'tragedy')
fit = glm(genre ~ wordfreq, data = data, family = binomial(link="logit"))

# note that R is implicitly adding a constant term. We can make this
# term explicit in our model if we choose (the results should be the same)
fit = glm(genre ~ 1 + wordfreq, data = data, family = binomial(link="logit"))

Using the fitted parameters of the model we can make a prediction for any given word frequency. For example, the probability that a section in which ‘ennemis’ occurs twice is from a tragedy is given by

In [28]: def invlogit(x):
   ....:     """Convert from log odds to probability"""
   ....:     return 1/(1+np.exp(-x))

In [29]: x = 2

In [30]: invlogit(fit.params[0] + fit.params[1] * x)
Out[30]: 0.94427427818274756

The following code plots the relationship between a section’s word frequency and the model’s estimate of the probability of a section being from a tragedy. The points on the figure mark the observations in the corpus. (The points have been jittered to improve readability.)

In [31]: xs = np.arange(wordfreq.min(), wordfreq.max() + 1, 0.1)

In [32]: ys = invlogit(fit.params[0] + fit.params[1] * xs)

In [33]: import matplotlib.pyplot as plt; plt.plot(xs, ys, linewidth=2)
Out[33]: [<matplotlib.lines.Line2D at 0x2b10c135c0f0>]

# jitter the outcomes (0 or 1) a bit
In [34]: jitter = np.random.random(len(genre)) / 5

In [35]: ys_outcomes = np.abs((genre == 'tragedy') - 0.01 - jitter)

In [36]: alpha = 0.7

# use different colors for the different classes
In [37]: plt.plot(wordfreq[genre == 'tragedy'], ys_outcomes[genre == 'tragedy'], 'b.', alpha=alpha)
Out[37]: [<matplotlib.lines.Line2D at 0x2b10c12c8438>]

In [38]: plt.plot(wordfreq[genre != 'tragedy'], ys_outcomes[genre != 'tragedy'], 'y.', alpha=alpha)
Out[38]: [<matplotlib.lines.Line2D at 0x2b10c12c8358>]

In [39]: plt.xlabel("Word frequency")
Out[39]: <matplotlib.text.Text at 0x2b10c10a3278>

In [40]: plt.ylabel("Predicted probability of play section being a tragedy")
Out[40]: <matplotlib.text.Text at 0x2b10c0ff4860>

In [41]: plt.title("Predicting genre by the frequency of 'ennemis'")
Out[41]: <matplotlib.text.Text at 0x2b10c0ff0ba8>

# make some final aesthetic adjustments of the plot boundary
In [42]: plt.xlim(-0.1, wordfreq.max() + 0.2); plt.tight_layout()

The figure illustrates what the model infers: if ‘ennemis’ appears more than three times in a section it will tend to be a tragedy with high probability.

As an experiment and an illustration of cross validation (also called out-of-sample validation), consider the task of predicting the classification of a section of text based on the frequency of ‘ennemis’ alone. From the 3,429 play sections in our corpus we will take one third of them at random and ask the model to predict their classification with the model fitted on the remaining sections. We will do this three times (once for each held-out third). The scikit-learn package makes this procedure embarrassingly easy, provided we use its version of logistic regression, which is designed for large datasets and differs slightly from the version provided by R and statsmodels. [1]

In [43]: from sklearn import cross_validation

In [44]: from sklearn import linear_model

In [45]: clf = linear_model.LogisticRegression()

In [46]: cross_validation.cross_val_score(clf, wordfreq, genre == 'tragedy')
Out[46]: array([ 0.63199301,  0.62467192,  0.62259194])

Since ‘ennemis’ only appears in 20% of the sections and appears more than once in only 5% of the sections, the model will only have useful information to work with in a fraction of the cases presented to it. Nevertheless, it does considerably better than a baseline of simply picking ‘tragedy’ every time, which would be expected to achieve 52% accuracy, as sections from tragedies make up 52% of the sections.
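The 52% baseline can be made explicit with a small sketch; the label array below is hypothetical, standing in for `genre == 'tragedy'` with the same class balance as the corpus.

```python
import numpy as np

# hypothetical labels: 52 tragedy sections, 48 comedy sections
y = np.array([True] * 52 + [False] * 48)

# the majority-class baseline always predicts the most common label
majority_label = y.mean() >= 0.5          # True, i.e., always guess tragedy
baseline_accuracy = np.mean(y == majority_label)
```

Any classifier worth fitting should beat this number; the 63% accuracy above clears it comfortably.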

Of course, if we give the model access to all the word frequencies in the corpus (not just ‘ennemis’) and ask it to make predictions it does much better:

In [47]: clf = linear_model.LogisticRegression()

In [48]: cross_validation.cross_val_score(clf, dtm, genre == 'tragedy')
Out[48]: array([ 0.97027972,  0.97725284,  0.9614711 ])


Those interested in using a large number of predictors (such as a matrix with 3,000 features) should use the implementation of logistic regression found in scikit-learn. Unlike the default version provided by R or statsmodels, scikit-learn's version includes a penalty or regularization term, which helps prevent the overfitting that can occur in models with a large number of predictors.

Logistic regression


Resources for those interested in learning about logistic (and linear) regression include Gelman and Hill (2006) [GH06] and Bishop (2007) [Bis07]. Stanford’s OpenClassroom also has a series of lectures devoted to logistic regression.

Linear regression is one way of thinking about the relationship between two variables. Logistic regression is a linear model as well; it assumes a linear, additive relationship between the predictors and the log odds of a classification. With a single predictor and an intercept term, the relationship between a classification and a predictor has the following symbolic expression:

\[\begin{split}P(y_i = \mathrm{tragedy}) &= \mathrm{logit}^{-1}(\beta_0 + \beta_1 x_i)\\ &= \frac{e^{\beta_0 + \beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}}\\ &= \frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}\\ &= \sigma(\beta_0 + \beta_1 x_i)\\\end{split}\]

Typically we have more than one observation. Letting \(\sigma(x_i\beta)\) stand in for \(\frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}\) the maximum likelihood estimate for \(\beta\) is the value of \(\beta\) which maximizes the log likelihood of the observations:

\[\begin{split}\log \prod_{i=1}^n P(y_i) &= \sum_{i=1}^n \left( y_i \log \sigma(x_i \beta) + (1 - y_i) \log (1 - \sigma(x_i \beta)) \right)\\\end{split}\]
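The sum on the right follows from writing the probability of each observed outcome as a single Bernoulli expression:

\[P(y_i) = \sigma(x_i \beta)^{y_i} \, \left(1 - \sigma(x_i \beta)\right)^{1 - y_i}\]

Taking the logarithm of the product over all observations turns the product into a sum and brings the exponents \(y_i\) and \(1 - y_i\) down as coefficients.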

While for linear regression there is frequently a closed-form solution for the maximum, logistic regression lacks a tidy solution. The solution (there is indeed a unique maximum) is typically found using iteratively reweighted least squares.
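Iteratively reweighted least squares amounts to Newton's method applied to the log likelihood. The following is a minimal sketch of the idea (it is not the statsmodels implementation, and it lacks the safeguards a production fitter needs, e.g., against perfectly separated data):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression coefficients by iteratively
    reweighted least squares (Newton-Raphson updates)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))   # current predicted probabilities
        w = p * (1 - p)                   # diagonal of the weight matrix W
        # Newton update: beta += (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta
```

At the maximum, the score equations Xᵀ(y − p) = 0 hold, which provides a convenient convergence check.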

The solution may be found in Python using statsmodels.api.GLM or in R using the built-in glm function. The two functions should yield identical results.

[1] Scikit-learn’s LogisticRegression includes a penalty term which helps prevent overfitting, something that is a major concern when the number of predictors exceeds the number of observations. Those wishing for a logistic regression model that mirrors R’s glm() should use statsmodels’s GLM.