{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" import nltk\n",
"except ModuleNotFoundError:\n",
" !pip install nltk"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"## This code downloads the required packages.\n",
"## You can run `nltk.download('all')` to download everything.\n",
"\n",
"nltk_packages = [\n",
" (\"reuters\", \"corpora/reuters.zip\")\n",
"]\n",
"\n",
"for pid, fid in nltk_packages:\n",
" try:\n",
" nltk.data.find(fid)\n",
" except LookupError:\n",
" nltk.download(pid)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up corpus"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from nltk.corpus import reuters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up train/test data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"train_documents, train_categories = zip(*[(reuters.raw(i), reuters.categories(i)) for i in reuters.fileids() if i.startswith('training/')])\n",
"test_documents, test_categories = zip(*[(reuters.raw(i), reuters.categories(i)) for i in reuters.fileids() if i.startswith('test/')])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"all_categories = sorted(list(set(reuters.categories())))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell defines a function **tokenize** that performs following actions:\n",
"- Receive a document as an argument to the function\n",
"- Tokenize the document using `nltk.word_tokenize()`\n",
"- Use `PorterStemmer` provided by the `nltk` to remove morphological affixes from each token\n",
"- Append stemmed token to an already defined list `stems`\n",
"- Return the list `stems`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from nltk.stem.porter import PorterStemmer\n",
"def tokenize(text):\n",
" tokens = nltk.word_tokenize(text)\n",
" stems = []\n",
" for item in tokens:\n",
" stems.append(PorterStemmer().stem(item))\n",
" return stems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To begin, I first used TF-IDF for feature selection on both train as well as test data using `TfidfVectorizer`.\n",
"\n",
"But first, What `TfidfVectorizer` actually does?\n",
"- `TfidfVectorizer` converts a collection of raw documents to a matrix of **TF-IDF** features.\n",
"\n",
"**TF-IDF**?\n",
"- TFIDF (abbreviation of the term *frequency–inverse document frequency*) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. [tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)\n",
"\n",
"**Why `TfidfVectorizer`**?\n",
"- `TfidfVectorizer` scale down the impact of tokens that occur very frequently (e.g., “a”, “the”, and “of”) in a given corpus. [Feature Extraction and Transformation](https://spark.apache.ac.cn/docs/latest/mllib-feature-extraction.html#tf-idf)\n",
"\n",
"I gave following two arguments to `TfidfVectorizer`:\n",
"- tokenizer: `tokenize` function\n",
"- stop_words\n",
"\n",
"Then I used `fit_transform` and `transform` on the train and test documents repectively.\n",
"\n",
"**Why `fit_transform` for training data while `transform` for test data**?\n",
"\n",
"To avoid data leakage during cross-validation, imputer computes the statistic on the train data during the `fit`, **stores it** and uses the same on the test data, during the `transform`. This also prevents the test data from appearing in `fit` operation."
]
},
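{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between `fit_transform` and `transform` can be seen on a toy corpus. The following is a minimal sketch, not part of the original pipeline: the two tiny document lists and the `toy_*` names are made up for illustration, and the default tokenizer is used instead of `tokenize`.\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Hypothetical toy data, only for illustration.\n",
"toy_train = ['oil prices rise', 'wheat prices fall']\n",
"toy_test = ['gold prices rise']\n",
"\n",
"toy_vectorizer = TfidfVectorizer()\n",
"\n",
"# fit_transform learns the vocabulary and IDF weights from the training documents.\n",
"toy_train_matrix = toy_vectorizer.fit_transform(toy_train)\n",
"\n",
"# transform reuses that vocabulary; 'gold' was never seen during fit, so it is ignored.\n",
"toy_test_matrix = toy_vectorizer.transform(toy_test)\n",
"\n",
"print(sorted(toy_vectorizer.vocabulary_))\n",
"print(toy_train_matrix.shape, toy_test_matrix.shape)\n",
"```\n",
"\n",
"Both matrices have the same number of columns because the test documents are projected onto the vocabulary learned from the training documents."
]
},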
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer(tokenizer = tokenize, stop_words = 'english')\n",
"\n",
"vectorised_train_documents = vectorizer.fit_transform(train_documents)\n",
"vectorised_test_documents = vectorizer.transform(test_documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the **efficient implementation** of machine learning algorithms, many machine learning algorithms **requires all input variables and output variables to be numeric**. This means that categorical data must be converted to a numerical form.\n",
"\n",
"For this purpose, I used `MultiLabelBinarizer` from `sklearn.preprocessing`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MultiLabelBinarizer\n",
"\n",
"mlb = MultiLabelBinarizer()\n",
"train_labels = mlb.fit_transform(train_categories)\n",
"test_labels = mlb.transform(test_categories)"
]
},
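{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the label encoding concrete, here is a small sketch on made-up label sets (the `toy_*` names and the category lists are assumptions for illustration, not taken from the Reuters data as used above).\n",
"\n",
"```python\n",
"from sklearn.preprocessing import MultiLabelBinarizer\n",
"\n",
"# Hypothetical label sets, one list per document.\n",
"toy_categories = [['grain', 'wheat'], ['crude'], ['grain']]\n",
"\n",
"toy_mlb = MultiLabelBinarizer()\n",
"toy_labels = toy_mlb.fit_transform(toy_categories)\n",
"\n",
"# classes_ gives the column order of the indicator matrix.\n",
"print(toy_mlb.classes_)\n",
"# One row per document, one column per label.\n",
"print(toy_labels)\n",
"# inverse_transform maps indicator rows back to label tuples.\n",
"print(toy_mlb.inverse_transform(toy_labels))\n",
"```\n",
"\n",
"The same `inverse_transform` call is what turns predictions back into category names at the end of this notebook."
]
},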
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, To **train** the classifier, I used `LinearSVC` in combination with the `OneVsRestClassifier` function in the scikit-learn package.\n",
"\n",
"The strategy of `OneVsRestClassifier` is of **fitting one classifier per label** and the `OneVsRestClassifier` can efficiently do this task and also outputs are easy to interpret. Since each label is represented by **one and only one classifier**, it is possible to gain knowledge about the label by inspecting its corresponding classifier. [OneVsRestClassifier](https://scikit-learn.cn/stable/modules/multiclass.html#one-vs-the-rest)\n",
"\n",
"The reason I combined `LinearSVC` with `OneVsRestClassifier` is because `LinearSVC` supports **Multi-class**, while we want to perform **Multi-label** classification."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"from sklearn.multiclass import OneVsRestClassifier\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"classifier = OneVsRestClassifier(LinearSVC())\n",
"classifier.fit(vectorised_train_documents, train_labels)"
]
},
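{
"cell_type": "markdown",
"metadata": {},
"source": [
"The one-classifier-per-label idea can be checked on a tiny example. This is only a sketch under assumed toy data (the arrays and `toy_*` names are hypothetical); it shows that fitting on an indicator matrix with three label columns yields three underlying `LinearSVC` estimators.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.multiclass import OneVsRestClassifier\n",
"from sklearn.svm import LinearSVC\n",
"\n",
"# Hypothetical toy data: 6 samples, 2 features, 3 labels as an indicator matrix.\n",
"toy_X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 0.], [0., 2.]])\n",
"toy_Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 0]])\n",
"\n",
"toy_clf = OneVsRestClassifier(LinearSVC())\n",
"toy_clf.fit(toy_X, toy_Y)\n",
"\n",
"# One fitted binary classifier per label column.\n",
"print(len(toy_clf.estimators_))\n",
"# Predictions come back as an indicator matrix of the same width.\n",
"print(toy_clf.predict(toy_X).shape)\n",
"```"
]
},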
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After fitting the classifier, I decided to use `cross_val_score` to **measure score** of the classifier by **cross validation** on the training data. But the only problem was, I wanted to **shuffle** data to use with `cross_val_score`, but it does not support shuffle argument.\n",
"\n",
"So, I decided to use `KFold` with `cross_val_score` as `KFold` supports shuffling the data.\n",
"\n",
"I also enabled `random_state`, because `random_state` will guarantee the same output in each run. By setting the `random_state`, it is guaranteed that the pseudorandom number generator will generate the same sequence of random integers each time, which in turn will affect the split.\n",
"\n",
"Why **42**?\n",
"- [Why '42' is the preferred number when indicating something random?](https://softwareengineering.stackexchange.com/questions/507/why-42-is-the-preferred-number-when-indicating-something-random)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"\n",
"kf = KFold(n_splits=10, random_state = 42, shuffle = True)\n",
"scores = cross_val_score(classifier, vectorised_train_documents, train_labels, cv = kf)"
]
},
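{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of `random_state` on the split can be checked directly. A minimal sketch, assuming a toy array of ten items and a hypothetical helper `first_test_fold` that is not part of the pipeline above:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import KFold\n",
"\n",
"toy_data = np.arange(10)\n",
"\n",
"def first_test_fold(seed):\n",
"    # Return the test indices of the first fold for a given seed.\n",
"    splitter = KFold(n_splits=5, shuffle=True, random_state=seed)\n",
"    _, test_idx = next(splitter.split(toy_data))\n",
"    return test_idx.tolist()\n",
"\n",
"# The same seed gives the same fold on every call.\n",
"print(first_test_fold(42) == first_test_fold(42))\n",
"```"
]
},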
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validation scores: [0.83655084 0.86743887 0.8043758 0.83011583 0.83655084 0.81724582\n",
" 0.82754183 0.8030888 0.80694981 0.82731959]\n",
"Cross-validation accuracy: 0.8257 (+/- 0.0368)\n"
]
}
],
"source": [
"print('Cross-validation scores:', scores)\n",
"print('Cross-validation accuracy: {:.4f} (+/- {:.4f})'.format(scores.mean(), scores.std() * 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the end, I used different methods (`accuracy_score`, `precision_score`, `recall_score`, `f1_score` and `confusion_matrix`) provided by scikit-learn **to evaluate** the classifier. (both *Macro-* and *Micro-averages*)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix\n",
"\n",
"predictions = classifier.predict(vectorised_test_documents)\n",
"\n",
"accuracy = accuracy_score(test_labels, predictions)\n",
"\n",
"macro_precision = precision_score(test_labels, predictions, average='macro')\n",
"macro_recall = recall_score(test_labels, predictions, average='macro')\n",
"macro_f1 = f1_score(test_labels, predictions, average='macro')\n",
"\n",
"micro_precision = precision_score(test_labels, predictions, average='micro')\n",
"micro_recall = recall_score(test_labels, predictions, average='micro')\n",
"micro_f1 = f1_score(test_labels, predictions, average='micro')\n",
"\n",
"cm = confusion_matrix(test_labels.argmax(axis = 1), predictions.argmax(axis = 1))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.8099\n",
"Precision:\n",
"- Macro: 0.6076\n",
"- Micro: 0.9471\n",
"Recall:\n",
"- Macro: 0.3708\n",
"- Micro: 0.7981\n",
"F1-measure:\n",
"- Macro: 0.4410\n",
"- Micro: 0.8662\n"
]
}
],
"source": [
"print(\"Accuracy: {:.4f}\\nPrecision:\\n- Macro: {:.4f}\\n- Micro: {:.4f}\\nRecall:\\n- Macro: {:.4f}\\n- Micro: {:.4f}\\nF1-measure:\\n- Macro: {:.4f}\\n- Micro: {:.4f}\".format(accuracy, macro_precision, micro_precision, macro_recall, micro_recall, macro_f1, micro_f1))"
]
},
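{
"cell_type": "markdown",
"metadata": {},
"source": [
"Macro- and micro-averages can differ a lot when labels are imbalanced. The sketch below computes both by hand for recall on invented counts (the label names and numbers are hypothetical, not taken from the results above). A gap like the one reported above would be consistent with rare labels being predicted less well than frequent ones, although the notebook does not verify this.\n",
"\n",
"```python\n",
"# Hypothetical per-label counts: (true positives, false negatives).\n",
"toy_counts = {\n",
"    'frequent_label': (90, 10),  # 100 positive examples\n",
"    'rare_label': (1, 9),        # 10 positive examples\n",
"}\n",
"\n",
"# Macro: compute recall per label, then average the per-label values.\n",
"per_label_recall = [tp / (tp + fn) for tp, fn in toy_counts.values()]\n",
"macro_recall_toy = sum(per_label_recall) / len(per_label_recall)\n",
"\n",
"# Micro: pool the counts over all labels, then compute recall once.\n",
"total_tp = sum(tp for tp, _ in toy_counts.values())\n",
"total_fn = sum(fn for _, fn in toy_counts.values())\n",
"micro_recall_toy = total_tp / (total_tp + total_fn)\n",
"\n",
"print('macro: {:.4f}, micro: {:.4f}'.format(macro_recall_toy, micro_recall_toy))\n",
"```\n",
"\n",
"The macro-average weights every label equally, so a single poorly predicted rare label pulls it down, while the micro-average is dominated by the frequent labels."
]
},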
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In below cell, I used `matplotlib.pyplot` to **plot the confusion matrix** (of first *few results only* to keep the readings readable) using `heatmap` of `seaborn`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x24d8cf39f28>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sb\n",
"import pandas as pd\n",
"\n",
"cm_plt = pd.DataFrame(cm[:73])\n",
"\n",
"plt.figure(figsize = (25, 25))\n",
"ax = plt.axes()\n",
"\n",
"sb.heatmap(cm_plt, annot=True)\n",
"\n",
"ax.xaxis.set_ticks_position('top')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I took the data from [Coconut - Wikipedia](https://en.wikipedia.org/wiki/Coconut) to check if the classifier is able to **correctly** predict the label(s) or not.\n",
"\n",
"And here is the output:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Example labels: [('coconut', 'oilseed')]\n"
]
}
],
"source": [
"example_text = '''The coconut tree (Cocos nucifera) is a member of the family Arecaceae (palm family) and the only species of the genus Cocos.\n",
"The term coconut can refer to the whole coconut palm or the seed, or the fruit, which, botanically, is a drupe, not a nut.\n",
"The spelling cocoanut is an archaic form of the word.\n",
"The term is derived from the 16th-century Portuguese and Spanish word coco meaning \"head\" or \"skull\", from the three indentations on the coconut shell that resemble facial features.\n",
"Coconuts are known for their versatility ranging from food to cosmetics.\n",
"They form a regular part of the diets of many people in the tropics and subtropics.\n",
"Coconuts are distinct from other fruits for their endosperm containing a large quantity of water (also called \"milk\"), and when immature, may be harvested for the potable coconut water.\n",
"When mature, they can be used as seed nuts or processed for oil, charcoal from the hard shell, and coir from the fibrous husk.\n",
"When dried, the coconut flesh is called copra.\n",
"The oil and milk derived from it are commonly used in cooking and frying, as well as in soaps and cosmetics.\n",
"The husks and leaves can be used as material to make a variety of products for furnishing and decorating.\n",
"The coconut also has cultural and religious significance in certain societies, particularly in India, where it is used in Hindu rituals.'''\n",
"\n",
"example_preds = classifier.predict(vectorizer.transform([example_text]))\n",
"example_labels = mlb.inverse_transform(example_preds)\n",
"print(\"Example labels: {}\".format(example_labels))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}