{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Named Entity Recognition with Conditional Random Fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the classic challenges of Natural Language Processing is sequence labelling. In sequence labelling, the goal is to label each word in a text with a word class. In part-of-speech tagging, these word classes are parts of speech, such as noun or verb. In named entity recognition (NER), they're types of generic named entities, such as locations, people or organizations, or more specialized entities, such as diseases or symptoms in the healthcare domain. In this way, sequence labelling can help us extract the most important information from a text and improve the performance of analytics, search or matching applications. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we'll explore Conditional Random Fields, the most popular approach to sequence labelling before Deep Learning arrived. Deep Learning may get all the attention right now, but Conditional Random Fields are still a powerful tool to build a simple sequence labeller. \n",
"\n",
"The tool we're going to use is `sklearn-crfsuite`. This is a wrapper around `python-crfsuite`, which itself is a Python binding of [CRFSuite](http://www.chokkan.org/software/crfsuite/). The reason we're using `sklearn-crfsuite` is that it provides a number of handy utility functions, for example for evaluating the output of the model. You can install it with `pip install sklearn-crfsuite`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we get some data. A well-known data set for training and testing NER models is the CoNLL-2002 data, which has Spanish and Dutch texts labelled with four types of entities: locations (LOC), persons (PER), organizations (ORG) and miscellaneous entities (MISC). Both corpora are split into three portions: a training portion and two smaller test portions, one of which we'll use as development data. It's easy to collect the data from NLTK, provided the corpus has been downloaded first with `nltk.download('conll2002')`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"import sklearn\n",
"from sklearn.metrics import classification_report, confusion_matrix\n",
"from sklearn.preprocessing import LabelBinarizer\n",
"import sklearn_crfsuite as crfsuite\n",
"from sklearn_crfsuite import metrics"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"train_sents = list(nltk.corpus.conll2002.iob_sents('ned.train'))\n",
"dev_sents = list(nltk.corpus.conll2002.iob_sents('ned.testa'))\n",
"test_sents = list(nltk.corpus.conll2002.iob_sents('ned.testb'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data consists of a list of tokenized sentences. For each of the tokens we have the string itself, its part-of-speech tag and its entity tag, which follows the BIO convention. In the deep learning world we live in today, it's common to ignore the part-of-speech tags. However, since CRFs rely on good feature extraction, we'll gladly make use of this information. After all, the part of speech of a word tells us a lot about its possible status as a named entity: nouns will more often be entities than verbs, for example."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('De', 'Art', 'O'),\n",
" ('tekst', 'N', 'O'),\n",
" ('van', 'Prep', 'O'),\n",
" ('het', 'Art', 'O'),\n",
" ('arrest', 'N', 'O'),\n",
" ('is', 'V', 'O'),\n",
" ('nog', 'Adv', 'O'),\n",
" ('niet', 'Adv', 'O'),\n",
" ('schriftelijk', 'Adj', 'O'),\n",
" ('beschikbaar', 'Adj', 'O'),\n",
" ('maar', 'Conj', 'O'),\n",
" ('het', 'Art', 'O'),\n",
" ('bericht', 'N', 'O'),\n",
" ('werd', 'V', 'O'),\n",
" ('alvast', 'Adv', 'O'),\n",
" ('bekendgemaakt', 'V', 'O'),\n",
" ('door', 'Prep', 'O'),\n",
" ('een', 'Art', 'O'),\n",
" ('communicatiebureau', 'N', 'O'),\n",
" ('dat', 'Conj', 'O'),\n",
" ('Floralux', 'N', 'B-ORG'),\n",
" ('inhuurde', 'V', 'O'),\n",
" ('.', 'Punc', 'O')]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_sents[0]"
]
},
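{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the BIO convention concrete, the sketch below decodes a tag sequence into entity spans. It is an illustration only and not part of the pipeline in this notebook: `bio_to_spans` is a hypothetical helper of our own, and treating a stray `I-` tag as the start of a new entity is an assumption.\n",
"\n",
"```python\n",
"def bio_to_spans(tags):\n",
"    # Decode BIO tags into (entity_type, start, end) spans, with end exclusive.\n",
"    spans = []\n",
"    start, current = None, None\n",
"    for i, tag in enumerate(tags):\n",
"        prefix, _, entity = tag.partition('-')\n",
"        # Does this tag continue the entity that is currently open?\n",
"        continues = prefix == 'I' and entity == current\n",
"        if current is not None and not continues:\n",
"            spans.append((current, start, i))\n",
"            start, current = None, None\n",
"        # B- always opens an entity; a stray I- also opens one (assumption).\n",
"        if prefix == 'B' or (prefix == 'I' and current is None):\n",
"            start, current = i, entity\n",
"    # Close an entity that runs until the end of the sentence.\n",
"    if current is not None:\n",
"        spans.append((current, start, len(tags)))\n",
"    return spans\n",
"\n",
"\n",
"print(bio_to_spans(['O', 'B-PER', 'I-PER', 'O', 'B-LOC']))\n",
"```\n",
"\n",
"Each span is a tuple of the entity type, the index of its first token and the index just past its last token."
]
},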
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whereas today neural networks are expected to learn the relevant features of the input texts themselves, this is very different with Conditional Random Fields. CRFs learn the relationship between the features we give them and the label of a token in a given context. They're not going to earn these features themselves. Instead, the quality of the model will depend highly on the relevance of the features we show it. \n",
"\n",
"The most important method in this tutorial is therefore the one that collects the features for every token. What information could be useful? The word itself, of course, together with its part of speech tag. It can also be interesting to know whether the word is completely uppercase, whether it starts with a capital or is a digit. In addition, we also take a look at the character bigram and trigram the word ends with. We also give every token a `bias` feature, which always has the same value. This bias feature helps the CRF learn the relative frequency of each label type in the training data.\n",
"\n",
"To give the CRF more information about the meaning of a word, we also introduce information from word embeddings. In our [Word Embedding notebook](https://github.com/nlptown/nlp-notebooks/blob/master/An%20Introduction%20to%20Word%20Embeddings.ipynb), we trained word embeddings on Dutch Wikipedia and clustered them in 500 clusters. Here we'll read these 500 clusters from a file, and map each word to the id of the cluster it is in. This is really useful for Named Entity Recognition, as most entity types cluster together. This allows CRFs to generalize above the word level. For example, when the CRF encounters a word it has never seen (say, *Albania*), it can base its decision on the cluster the word is in. If this cluster contains many other entities the CRF has met in its training data (say, *Italy*, *Germany* and *France*), it will have learnt a string link between this cluster and a specific entity type. As a result, it can still assign that entity type to the unknown word. In our experiments, this feature alone boosts the performance with around 3%. \n",
"\n",
"Finally, apart from the token itself, we also want the CRF to look at its context. More specifically, we're going to give it some extra information about the two words to the left and the right of the targt word. We'll tell the CRF what these words are, whether they start with a capital or are completely uppercase, and give it their part-of-speech tag. If there is no left or right context, we'll inform the CRF that the token is at the beginning or end of the sentence (`BOS` or `EOS`). "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def read_clusters(cluster_file):\n",
"    # Each line holds a word and its cluster id, separated by a tab.\n",
"    # We assume the cluster file is UTF-8 encoded.\n",
"    word2cluster = {}\n",
"    with open(cluster_file, encoding='utf-8') as f:\n",
"        for line in f:\n",
"            word, cluster = line.strip().split('\\t')\n",
"            word2cluster[word] = cluster\n",
"    return word2cluster\n",
"\n",
"\n",
"def word2features(sent, i, word2cluster):\n",
"    word = sent[i][0]\n",
"    postag = sent[i][1]\n",
"    # Features of the token itself. Words without a cluster share the fallback id '0'.\n",
"    features = [\n",
"        'bias',\n",
"        'word.lower=' + word.lower(),\n",
"        'word[-3:]=' + word[-3:],\n",
"        'word[-2:]=' + word[-2:],\n",
"        'word.isupper=%s' % word.isupper(),\n",
"        'word.istitle=%s' % word.istitle(),\n",
"        'word.isdigit=%s' % word.isdigit(),\n",
"        'word.cluster=%s' % word2cluster.get(word.lower(), '0'),\n",
"        'postag=' + postag\n",
"    ]\n",
"\n",
"    # Left context: the previous token, or a marker for the start of the sentence.\n",
"    if i > 0:\n",
"        word1 = sent[i-1][0]\n",
"        postag1 = sent[i-1][1]\n",
"        features.extend([\n",
"            '-1:word.lower=' + word1.lower(),\n",
"            '-1:word.istitle=%s' % word1.istitle(),\n",
"            '-1:word.isupper=%s' % word1.isupper(),\n",
"            '-1:postag=' + postag1\n",
"        ])\n",
"    else:\n",
"        features.append('BOS')\n",
"\n",
"    if i > 1:\n",
"        word2 = sent[i-2][0]\n",
"        postag2 = sent[i-2][1]\n",
"        features.extend([\n",
"            '-2:word.lower=' + word2.lower(),\n",
"            '-2:word.istitle=%s' % word2.istitle(),\n",
"            '-2:word.isupper=%s' % word2.isupper(),\n",
"            '-2:postag=' + postag2\n",
"        ])\n",
"\n",
"    # Right context: the next token, or a marker for the end of the sentence.\n",
"    if i < len(sent)-1:\n",
"        word1 = sent[i+1][0]\n",
"        postag1 = sent[i+1][1]\n",
"        features.extend([\n",
"            '+1:word.lower=' + word1.lower(),\n",
"            '+1:word.istitle=%s' % word1.istitle(),\n",
"            '+1:word.isupper=%s' % word1.isupper(),\n",
"            '+1:postag=' + postag1\n",
"        ])\n",
"    else:\n",
"        features.append('EOS')\n",
"\n",
"    if i < len(sent)-2:\n",
"        word2 = sent[i+2][0]\n",
"        postag2 = sent[i+2][1]\n",
"        features.extend([\n",
"            '+2:word.lower=' + word2.lower(),\n",
"            '+2:word.istitle=%s' % word2.istitle(),\n",
"            '+2:word.isupper=%s' % word2.isupper(),\n",
"            '+2:postag=' + postag2\n",
"        ])\n",
"\n",
"    return features\n",
"\n",
"\n",
"def sent2features(sent, word2cluster):\n",
"    return [word2features(sent, i, word2cluster) for i in range(len(sent))]\n",
"\n",
"def sent2labels(sent):\n",
"    return [label for token, postag, label in sent]\n",
"\n",
"def sent2tokens(sent):\n",
"    return [token for token, postag, label in sent]\n",
"\n",
"word2cluster = read_clusters(\"data/embeddings/clusters_nl.tsv\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['bias',\n",
" 'word.lower=de',\n",
" 'word[-3:]=De',\n",
" 'word[-2:]=De',\n",
" 'word.isupper=False',\n",
" 'word.istitle=True',\n",
" 'word.isdigit=False',\n",
" 'word.cluster=38',\n",
" 'postag=Art',\n",
" 'BOS',\n",
" '+1:word.lower=tekst',\n",
" '+1:word.istitle=False',\n",
" '+1:word.isupper=False',\n",
" '+1:postag=N',\n",
" '+2:word.lower=van',\n",
" '+2:word.istitle=False',\n",
" '+2:word.isupper=False',\n",
" '+2:postag=Prep']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sent2features(train_sents[0], word2cluster)[0]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"X_train = [sent2features(s, word2cluster) for s in train_sents]\n",
"y_train = [sent2labels(s) for s in train_sents]\n",
"\n",
"X_dev = [sent2features(s, word2cluster) for s in dev_sents]\n",
"y_dev = [sent2labels(s) for s in dev_sents]\n",
"\n",
"X_test = [sent2features(s, word2cluster) for s in test_sents]\n",
"y_test = [sent2labels(s) for s in test_sents]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now create a CRF model and train it. We'll use the standard [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) algorithm for our parameter estimation and run it for 100 iterations. When we're done, we save the model with `joblib`."
]
},
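{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is the model in its standard textbook form; the notation is ours and is not taken from the CRFsuite documentation. A linear-chain CRF scores a label sequence $y = (y_1, \\dots, y_T)$ for a sentence $x$ with weighted feature functions $f_k$ and normalizes over all possible label sequences:\n",
"\n",
"$$p(y \\mid x) = \\frac{1}{Z(x)} \\exp\\left(\\sum_{t=1}^{T} \\sum_{k} \\lambda_k f_k(y_{t-1}, y_t, x, t)\\right), \\qquad Z(x) = \\sum_{y'} \\exp\\left(\\sum_{t=1}^{T} \\sum_{k} \\lambda_k f_k(y'_{t-1}, y'_t, x, t)\\right)$$\n",
"\n",
"Training looks for the weights $\\lambda$ that minimize the regularized negative log-likelihood of the $N$ training sentences. Up to how the penalty terms are scaled, which we leave unspecified here, the objective has the form\n",
"\n",
"$$\\mathcal{L}(\\lambda) = -\\sum_{n=1}^{N} \\log p(y^{(n)} \\mid x^{(n)}) + c_1 \\lVert \\lambda \\rVert_1 + c_2 \\lVert \\lambda \\rVert_2^2$$\n",
"\n",
"where $c_1$ and $c_2$ are the L1 and L2 regularization coefficients that appear as `c1` and `c2` in the training log, and L-BFGS is the optimizer that carries out the minimization."
]
},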
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"loading training data to CRFsuite: 100%|██████████| 15806/15806 [00:02<00:00, 7623.17it/s]\n",
"loading dev data to CRFsuite: 27%|██▋ | 769/2895 [00:00<00:00, 7689.13it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"loading dev data to CRFsuite: 100%|██████████| 2895/2895 [00:00<00:00, 7186.08it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Holdout group: 2\n",
"\n",
"Feature generation\n",
"type: CRF1d\n",
"feature.minfreq: 0.000000\n",
"feature.possible_states: 0\n",
"feature.possible_transitions: 0\n",
"0....1....2....3....4....5....6....7....8....9....10\n",
"Number of features: 152117\n",
"Seconds required: 0.424\n",
"\n",
"L-BFGS optimization\n",
"c1: 0.000000\n",
"c2: 1.000000\n",
"num_memories: 6\n",
"max_iterations: 100\n",
"epsilon: 0.000010\n",
"stop: 10\n",
"delta: 0.000010\n",
"linesearch: MoreThuente\n",
"linesearch.max_iterations: 20\n",
"\n",
"Iter 1 time=0.37 loss=104214.83 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.00\n",
"Iter 2 time=0.21 loss=96997.81 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.13\n",
"Iter 3 time=0.21 loss=92085.38 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.26\n",
"Iter 4 time=0.21 loss=84277.67 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.51\n",
"Iter 5 time=0.21 loss=67577.53 active=152117 precision=0.169 recall=0.113 F1=0.109 Acc(item/seq)=0.902 0.496 feature_norm=2.32\n",
"Iter 6 time=0.21 loss=47854.26 active=152117 precision=0.326 recall=0.347 F1=0.320 Acc(item/seq)=0.930 0.580 feature_norm=4.34\n",
"Iter 7 time=0.21 loss=43326.19 active=152117 precision=0.340 recall=0.365 F1=0.333 Acc(item/seq)=0.933 0.592 feature_norm=5.06\n",
"Iter 8 time=0.21 loss=38617.07 active=152117 precision=0.372 recall=0.399 F1=0.362 Acc(item/seq)=0.938 0.618 feature_norm=6.43\n",
"Iter 9 time=0.21 loss=35511.85 active=152117 precision=0.491 recall=0.442 F1=0.421 Acc(item/seq)=0.942 0.631 feature_norm=8.62\n",
"Iter 10 time=0.21 loss=32735.31 active=152117 precision=0.535 recall=0.456 F1=0.445 Acc(item/seq)=0.944 0.654 feature_norm=9.50\n",
"Iter 11 time=0.21 loss=31687.17 active=152117 precision=0.530 recall=0.491 F1=0.502 Acc(item/seq)=0.947 0.663 feature_norm=10.94\n",
"Iter 12 time=0.21 loss=29323.19 active=152117 precision=0.532 recall=0.482 F1=0.477 Acc(item/seq)=0.946 0.666 feature_norm=11.18\n",
"Iter 13 time=0.21 loss=28733.55 active=152117 precision=0.596 recall=0.489 F1=0.489 Acc(item/seq)=0.947 0.668 feature_norm=11.58\n",
"Iter 14 time=0.21 loss=27120.69 active=152117 precision=0.640 recall=0.507 F1=0.512 Acc(item/seq)=0.948 0.677 feature_norm=12.36\n",
"Iter 15 time=0.21 loss=24849.05 active=152117 precision=0.640 recall=0.547 F1=0.558 Acc(item/seq)=0.952 0.697 feature_norm=13.86\n",
"Iter 16 time=0.40 loss=24033.40 active=152117 precision=0.654 recall=0.580 F1=0.586 Acc(item/seq)=0.954 0.706 feature_norm=14.53\n",
"Iter 17 time=0.21 loss=22935.94 active=152117 precision=0.669 recall=0.578 F1=0.598 Acc(item/seq)=0.955 0.712 feature_norm=15.15\n",
"Iter 18 time=0.21 loss=21803.53 active=152117 precision=0.682 recall=0.584 F1=0.605 Acc(item/seq)=0.956 0.713 feature_norm=15.67\n",
"Iter 19 time=0.21 loss=21046.75 active=152117 precision=0.725 recall=0.565 F1=0.586 Acc(item/seq)=0.956 0.717 feature_norm=16.04\n",
"Iter 20 time=0.21 loss=20465.96 active=152117 precision=0.701 recall=0.566 F1=0.594 Acc(item/seq)=0.956 0.711 feature_norm=15.81\n",
"Iter 21 time=0.21 loss=19991.29 active=152117 precision=0.706 recall=0.586 F1=0.611 Acc(item/seq)=0.957 0.717 feature_norm=15.63\n",
"Iter 22 time=0.21 loss=19560.23 active=152117 precision=0.684 recall=0.597 F1=0.616 Acc(item/seq)=0.957 0.722 feature_norm=15.58\n",
"Iter 23 time=0.21 loss=19241.14 active=152117 precision=0.672 recall=0.602 F1=0.615 Acc(item/seq)=0.957 0.726 feature_norm=15.65\n",
"Iter 24 time=0.21 loss=18787.87 active=152117 precision=0.678 recall=0.627 F1=0.637 Acc(item/seq)=0.958 0.731 feature_norm=16.31\n",
"Iter 25 time=0.21 loss=18145.13 active=152117 precision=0.690 recall=0.625 F1=0.640 Acc(item/seq)=0.959 0.734 feature_norm=16.94\n",
"Iter 26 time=0.21 loss=17786.38 active=152117 precision=0.710 recall=0.621 F1=0.642 Acc(item/seq)=0.959 0.738 feature_norm=17.48\n",
"Iter 27 time=0.21 loss=17247.02 active=152117 precision=0.711 recall=0.625 F1=0.649 Acc(item/seq)=0.959 0.736 feature_norm=18.46\n",
"Iter 28 time=0.21 loss=16876.01 active=152117 precision=0.737 recall=0.627 F1=0.655 Acc(item/seq)=0.960 0.748 feature_norm=19.83\n",
"Iter 29 time=0.21 loss=16543.87 active=152117 precision=0.732 recall=0.631 F1=0.663 Acc(item/seq)=0.961 0.751 feature_norm=20.12\n",
"Iter 30 time=0.21 loss=16263.21 active=152117 precision=0.725 recall=0.644 F1=0.671 Acc(item/seq)=0.962 0.753 feature_norm=20.42\n",
"Iter 31 time=0.21 loss=15665.78 active=152117 precision=0.715 recall=0.661 F1=0.676 Acc(item/seq)=0.963 0.758 feature_norm=21.50\n",
"Iter 32 time=0.21 loss=15247.34 active=152117 precision=0.700 recall=0.641 F1=0.650 Acc(item/seq)=0.961 0.748 feature_norm=23.18\n",
"Iter 33 time=0.21 loss=14866.51 active=152117 precision=0.702 recall=0.658 F1=0.670 Acc(item/seq)=0.963 0.754 feature_norm=24.67\n",
"Iter 34 time=0.21 loss=14650.96 active=152117 precision=0.704 recall=0.659 F1=0.675 Acc(item/seq)=0.963 0.755 feature_norm=25.38\n",
"Iter 35 time=0.21 loss=14386.98 active=152117 precision=0.730 recall=0.677 F1=0.695 Acc(item/seq)=0.964 0.761 feature_norm=26.47\n",
"Iter 36 time=0.21 loss=14158.49 active=152117 precision=0.742 recall=0.681 F1=0.704 Acc(item/seq)=0.965 0.763 feature_norm=28.58\n",
"Iter 37 time=0.21 loss=13895.30 active=152117 precision=0.736 recall=0.684 F1=0.701 Acc(item/seq)=0.965 0.765 feature_norm=28.72\n",
"Iter 38 time=0.21 loss=13656.45 active=152117 precision=0.730 recall=0.683 F1=0.695 Acc(item/seq)=0.965 0.768 feature_norm=29.07\n",
"Iter 39 time=0.21 loss=13499.28 active=152117 precision=0.727 recall=0.680 F1=0.691 Acc(item/seq)=0.965 0.769 feature_norm=29.61\n",
"Iter 40 time=0.21 loss=13174.95 active=152117 precision=0.726 recall=0.677 F1=0.689 Acc(item/seq)=0.965 0.766 feature_norm=31.03\n",
"Iter 41 time=0.21 loss=13104.00 active=152117 precision=0.736 recall=0.662 F1=0.678 Acc(item/seq)=0.964 0.760 feature_norm=33.06\n",
"Iter 42 time=0.21 loss=12750.79 active=152117 precision=0.731 recall=0.685 F1=0.701 Acc(item/seq)=0.966 0.764 feature_norm=34.35\n",
"Iter 43 time=0.21 loss=12637.02 active=152117 precision=0.740 recall=0.690 F1=0.708 Acc(item/seq)=0.966 0.766 feature_norm=34.63\n",
"Iter 44 time=0.21 loss=12534.20 active=152117 precision=0.745 recall=0.692 F1=0.712 Acc(item/seq)=0.966 0.766 feature_norm=35.30\n",
"Iter 45 time=0.21 loss=12390.89 active=152117 precision=0.739 recall=0.682 F1=0.702 Acc(item/seq)=0.966 0.761 feature_norm=37.15\n",
"Iter 46 time=0.21 loss=12277.42 active=152117 precision=0.733 recall=0.689 F1=0.704 Acc(item/seq)=0.966 0.763 feature_norm=37.58\n",
"Iter 47 time=0.21 loss=12219.03 active=152117 precision=0.739 recall=0.690 F1=0.706 Acc(item/seq)=0.966 0.767 feature_norm=37.20\n",
"Iter 48 time=0.21 loss=12125.64 active=152117 precision=0.743 recall=0.695 F1=0.711 Acc(item/seq)=0.967 0.770 feature_norm=36.99\n",
"Iter 49 time=0.21 loss=11970.65 active=152117 precision=0.745 recall=0.697 F1=0.712 Acc(item/seq)=0.967 0.775 feature_norm=36.84\n",
"Iter 50 time=0.21 loss=11780.73 active=152117 precision=0.754 recall=0.696 F1=0.718 Acc(item/seq)=0.968 0.777 feature_norm=37.57\n",
"Iter 51 time=0.21 loss=11623.83 active=152117 precision=0.740 recall=0.691 F1=0.710 Acc(item/seq)=0.967 0.774 feature_norm=38.21\n",
"Iter 52 time=0.21 loss=11549.38 active=152117 precision=0.740 recall=0.690 F1=0.709 Acc(item/seq)=0.967 0.773 feature_norm=38.94\n",
"Iter 53 time=0.21 loss=11497.85 active=152117 precision=0.739 recall=0.696 F1=0.713 Acc(item/seq)=0.967 0.772 feature_norm=39.54\n",
"Iter 54 time=0.21 loss=11419.64 active=152117 precision=0.733 recall=0.692 F1=0.707 Acc(item/seq)=0.966 0.773 feature_norm=40.40\n",
"Iter 55 time=0.21 loss=11280.29 active=152117 precision=0.744 recall=0.707 F1=0.719 Acc(item/seq)=0.967 0.773 feature_norm=41.75\n",
"Iter 56 time=0.21 loss=11131.39 active=152117 precision=0.746 recall=0.710 F1=0.722 Acc(item/seq)=0.968 0.773 feature_norm=42.72\n",
"Iter 57 time=0.21 loss=11043.40 active=152117 precision=0.751 recall=0.713 F1=0.726 Acc(item/seq)=0.968 0.774 feature_norm=42.92\n",
"Iter 58 time=0.21 loss=10954.38 active=152117 precision=0.769 recall=0.713 F1=0.736 Acc(item/seq)=0.969 0.781 feature_norm=43.18\n",
"Iter 59 time=0.21 loss=10836.31 active=152117 precision=0.773 recall=0.713 F1=0.736 Acc(item/seq)=0.969 0.781 feature_norm=43.74\n",
"Iter 60 time=0.21 loss=10712.24 active=152117 precision=0.779 recall=0.719 F1=0.744 Acc(item/seq)=0.970 0.788 feature_norm=44.41\n",
"Iter 61 time=0.21 loss=10602.81 active=152117 precision=0.789 recall=0.709 F1=0.740 Acc(item/seq)=0.970 0.789 feature_norm=44.68\n",
"Iter 62 time=0.21 loss=10508.84 active=152117 precision=0.782 recall=0.711 F1=0.739 Acc(item/seq)=0.970 0.787 feature_norm=45.37\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iter 63 time=0.21 loss=10458.88 active=152117 precision=0.783 recall=0.717 F1=0.744 Acc(item/seq)=0.970 0.788 feature_norm=45.51\n",
"Iter 64 time=0.21 loss=10420.78 active=152117 precision=0.763 recall=0.711 F1=0.730 Acc(item/seq)=0.969 0.786 feature_norm=45.67\n",
"Iter 65 time=0.21 loss=10315.28 active=152117 precision=0.766 recall=0.721 F1=0.735 Acc(item/seq)=0.969 0.786 feature_norm=46.30\n",
"Iter 66 time=0.21 loss=10204.10 active=152117 precision=0.769 recall=0.728 F1=0.740 Acc(item/seq)=0.970 0.786 feature_norm=47.10\n",
"Iter 67 time=0.21 loss=10134.54 active=152117 precision=0.769 recall=0.716 F1=0.737 Acc(item/seq)=0.970 0.787 feature_norm=47.87\n",
"Iter 68 time=0.21 loss=10095.10 active=152117 precision=0.773 recall=0.718 F1=0.741 Acc(item/seq)=0.970 0.787 feature_norm=47.85\n",
"Iter 69 time=0.21 loss=10059.52 active=152117 precision=0.773 recall=0.715 F1=0.738 Acc(item/seq)=0.970 0.790 feature_norm=47.81\n",
"Iter 70 time=0.21 loss=10012.53 active=152117 precision=0.767 recall=0.712 F1=0.733 Acc(item/seq)=0.970 0.790 feature_norm=47.95\n",
"Iter 71 time=0.21 loss=9931.41 active=152117 precision=0.765 recall=0.710 F1=0.731 Acc(item/seq)=0.970 0.791 feature_norm=48.49\n",
"Iter 72 time=0.21 loss=9861.45 active=152117 precision=0.763 recall=0.709 F1=0.730 Acc(item/seq)=0.970 0.793 feature_norm=49.07\n",
"Iter 73 time=0.21 loss=9803.75 active=152117 precision=0.769 recall=0.723 F1=0.741 Acc(item/seq)=0.970 0.796 feature_norm=49.32\n",
"Iter 74 time=0.21 loss=9762.61 active=152117 precision=0.758 recall=0.720 F1=0.733 Acc(item/seq)=0.970 0.793 feature_norm=49.58\n",
"Iter 75 time=0.21 loss=9726.47 active=152117 precision=0.761 recall=0.722 F1=0.736 Acc(item/seq)=0.970 0.790 feature_norm=49.80\n",
"Iter 76 time=0.21 loss=9628.97 active=152117 precision=0.764 recall=0.722 F1=0.736 Acc(item/seq)=0.970 0.789 feature_norm=50.47\n",
"Iter 77 time=0.40 loss=9586.61 active=152117 precision=0.763 recall=0.725 F1=0.740 Acc(item/seq)=0.971 0.791 feature_norm=50.96\n",
"Iter 78 time=0.21 loss=9522.00 active=152117 precision=0.767 recall=0.723 F1=0.741 Acc(item/seq)=0.970 0.788 feature_norm=51.34\n",
"Iter 79 time=0.21 loss=9479.87 active=152117 precision=0.765 recall=0.720 F1=0.736 Acc(item/seq)=0.970 0.788 feature_norm=51.77\n",
"Iter 80 time=0.21 loss=9448.33 active=152117 precision=0.771 recall=0.720 F1=0.739 Acc(item/seq)=0.970 0.789 feature_norm=51.80\n",
"Iter 81 time=0.21 loss=9423.55 active=152117 precision=0.769 recall=0.723 F1=0.740 Acc(item/seq)=0.970 0.789 feature_norm=51.83\n",
"Iter 82 time=0.21 loss=9367.51 active=152117 precision=0.763 recall=0.720 F1=0.736 Acc(item/seq)=0.970 0.790 feature_norm=52.25\n",
"Iter 83 time=0.21 loss=9322.98 active=152117 precision=0.762 recall=0.722 F1=0.737 Acc(item/seq)=0.970 0.793 feature_norm=52.34\n",
"Iter 84 time=0.21 loss=9277.51 active=152117 precision=0.765 recall=0.725 F1=0.740 Acc(item/seq)=0.971 0.798 feature_norm=52.47\n",
"Iter 85 time=0.21 loss=9229.07 active=152117 precision=0.772 recall=0.726 F1=0.743 Acc(item/seq)=0.971 0.798 feature_norm=52.79\n",
"Iter 86 time=0.21 loss=9214.50 active=152117 precision=0.782 recall=0.729 F1=0.751 Acc(item/seq)=0.971 0.799 feature_norm=52.88\n",
"Iter 87 time=0.21 loss=9194.55 active=152117 precision=0.785 recall=0.731 F1=0.753 Acc(item/seq)=0.971 0.800 feature_norm=52.82\n",
"Iter 88 time=0.21 loss=9182.23 active=152117 precision=0.783 recall=0.728 F1=0.750 Acc(item/seq)=0.971 0.798 feature_norm=52.86\n",
"Iter 89 time=0.21 loss=9161.54 active=152117 precision=0.786 recall=0.726 F1=0.750 Acc(item/seq)=0.971 0.795 feature_norm=52.98\n",
"Iter 90 time=0.21 loss=9130.06 active=152117 precision=0.788 recall=0.728 F1=0.752 Acc(item/seq)=0.971 0.797 feature_norm=53.17\n",
"Iter 91 time=0.21 loss=9063.09 active=152117 precision=0.794 recall=0.738 F1=0.760 Acc(item/seq)=0.972 0.800 feature_norm=53.64\n",
"Iter 92 time=0.40 loss=9041.98 active=152117 precision=0.791 recall=0.739 F1=0.759 Acc(item/seq)=0.972 0.801 feature_norm=53.77\n",
"Iter 93 time=0.21 loss=9010.27 active=152117 precision=0.783 recall=0.736 F1=0.754 Acc(item/seq)=0.972 0.801 feature_norm=53.97\n",
"Iter 94 time=0.21 loss=8984.80 active=152117 precision=0.786 recall=0.740 F1=0.759 Acc(item/seq)=0.972 0.803 feature_norm=54.09\n",
"Iter 95 time=0.21 loss=8965.14 active=152117 precision=0.788 recall=0.740 F1=0.760 Acc(item/seq)=0.972 0.803 feature_norm=54.14\n",
"Iter 96 time=0.21 loss=8931.00 active=152117 precision=0.786 recall=0.737 F1=0.758 Acc(item/seq)=0.972 0.804 feature_norm=54.26\n",
"Iter 97 time=0.21 loss=8923.90 active=152117 precision=0.791 recall=0.737 F1=0.757 Acc(item/seq)=0.971 0.799 feature_norm=54.68\n",
"Iter 98 time=0.21 loss=8860.78 active=152117 precision=0.784 recall=0.735 F1=0.755 Acc(item/seq)=0.971 0.802 feature_norm=54.64\n",
"Iter 99 time=0.21 loss=8846.13 active=152117 precision=0.788 recall=0.737 F1=0.758 Acc(item/seq)=0.972 0.803 feature_norm=54.68\n",
"Iter 100 time=0.21 loss=8832.34 active=152117 precision=0.789 recall=0.737 F1=0.758 Acc(item/seq)=0.972 0.803 feature_norm=54.77\n",
"================================================\n",
"Label Precision Recall F1 Support\n",
"------- ----------- -------- ----- ---------\n",
"B-LOC 0.823 0.823 0.823 479\n",
"B-MISC 0.806 0.651 0.720 748\n",
"B-ORG 0.853 0.620 0.718 686\n",
"B-PER 0.752 0.855 0.800 703\n",
"I-LOC 0.583 0.547 0.565 64\n",
"I-MISC 0.645 0.507 0.568 215\n",
"I-ORG 0.806 0.684 0.740 396\n",
"I-PER 0.844 0.946 0.892 423\n",
"O 0.990 0.998 0.994 33973\n",
"------------------------------------------------\n",
"L-BFGS terminated with the maximum number of iterations\n",
"Total seconds required for training: 21.768\n",
"\n",
"Storing the model\n",
"Number of active features: 152117 (152117)\n",
"Number of active attributes: 130306 (145602)\n",
"Number of active labels: 9 (9)\n",
"Writing labels\n",
"Writing attributes\n",
"Writing feature references for transitions\n",
"Writing feature references for attributes\n",
"Seconds required: 0.058\n",
"\n"
]
},
{
"data": {
"text/plain": [
"CRF(algorithm='lbfgs', all_possible_states=None,\n",
" all_possible_transitions=None, averaging=None, c=None, c1=None, c2=None,\n",
" calibration_candidates=None, calibration_eta=None,\n",
" calibration_max_trials=None, calibration_rate=None,\n",
" calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,\n",
" gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,\n",
" max_linesearch=None, min_freq=None, model_filename=None,\n",
" num_memories=None, pa_type=None, period=None, trainer_cls=None,\n",
" variance=None, verbose='true')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crf = crfsuite.CRF(\n",
" verbose='true',\n",
" algorithm='lbfgs',\n",
" max_iterations=100\n",
")\n",
"\n",
"crf.fit(X_train, y_train, X_dev=X_dev, y_dev=y_dev)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['models/ner/crf_model']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import joblib\n",
"import os\n",
"\n",
"OUTPUT_PATH = \"models/ner/\"\n",
"OUTPUT_FILE = \"crf_model\"\n",
"\n",
"# Create the output directory, including any missing parent directories.\n",
"os.makedirs(OUTPUT_PATH, exist_ok=True)\n",
"\n",
"joblib.dump(crf, os.path.join(OUTPUT_PATH, OUTPUT_FILE))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's evaluate the output of our CRF. We'll load the model from the output file above and have it predict labels for the full test set.\n",
"\n",
"As a sanity check, let's take a look at its predictions for the first test sentence. This output looks pretty good: the CRF is able to predict all four locations in the sentence correctly. It only misses the person entity, which is a strange case anyway, because it is not actually a person name."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentence: Dat is in Italië , Spanje of Engeland misschien geen probleem , maar volgens ' Der Kaiser ' in Duitsland wel .\n",
"Predicted: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O\n",
"Correct: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O\n"
]
}
],
"source": [
"crf = joblib.load(os.path.join(OUTPUT_PATH, OUTPUT_FILE))\n",
"y_pred = crf.predict(X_test)\n",
"\n",
"example_sent = test_sents[0]\n",
"\n",
"print(\"Sentence:\", ' '.join(sent2tokens(example_sent)))\n",
"print(\"Predicted:\", ' '.join(crf.predict([sent2features(example_sent, word2cluster)])[0]))\n",
"print(\"Correct: \", ' '.join(sent2labels(example_sent)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we evaluate on the full test set. We'll print out a classification report for all labels except `O`. If we were to include `O`, which far outnumbers the entity labels in our data, the average scores would be inflated artificially, simply because there's an inherently high probability that the `O` labels from our CRF are correct. We obtain an average F-score of 77% (micro average) across all entity types, with particularly good results for `B-LOC` and `B-PER`. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"              precision    recall  f1-score   support\n",
"\n",
"       B-LOC       0.83      0.83      0.83       774\n",
"       I-LOC       0.29      0.41      0.34        49\n",
"      B-MISC       0.84      0.61      0.71      1187\n",
"      I-MISC       0.59      0.42      0.49       410\n",
"       B-ORG       0.80      0.69      0.74       882\n",
"       I-ORG       0.74      0.66      0.70       551\n",
"       B-PER       0.80      0.90      0.85      1098\n",
"       I-PER       0.87      0.95      0.91       807\n",
"\n",
"   micro avg       0.80      0.74      0.77      5758\n",
"   macro avg       0.72      0.68      0.70      5758\n",
"weighted avg       0.80      0.74      0.76      5758\n",
"\n"
]
}
],
"source": [
"labels = list(crf.classes_)\n",
"labels.remove(\"O\")\n",
"y_pred = crf.predict(X_test)\n",
"sorted_labels = sorted(\n",
" labels,\n",
" key=lambda name: (name[1:], name[0])\n",
")\n",
"\n",
"print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels))"
]
},
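{
"cell_type": "markdown",
"metadata": {},
"source": [
"The micro average in the report pools the true positives, false positives and false negatives of all selected labels before computing the scores, which is why leaving out `O` matters so much. The sketch below shows that computation at the token level for flat lists of gold and predicted tags. It is an illustration under our own assumptions: `micro_prf` is a hypothetical helper, not a function of `sklearn-crfsuite`, and the toy tags are made up.\n",
"\n",
"```python\n",
"def micro_prf(y_true, y_pred, labels):\n",
"    # Pool the counts over all labels of interest and ignore the rest (such as 'O').\n",
"    tp = fp = fn = 0\n",
"    for gold, pred in zip(y_true, y_pred):\n",
"        if pred in labels and pred == gold:\n",
"            tp += 1\n",
"        else:\n",
"            if pred in labels:\n",
"                fp += 1  # predicted an entity label that is wrong here\n",
"            if gold in labels:\n",
"                fn += 1  # missed the gold entity label\n",
"    precision = tp / (tp + fp) if tp + fp else 0.0\n",
"    recall = tp / (tp + fn) if tp + fn else 0.0\n",
"    total = precision + recall\n",
"    f1 = 2 * precision * recall / total if total else 0.0\n",
"    return precision, recall, f1\n",
"\n",
"\n",
"gold = ['B-LOC', 'O', 'B-PER', 'I-PER', 'O']\n",
"pred = ['B-LOC', 'O', 'B-MISC', 'I-MISC', 'B-LOC']\n",
"entity_labels = {'B-LOC', 'B-PER', 'I-PER', 'B-MISC', 'I-MISC'}\n",
"print(micro_prf(gold, pred, entity_labels))\n",
"```\n",
"\n",
"Because the counts are pooled, frequent labels weigh more heavily in the micro average than rare ones such as `I-LOC`, whereas the macro average gives every label the same weight."
]
},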
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can also look at the most likely transitions the CRF has identified, and at the top features for every label. We'll do this with the `eli5` library, which helps us explain the predictions of machine learning models.\n",
"\n",
"The top transitions are quite intuitive: the most likely transitions are those within the same entity type (from a B-label to an I-label), and those where a B-label follows an O-label. \n",
"\n",
"The features, too, make sense. For example, if a word does not start with an uppercase letter, it is unlikely to be an entity. By contrast, a word is very likely to be a location if it ends in `ië`, which is indeed a very common suffix for locations in Dutch. Notice also how informative the embedding clusters are: for all entity types, the word clusters form some of the most informative features for the CRF. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <style>\n",
" table.eli5-weights tr:hover {\n",
" filter: brightness(85%);\n",
" }\n",
"</style>\n",
"\n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" \n",
"\n",
"\n",
"<table class=\"eli5-transition-features\" style=\"margin-bottom: 0.5em;\">\n",
" <thead>\n",
" <tr>\n",
" <td>From \\ To</td>\n",
" \n",
" <th>O</th>\n",
" \n",
" <th>B-LOC</th>\n",
" \n",
" <th>I-LOC</th>\n",
" \n",
" <th>B-MISC</th>\n",
" \n",
" <th>I-MISC</th>\n",
" \n",
" <th>B-ORG</th>\n",
" \n",
" <th>I-ORG</th>\n",
" \n",
" <th>B-PER</th>\n",
" \n",
" <th>I-PER</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" \n",
" <tr>\n",
" <th>O</th>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 88.09%)\" title=\"O ⇒ O\">\n",
" 4.141\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 87.21%)\" title=\"O ⇒ B-LOC\">\n",
" 4.583\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"O ⇒ I-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 88.09%)\" title=\"O ⇒ B-MISC\">\n",
" 4.141\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"O ⇒ I-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 87.64%)\" title=\"O ⇒ B-ORG\">\n",
" 4.366\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"O ⇒ I-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 88.74%)\" title=\"O ⇒ B-PER\">\n",
" 3.819\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"O ⇒ I-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>B-LOC</th>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 98.34%)\" title=\"B-LOC ⇒ O\">\n",
" -0.248\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 98.20%)\" title=\"B-LOC ⇒ B-LOC\">\n",
" -0.279\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 82.62%)\" title=\"B-LOC ⇒ I-LOC\">\n",
" 7.101\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-LOC ⇒ B-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-LOC ⇒ I-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-LOC ⇒ B-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-LOC ⇒ I-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 96.70%)\" title=\"B-LOC ⇒ B-PER\">\n",
" -0.661\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-LOC ⇒ I-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>I-LOC</th>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 95.40%)\" title=\"I-LOC ⇒ O\">\n",
" -1.062\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 98.40%)\" title=\"I-LOC ⇒ B-LOC\">\n",
" -0.235\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 84.62%)\" title=\"I-LOC ⇒ I-LOC\">\n",
" 5.967\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-LOC ⇒ B-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-LOC ⇒ I-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-LOC ⇒ B-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-LOC ⇒ I-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-LOC ⇒ B-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-LOC ⇒ I-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>B-MISC</th>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 95.64%)\" title=\"B-MISC ⇒ O\">\n",
" -0.985\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 96.73%)\" title=\"B-MISC ⇒ B-LOC\">\n",
" 0.655\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-MISC ⇒ I-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 98.03%)\" title=\"B-MISC ⇒ B-MISC\">\n",
" -0.316\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 81.56%)\" title=\"B-MISC ⇒ I-MISC\">\n",
" 7.73\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 97.10%)\" title=\"B-MISC ⇒ B-ORG\">\n",
" 0.551\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-MISC ⇒ I-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 97.44%)\" title=\"B-MISC ⇒ B-PER\">\n",
" 0.46\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-MISC ⇒ I-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>I-MISC</th>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 93.40%)\" title=\"I-MISC ⇒ O\">\n",
" -1.781\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-MISC ⇒ B-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-MISC ⇒ I-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 97.75%)\" title=\"I-MISC ⇒ B-MISC\">\n",
" -0.382\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 81.49%)\" title=\"I-MISC ⇒ I-MISC\">\n",
" 7.769\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 95.16%)\" title=\"I-MISC ⇒ B-ORG\">\n",
" 1.145\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-MISC ⇒ I-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 96.50%)\" title=\"I-MISC ⇒ B-PER\">\n",
" -0.719\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-MISC ⇒ I-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>B-ORG</th>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 98.28%)\" title=\"B-ORG ⇒ O\">\n",
" -0.261\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-ORG ⇒ B-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-ORG ⇒ I-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 96.20%)\" title=\"B-ORG ⇒ B-MISC\">\n",
" -0.809\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-ORG ⇒ I-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-ORG ⇒ B-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 81.44%)\" title=\"B-ORG ⇒ I-ORG\">\n",
" 7.803\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 99.09%)\" title=\"B-ORG ⇒ B-PER\">\n",
" 0.106\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-ORG ⇒ I-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>I-ORG</th>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 96.25%)\" title=\"I-ORG ⇒ O\">\n",
" -0.794\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-ORG ⇒ B-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-ORG ⇒ I-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-ORG ⇒ B-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-ORG ⇒ I-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-ORG ⇒ B-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 82.50%)\" title=\"I-ORG ⇒ I-ORG\">\n",
" 7.174\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 99.22%)\" title=\"I-ORG ⇒ B-PER\">\n",
" 0.084\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-ORG ⇒ I-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>B-PER</th>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 98.06%)\" title=\"B-PER ⇒ O\">\n",
" 0.31\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 97.90%)\" title=\"B-PER ⇒ B-LOC\">\n",
" -0.346\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-PER ⇒ I-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 96.88%)\" title=\"B-PER ⇒ B-MISC\">\n",
" -0.611\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-PER ⇒ I-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-PER ⇒ B-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"B-PER ⇒ I-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 94.40%)\" title=\"B-PER ⇒ B-PER\">\n",
" -1.408\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 80.00%)\" title=\"B-PER ⇒ I-PER\">\n",
" 8.68\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" <th>I-PER</th>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 99.10%)\" title=\"I-PER ⇒ O\">\n",
" 0.104\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-PER ⇒ B-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-PER ⇒ I-LOC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-PER ⇒ B-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-PER ⇒ I-MISC\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-PER ⇒ B-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-PER ⇒ I-ORG\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(0, 100.00%, 100.00%)\" title=\"I-PER ⇒ B-PER\">\n",
" 0.0\n",
" </td>\n",
" \n",
" <td style=\"background-color: hsl(120, 100.00%, 83.13%)\" title=\"I-PER ⇒ I-PER\">\n",
" 6.804\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" \n",
" </tbody>\n",
"</table>\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" \n",
"\n",
" \n",
" <table class=\"eli5-weights-wrapper\" style=\"border-collapse: collapse; border: none; margin-bottom: 1.5em;\">\n",
" <tr>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=O\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=B-LOC\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=I-LOC\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=B-MISC\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=I-MISC\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=B-ORG\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=I-ORG\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=B-PER\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" <td style=\"padding: 0.5em; border: 1px solid black; text-align: center;\">\n",
" <b>\n",
" \n",
" y=I-PER\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </td>\n",
" \n",
" </tr>\n",
" <tr>\n",
" \n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 80.16%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.692\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.istitle=False\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 81.61%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.312\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.isupper=False\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.14%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.211\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower="\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.08%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.999\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=+\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.84%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.835\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" EOS\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.90%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.820\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" BOS\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.28%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.740\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=158\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.54%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.684\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=415\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.99%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.590\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ag\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.01%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.588\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=195\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.16%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.556\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=dag\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.520\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=185\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.520\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=178\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.53%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.481\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=24\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.55%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.477\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=177\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.56%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.475\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=u\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.56%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 108173 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 89.36%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 6671 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 89.36%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.515\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=375\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 89.25%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.538\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=ronde\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 89.14%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.561\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:word.lower=(\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 88.86%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.618\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" 0\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 88.50%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.693\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=de\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 88.28%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.740\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=N\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.96%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.808\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]='s\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.80%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.843\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ple\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.75%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.853\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=Misc\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.47%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.914\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=6\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.33%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.945\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=1\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.13%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.990\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=111\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 86.78%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -2.068\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.isupper=True\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 85.12%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -2.447\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.istitle=True\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 80.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.734\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=325\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 80.31%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.650\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=375\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 81.01%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.466\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=68\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 82.17%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.169\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=139\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.99%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.020\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=in\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.10%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.995\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=143\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.20%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.973\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=476\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.87%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.828\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=102\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.87%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.617\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ië\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.25%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.538\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=(\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.36%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.516\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=uit\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.46%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.495\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=154\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.73%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.441\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:word.lower=/\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.85%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.417\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=-\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.91%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.404\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=tel.\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.04%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.379\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=het\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.07%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.374\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=440\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.09%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.369\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=sint-michiels\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.25%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.337\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=urs\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.320\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=futuroscope\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.301\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ope\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.301\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=363\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.56%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.278\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=vst\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.56%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.278\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=VSt\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.67%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.257\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=den\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.73%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.245\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=146\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.73%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.244\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=116\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.90%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.212\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=St\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.98%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.198\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=492\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.98%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 4441 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.33%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 668 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.33%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.322\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -2:word.lower=ronde\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.10%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.778\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=238\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.50%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.488\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=m\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.10%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.367\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=161\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.07%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.996\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=col\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.17%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.977\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=rk\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.89%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.852\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.istitle=False\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.07%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.821\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=al\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.14%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.810\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=york\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.14%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.810\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ork\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.14%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.809\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=eum\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.31%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.781\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=san\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.79%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.703\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=abeba\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.79%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.703\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=addis\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.79%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.702\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=eba\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.79%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.701\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.isupper=True\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.91%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.683\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -2:word.lower='\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.92%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.682\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=,\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.95%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.677\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=den\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.95%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.676\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=aan\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.97%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.674\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=um\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.99%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.670\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=staten\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.669\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=38\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.05%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.661\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=Prep\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.12%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.650\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -2:word.lower=col\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.12%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.650\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Art\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.13%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.648\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=museum\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.15%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.645\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ba\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.16%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.643\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=hotel\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.16%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 1294 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.97%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 131 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.97%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.674\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" EOS\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.31%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.781\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=Adj\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 81.59%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.318\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=23\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 84.92%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.494\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=100\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 85.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.419\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=39\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.65%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.097\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=1\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.91%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.039\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=294\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.92%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.036\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=338\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.06%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.786\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]='s\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.13%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.772\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=11\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.47%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.700\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=sport\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.73%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.646\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=buitenland\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.81%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.628\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=se\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.86%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.618\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=222\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.27%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.535\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=427\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.41%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.505\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=journaal\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.54%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.478\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=40\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.85%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.417\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=nse\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.88%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.411\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=Adj\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.06%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.374\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=111\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.19%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.350\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=aula\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.20%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.347\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=tobin-taks\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.30%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.329\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=218\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.301\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=128\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.58%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.274\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ula\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.84%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.223\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" BOS\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.90%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.213\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=tobin-heffing\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.94%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.204\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=ronde\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.94%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.204\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ks\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.98%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.197\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=v-plan\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.05%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.184\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ex\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.05%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 5588 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.67%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 853 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.67%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.256\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.isupper=True\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.04%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.792\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -2:word.lower=ronde\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.54%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.684\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.isupper=True\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.11%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.567\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=ronde\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.28%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.332\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=37\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.32%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.323\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:word.lower=ned\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.36%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.316\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=325\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.45%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.298\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=1\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.58%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.274\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=leven\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.89%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.215\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.istitle=True\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.96%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.201\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Num\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.06%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.182\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=euro\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.17%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.161\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=86\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.147\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=bijsluiter\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.48%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.103\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=financiële\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.50%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.100\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=ned\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.52%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.096\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ap\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.67%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.068\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=Num\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.82%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.042\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=413\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.89%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.029\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=00\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.98%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.012\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=de\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.99%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.010\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Prep\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.16%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.979\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=279\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.31%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.954\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=travel\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.33%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.950\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=formule\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.947\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=financiële\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.36%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.945\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=brussel\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.52%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.916\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:word.lower='\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.52%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 3172 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.42%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 465 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.42%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.115\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=V\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 89.60%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.467\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:postag=Num\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 83.03%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -2.953\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Adj\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 83.66%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.798\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=424\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 84.33%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.635\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=228\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.54%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.121\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=com\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.12%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.991\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=187\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.20%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.974\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=quizpeople\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.922\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ple\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.78%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.848\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:word.lower=morgen\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.01%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.798\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=250\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.55%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.683\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=83\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.14%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.560\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=29\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.25%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.538\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ga\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.31%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.525\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" BOS\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.500\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=bel\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.55%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.476\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -2:word.lower=minister\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.82%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.422\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=207\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.85%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.418\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=249\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.87%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.412\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=waterleau\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.88%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.411\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.isupper=True\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.02%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.384\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=belga\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.02%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.384\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=lga\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.30%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.328\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=our\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.320\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=om\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.35%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.319\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=to\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.45%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.299\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=freenet\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.56%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.277\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=baan\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.76%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.240\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=bij\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.82%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.227\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=lux\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.82%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.227\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=co\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.85%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.222\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=21\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.85%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 3591 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.20%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 469 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.20%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.156\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.isupper=False\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.25%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.338\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=morgen\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.42%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.304\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=403\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.74%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.243\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=413\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.96%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.200\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=vlaams\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.28%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.141\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=187\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.39%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.120\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=gen\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.49%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.101\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ion\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.73%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.057\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=radio\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.89%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.028\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=321\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.22%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.970\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=ned\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.91%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.849\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=nt\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.95%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.841\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=143\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.17%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.805\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -2:word.lower=voor\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.39%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.767\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=375\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.40%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.767\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Misc\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.42%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.763\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=es\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.760\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.760\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.760\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.53%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.745\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=478\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.60%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.733\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=411\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.69%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.719\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ey\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.77%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.705\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ola\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.79%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.702\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=le\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.79%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 2209 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.33%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 250 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.33%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.777\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -2:postag=V\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.793\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=in\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.06%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.824\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=financiële\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 92.95%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.842\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Num\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.55%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.090\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" 0\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.16%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.162\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=V\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 80.80%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +3.523\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=489\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 83.29%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.888\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=204\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 83.58%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.818\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=301\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 83.63%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.804\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 83.79%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.765\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=246\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 85.13%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.444\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=337\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 85.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.419\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=6\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 85.49%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.361\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=326\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.19%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.199\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=296\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 86.77%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +2.069\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=87\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.20%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.975\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=349\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.727\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=12\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.07%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.575\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=105\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.10%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.569\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=350\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.69%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.448\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=bode\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.74%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.439\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=85\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.20%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.347\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ov\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.340\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:word.lower=(\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.27%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.333\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=-\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.320\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=111\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.41%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.307\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=--\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.301\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ff\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.300\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=V\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.71%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.248\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=án\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.72%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.247\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=volgens\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.80%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.231\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=par\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.80%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 7464 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.19%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 851 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.19%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.349\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ing\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.14%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.360\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ur\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 88.90%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.610\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=in\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.37%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.937\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Art\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" <td style=\"padding: 0px; border: 1px solid black; vertical-align: top;\">\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; width: 100%;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 88.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.748\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=van\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 89.81%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.425\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.38%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.313\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=388\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.51%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.287\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=450\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.70%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.250\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=6\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 90.80%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.231\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=249\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.22%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.152\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +1:word.lower=(\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.35%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.127\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" +2:word.lower=die\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.80%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +1.044\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=gucht\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.21%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.970\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=337\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.46%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.927\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=pu\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.46%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.927\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=nfu\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.46%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.927\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=shenfu\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.47%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.925\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=fu\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.49%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.921\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=296\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.64%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.895\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=grauwe\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.76%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.875\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=uwe\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.79%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.869\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=we\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.80%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.867\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:word.lower=de\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.95%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.841\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ck\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.11%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.815\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" -1:postag=Pron\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.11%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.814\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.lower=den\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.11%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 4971 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.11%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 548 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.11%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.814\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=ma\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.05%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.824\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=aan\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 92.90%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.851\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=al\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 92.26%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.961\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.cluster=238\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.87%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.032\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word.istitle=False\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.74%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.055\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" postag=V\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.65%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.071\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-3:]=ter\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.61%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.079\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" word[-2:]=um\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
" </td>\n",
" \n",
" \n",
" </tr>\n",
" </table>\n",
" \n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
"\n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
"\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import eli5\n",
"\n",
"eli5.show_weights(crf, top=30)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding the optimal hyperparameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far we've trained a model with the default parameters. It's unlikely that these will give us the best performance possible. Therefore we're going to search automatically for the best hyperparameter settings by iteratively training different models and evaluating them. Eventually we'll pick the best one.\n",
"\n",
"Here we'll focus on two parameters: `c1` and `c2`. These are the coefficients for L1 and L2 regularization, respectively. Regularization counters overfitting on the training data by adding a penalty to the loss function. In L1 regularization, this penalty is proportional to the sum of the absolute values of the weights; in L2 regularization, to the sum of the squared weights. L1 regularization performs a type of feature selection, as it tends to push the weights of irrelevant features to exactly zero. L2 regularization, by contrast, makes the weights of irrelevant features small, but not necessarily zero. L1 regularization is often called the Lasso method, L2 the Ridge method, and a linear combination of both Elastic Net regularization.\n",
"\n",
"We define the parameter space for `c1` and `c2` and use the flat F1-score to compare the individual models. We use a randomized search, which means we don't try out a fixed grid of parameter settings; instead, the process samples 50 (`n_iter`) candidates at random from the distributions we've specified in the parameter space. Each candidate is scored with three-fold cross-validation, for 150 fits in total. This process takes a while, but it's worth the wait."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 50 candidates, totalling 150 fits\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.\n",
"[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 3.2min\n",
"[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 24.3min finished\n"
]
},
{
"data": {
"text/plain": [
"RandomizedSearchCV(cv=3, error_score='raise-deprecating',\n",
" estimator=CRF(algorithm='lbfgs', all_possible_states=None,\n",
" all_possible_transitions=True, averaging=None, c=None, c1=None, c2=None,\n",
" calibration_candidates=None, calibration_eta=None,\n",
" calibration_max_trials=None, calibration_rate=None,\n",
" calibration_samples=None, delta=None, epsilon=None, error...e,\n",
" num_memories=None, pa_type=None, period=None, trainer_cls=None,\n",
" variance=None, verbose=False),\n",
" fit_params=None, iid='warn', n_iter=50, n_jobs=-1,\n",
" param_distributions={'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f9947f04e10>, 'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f9947f04c88>},\n",
" pre_dispatch='2*n_jobs', random_state=None, refit=True,\n",
" return_train_score='warn',\n",
" scoring=make_scorer(flat_f1_score, average=weighted, labels=['B-ORG', 'B-MISC', 'B-PER', 'I-PER', 'B-LOC', 'I-MISC', 'I-ORG', 'I-LOC']),\n",
" verbose=1)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import scipy\n",
"from sklearn.metrics import make_scorer\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"crf = crfsuite.CRF(\n",
" algorithm='lbfgs',\n",
" max_iterations=100,\n",
" all_possible_transitions=True\n",
")\n",
"\n",
"params_space = {\n",
" 'c1': scipy.stats.expon(scale=0.5),\n",
" 'c2': scipy.stats.expon(scale=0.05),\n",
"}\n",
"\n",
"f1_scorer = make_scorer(metrics.flat_f1_score,\n",
" average='weighted', labels=labels)\n",
"\n",
"rs = RandomizedSearchCV(crf, params_space,\n",
" cv=3,\n",
" verbose=1,\n",
" n_jobs=-1,\n",
" n_iter=50,\n",
" scoring=f1_scorer)\n",
"rs.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the best hyperparameter settings. Our random search suggests a combination of L1 and L2 regularization."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"best params: {'c1': 0.08869645933566639, 'c2': 0.005642379370340676}\n",
"best CV score: 0.7608794798691931\n",
"model size: 1.06M\n"
]
}
],
"source": [
"print('best params:', rs.best_params_)\n",
"print('best CV score:', rs.best_score_)\n",
"print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To find out what precision, recall and F1-score this translates to, we take the best estimator from our random search and evaluate it on the test set. This indeed shows a nice improvement over our initial model. We've gone from an average F1-score of 77% to 79.1% (the micro average in the report below). Both precision and recall have improved, and we see a positive result for all four entity types."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" B-LOC 0.849 0.863 0.856 774\n",
" I-LOC 0.359 0.571 0.441 49\n",
" B-MISC 0.847 0.622 0.717 1187\n",
" I-MISC 0.664 0.415 0.511 410\n",
" B-ORG 0.806 0.727 0.764 882\n",
" I-ORG 0.772 0.677 0.721 551\n",
" B-PER 0.834 0.903 0.867 1098\n",
" I-PER 0.892 0.958 0.924 807\n",
"\n",
" micro avg 0.823 0.761 0.791 5758\n",
" macro avg 0.753 0.717 0.725 5758\n",
"weighted avg 0.820 0.761 0.784 5758\n",
"\n"
]
}
],
"source": [
"best_crf = rs.best_estimator_\n",
"y_pred = best_crf.predict(X_test)\n",
"print(metrics.flat_classification_report(\n",
" y_test, y_pred, labels=sorted_labels, digits=3\n",
"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conditional Random Fields have lost some of their popularity since the advent of neural-network models. Still, they can be very effective for named entity recognition, particularly when word embedding information is taken into account. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
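One of the classic challenges of Natural Language Processing is sequence labelling. In sequence labelling, the goal is to label each word in a text with a word class. In part-of-speech tagging, these word classes are parts of speech, such as noun or verb. In named entity recognition (NER), they're types of generic named entities, such as locations, people or organizations, or more specialized entities, such as diseases or symptoms in the healthcare domain. In this way, sequence labelling can help us extract the most important information from a text and improve the performance of analytics, search or matching applications.
In this notebook we'll explore Conditional Random Fields, the most popular approach to sequence labelling before Deep Learning arrived. Deep Learning may get all the attention right now, but Conditional Random Fields are still a powerful tool to build a simple sequence labeller.
The tool we're going to use is `sklearn-crfsuite`. This is a wrapper around `python-crfsuite`, which itself is a Python binding of CRFSuite. The reason we're using `sklearn-crfsuite` is that it provides a number of handy utility functions, for example for evaluating the output of the model. You can install it with `pip install sklearn-crfsuite`.
First we get some data. A well-known data set for training and testing NER models is the CoNLL-2002 data, which has Spanish and Dutch texts labelled with four types of entities: locations (LOC), persons (PER), organizations (ORG) and miscellaneous entities (MISC). Both corpora are split up into three portions: a training portion and two smaller test portions, one of which we'll use as development data. It's easy to collect the data from NLTK.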
import nltk
import sklearn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn_crfsuite as crfsuite
from sklearn_crfsuite import metrics
train_sents = list(nltk.corpus.conll2002.iob_sents('ned.train'))
dev_sents = list(nltk.corpus.conll2002.iob_sents('ned.testa'))
test_sents = list(nltk.corpus.conll2002.iob_sents('ned.testb'))
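The data consists of a list of tokenized sentences. For each token we have the string itself, its part-of-speech tag and its entity tag, which follows the BIO convention. In the Deep Learning world we live in today, it's common to ignore the part-of-speech tags. However, since CRFs rely on good feature extraction, we'll gladly make use of this information. After all, the part of speech of a word tells us a lot about its possible status as a named entity: nouns, for example, are more likely to be entities than verbs.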
train_sents[0]
[('De', 'Art', 'O'),
('tekst', 'N', 'O'),
('van', 'Prep', 'O'),
('het', 'Art', 'O'),
('arrest', 'N', 'O'),
('is', 'V', 'O'),
('nog', 'Adv', 'O'),
('niet', 'Adv', 'O'),
('schriftelijk', 'Adj', 'O'),
('beschikbaar', 'Adj', 'O'),
('maar', 'Conj', 'O'),
('het', 'Art', 'O'),
('bericht', 'N', 'O'),
('werd', 'V', 'O'),
('alvast', 'Adv', 'O'),
('bekendgemaakt', 'V', 'O'),
('door', 'Prep', 'O'),
('een', 'Art', 'O'),
('communicatiebureau', 'N', 'O'),
('dat', 'Conj', 'O'),
('Floralux', 'N', 'B-ORG'),
('inhuurde', 'V', 'O'),
('.', 'Punc', 'O')]
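The BIO convention used in the labels above marks the first token of an entity with `B-`, any following tokens of the same entity with `I-`, and everything else with `O`. As a minimal sketch of how such tags can be turned back into entity spans, the helper below groups them; `bio_to_spans` is a hypothetical function written for this illustration and is not part of NLTK or `sklearn-crfsuite`.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end, text) spans; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # The current entity continues only on an 'I-' tag of the same type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray 'I-' without a preceding 'B-' is treated as an entity start.
            start, etype = i, tag[2:]
    if start is not None:
        # Close an entity that runs up to the end of the sentence.
        spans.append((etype, start, len(tags), " ".join(tokens[start:])))
    return spans


tokens = ["dat", "Floralux", "inhuurde", "."]
tags = ["O", "B-ORG", "O", "O"]
print(bio_to_spans(tokens, tags))
```

With the data loaded as above, the same helper could be applied to a full sentence by unzipping its `(token, postag, label)` triples first.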
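Whereas today neural networks are expected to learn the relevant features of the input texts themselves, this is very different with Conditional Random Fields. CRFs learn the relationship between the features we give them and the label of a token in a given context. They don't learn these features themselves. Instead, the quality of the model will depend heavily on the relevance of the features we show it.
The most important method in this tutorial is therefore the one that collects the features for every token. What information could be useful? The word itself, of course, together with its part-of-speech tag. It can also be informative to know whether the word is completely uppercase, whether it starts with a capital, or whether it is a digit. In addition, we look at the character bigram and trigram at the end of the word. We also give each token a `bias` feature, which always has the same value. This bias feature helps the CRF learn the relative frequency of each label type in the training data.
To give the CRF more information about the meaning of a word, we also bring in information from word embeddings. In our word embeddings notebook, we trained word embeddings on Dutch Wikipedia and grouped them into 500 clusters. Here we'll read these 500 clusters from a file and map each word to the ID of the cluster it belongs to. This is very useful for named entity recognition, because most entity types cluster together. It allows the CRF to generalize above the word level. For example, when the CRF encounters a word it has never seen (say, Albania), it can base its decision on the cluster the word belongs to. If this cluster contains many other entities the CRF has encountered in its training data (say, Italy, Germany and France), it will have learnt a strong link between this cluster and a specific entity type. As a result, it can still assign that entity type to the unknown word. In our experiments, this feature alone boosted the performance by around 3%.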
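To make the cluster idea concrete, here is a small sketch of how a cluster ID can stand in for a word the model has never seen. The mapping and its cluster IDs are invented for illustration (the real IDs come from the clusters file), and `cluster_feature` is a hypothetical helper, not the notebook's own function.

```python
# Hypothetical word-to-cluster mapping; the IDs are made up for this example.
word2cluster = {
    "italy": "17",
    "germany": "17",
    "france": "17",
    "albania": "17",
    "monday": "203",
}


def cluster_feature(word, word2cluster, unknown="0"):
    """Return the cluster feature string for a word, falling back to `unknown`."""
    return "word.cluster=" + word2cluster.get(word.lower(), unknown)


seen_in_training = ["Italy", "Germany", "France"]
shared = {cluster_feature(w, word2cluster) for w in seen_in_training}

# 'Albania' never occurred in training, but it fires the same cluster feature.
print(cluster_feature("Albania", word2cluster) in shared)

# A word that is missing from the cluster file gets the fallback cluster.
print(cluster_feature("Xyzzy", word2cluster))
```

Because the weight is attached to the cluster feature rather than to the word string, whatever the CRF learnt from the three country names seen in training carries over to the unseen one.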
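Finally, apart from the token itself, we also want the CRF to look at its context. More specifically, we'll give it some extra information about the two words to the left and to the right of the target word. We'll tell the CRF what these words are, whether they start with a capital or are completely uppercase, and give it their part-of-speech tags. If there is no left or right context, we'll tell the CRF that the token is at the beginning or end of the sentence (`BOS` or `EOS`).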
def read_clusters(cluster_file):
    word2cluster = {}
    with open(cluster_file) as i:
        for line in i:
            word, cluster = line.strip().split('\t')
            word2cluster[word] = cluster
    return word2cluster
def word2features(sent, i, word2cluster):
    word = sent[i][0]
    postag = sent[i][1]
    # Words that are missing from the cluster file fall back to cluster "0".
    cluster = word2cluster.get(word.lower(), "0")
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'word.cluster=%s' % cluster,
        'postag=' + postag
    ]
    # Left context: previous word, or a BOS marker at the start of the sentence.
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1
        ])
    else:
        features.append('BOS')
    if i > 1:
        word2 = sent[i-2][0]
        postag2 = sent[i-2][1]
        features.extend([
            '-2:word.lower=' + word2.lower(),
            '-2:word.istitle=%s' % word2.istitle(),
            '-2:word.isupper=%s' % word2.isupper(),
            '-2:postag=' + postag2
        ])
    # Right context: next word, or an EOS marker at the end of the sentence.
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1
        ])
    else:
        features.append('EOS')
    if i < len(sent)-2:
        word2 = sent[i+2][0]
        postag2 = sent[i+2][1]
        features.extend([
            '+2:word.lower=' + word2.lower(),
            '+2:word.istitle=%s' % word2.istitle(),
            '+2:word.isupper=%s' % word2.isupper(),
            '+2:postag=' + postag2
        ])
    return features
def sent2features(sent, word2cluster):
    return [word2features(sent, i, word2cluster) for i in range(len(sent))]
def sent2labels(sent):
    return [label for token, postag, label in sent]
def sent2tokens(sent):
    return [token for token, postag, label in sent]
word2cluster = read_clusters("data/embeddings/clusters_nl.tsv")
sent2features(train_sents[0], word2cluster)[0]
['bias',
'word.lower=de',
'word[-3:]=De',
'word[-2:]=De',
'word.isupper=False',
'word.istitle=True',
'word.isdigit=False',
'word.cluster=38',
'postag=Art',
'BOS',
'+1:word.lower=tekst',
'+1:word.istitle=False',
'+1:word.isupper=False',
'+1:postag=N',
'+2:word.lower=van',
'+2:word.istitle=False',
'+2:word.isupper=False',
'+2:postag=Prep']
X_train = [sent2features(s, word2cluster) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_dev = [sent2features(s, word2cluster) for s in dev_sents]
y_dev = [sent2labels(s) for s in dev_sents]
X_test = [sent2features(s, word2cluster) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
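The `bias` feature described earlier lets the CRF pick up how frequent each label is. Before training, it can help to look at that distribution directly. The sketch below does this on toy label sequences, not on the real data; `label_distribution` is a hypothetical helper introduced for this example.

```python
from collections import Counter


def label_distribution(label_sequences):
    """Return the relative frequency of every label across all sequences."""
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy label sequences in the same shape as y_train (one list per sentence).
toy_labels = [
    ["O", "O", "B-ORG", "O"],
    ["B-PER", "I-PER", "O", "O"],
]
for label, share in sorted(label_distribution(toy_labels).items()):
    print(label, round(share, 3))
```

Calling `label_distribution(y_train)` on the real training labels would show how strongly the `O` label dominates, which is worth keeping in mind when reading accuracy figures.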
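We now create a CRF model and train it. We'll use the standard L-BFGS algorithm for parameter estimation and run it for 100 iterations. When we're done, we'll save the model with `joblib`.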
crf = crfsuite.CRF(
    verbose=True,
    algorithm='lbfgs',
    max_iterations=100
)
crf.fit(X_train, y_train, X_dev=X_dev, y_dev=y_dev)
loading training data to CRFsuite: 100%|██████████| 15806/15806 [00:02<00:00, 7623.17it/s]
loading dev data to CRFsuite: 27%|██▋ | 769/2895 [00:00<00:00, 7689.13it/s]
loading dev data to CRFsuite: 100%|██████████| 2895/2895 [00:00<00:00, 7186.08it/s]
Holdout group: 2
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 152117
Seconds required: 0.424
L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20
Iter 1 time=0.37 loss=104214.83 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.00
Iter 2 time=0.21 loss=96997.81 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.13
Iter 3 time=0.21 loss=92085.38 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.26
Iter 4 time=0.21 loss=84277.67 active=152117 precision=0.100 recall=0.111 F1=0.105 Acc(item/seq)=0.901 0.496 feature_norm=1.51
Iter 5 time=0.21 loss=67577.53 active=152117 precision=0.169 recall=0.113 F1=0.109 Acc(item/seq)=0.902 0.496 feature_norm=2.32
Iter 6 time=0.21 loss=47854.26 active=152117 precision=0.326 recall=0.347 F1=0.320 Acc(item/seq)=0.930 0.580 feature_norm=4.34
Iter 7 time=0.21 loss=43326.19 active=152117 precision=0.340 recall=0.365 F1=0.333 Acc(item/seq)=0.933 0.592 feature_norm=5.06
Iter 8 time=0.21 loss=38617.07 active=152117 precision=0.372 recall=0.399 F1=0.362 Acc(item/seq)=0.938 0.618 feature_norm=6.43
Iter 9 time=0.21 loss=35511.85 active=152117 precision=0.491 recall=0.442 F1=0.421 Acc(item/seq)=0.942 0.631 feature_norm=8.62
Iter 10 time=0.21 loss=32735.31 active=152117 precision=0.535 recall=0.456 F1=0.445 Acc(item/seq)=0.944 0.654 feature_norm=9.50
Iter 11 time=0.21 loss=31687.17 active=152117 precision=0.530 recall=0.491 F1=0.502 Acc(item/seq)=0.947 0.663 feature_norm=10.94
Iter 12 time=0.21 loss=29323.19 active=152117 precision=0.532 recall=0.482 F1=0.477 Acc(item/seq)=0.946 0.666 feature_norm=11.18
Iter 13 time=0.21 loss=28733.55 active=152117 precision=0.596 recall=0.489 F1=0.489 Acc(item/seq)=0.947 0.668 feature_norm=11.58
Iter 14 time=0.21 loss=27120.69 active=152117 precision=0.640 recall=0.507 F1=0.512 Acc(item/seq)=0.948 0.677 feature_norm=12.36
Iter 15 time=0.21 loss=24849.05 active=152117 precision=0.640 recall=0.547 F1=0.558 Acc(item/seq)=0.952 0.697 feature_norm=13.86
Iter 16 time=0.40 loss=24033.40 active=152117 precision=0.654 recall=0.580 F1=0.586 Acc(item/seq)=0.954 0.706 feature_norm=14.53
Iter 17 time=0.21 loss=22935.94 active=152117 precision=0.669 recall=0.578 F1=0.598 Acc(item/seq)=0.955 0.712 feature_norm=15.15
Iter 18 time=0.21 loss=21803.53 active=152117 precision=0.682 recall=0.584 F1=0.605 Acc(item/seq)=0.956 0.713 feature_norm=15.67
Iter 19 time=0.21 loss=21046.75 active=152117 precision=0.725 recall=0.565 F1=0.586 Acc(item/seq)=0.956 0.717 feature_norm=16.04
Iter 20 time=0.21 loss=20465.96 active=152117 precision=0.701 recall=0.566 F1=0.594 Acc(item/seq)=0.956 0.711 feature_norm=15.81
Iter 21 time=0.21 loss=19991.29 active=152117 precision=0.706 recall=0.586 F1=0.611 Acc(item/seq)=0.957 0.717 feature_norm=15.63
Iter 22 time=0.21 loss=19560.23 active=152117 precision=0.684 recall=0.597 F1=0.616 Acc(item/seq)=0.957 0.722 feature_norm=15.58
Iter 23 time=0.21 loss=19241.14 active=152117 precision=0.672 recall=0.602 F1=0.615 Acc(item/seq)=0.957 0.726 feature_norm=15.65
Iter 24 time=0.21 loss=18787.87 active=152117 precision=0.678 recall=0.627 F1=0.637 Acc(item/seq)=0.958 0.731 feature_norm=16.31
Iter 25 time=0.21 loss=18145.13 active=152117 precision=0.690 recall=0.625 F1=0.640 Acc(item/seq)=0.959 0.734 feature_norm=16.94
Iter 26 time=0.21 loss=17786.38 active=152117 precision=0.710 recall=0.621 F1=0.642 Acc(item/seq)=0.959 0.738 feature_norm=17.48
Iter 27 time=0.21 loss=17247.02 active=152117 precision=0.711 recall=0.625 F1=0.649 Acc(item/seq)=0.959 0.736 feature_norm=18.46
Iter 28 time=0.21 loss=16876.01 active=152117 precision=0.737 recall=0.627 F1=0.655 Acc(item/seq)=0.960 0.748 feature_norm=19.83
Iter 29 time=0.21 loss=16543.87 active=152117 precision=0.732 recall=0.631 F1=0.663 Acc(item/seq)=0.961 0.751 feature_norm=20.12
Iter 30 time=0.21 loss=16263.21 active=152117 precision=0.725 recall=0.644 F1=0.671 Acc(item/seq)=0.962 0.753 feature_norm=20.42
Iter 31 time=0.21 loss=15665.78 active=152117 precision=0.715 recall=0.661 F1=0.676 Acc(item/seq)=0.963 0.758 feature_norm=21.50
Iter 32 time=0.21 loss=15247.34 active=152117 precision=0.700 recall=0.641 F1=0.650 Acc(item/seq)=0.961 0.748 feature_norm=23.18
Iter 33 time=0.21 loss=14866.51 active=152117 precision=0.702 recall=0.658 F1=0.670 Acc(item/seq)=0.963 0.754 feature_norm=24.67
Iter 34 time=0.21 loss=14650.96 active=152117 precision=0.704 recall=0.659 F1=0.675 Acc(item/seq)=0.963 0.755 feature_norm=25.38
Iter 35 time=0.21 loss=14386.98 active=152117 precision=0.730 recall=0.677 F1=0.695 Acc(item/seq)=0.964 0.761 feature_norm=26.47
Iter 36 time=0.21 loss=14158.49 active=152117 precision=0.742 recall=0.681 F1=0.704 Acc(item/seq)=0.965 0.763 feature_norm=28.58
Iter 37 time=0.21 loss=13895.30 active=152117 precision=0.736 recall=0.684 F1=0.701 Acc(item/seq)=0.965 0.765 feature_norm=28.72
Iter 38 time=0.21 loss=13656.45 active=152117 precision=0.730 recall=0.683 F1=0.695 Acc(item/seq)=0.965 0.768 feature_norm=29.07
Iter 39 time=0.21 loss=13499.28 active=152117 precision=0.727 recall=0.680 F1=0.691 Acc(item/seq)=0.965 0.769 feature_norm=29.61
Iter 40 time=0.21 loss=13174.95 active=152117 precision=0.726 recall=0.677 F1=0.689 Acc(item/seq)=0.965 0.766 feature_norm=31.03
Iter 41 time=0.21 loss=13104.00 active=152117 precision=0.736 recall=0.662 F1=0.678 Acc(item/seq)=0.964 0.760 feature_norm=33.06
Iter 42 time=0.21 loss=12750.79 active=152117 precision=0.731 recall=0.685 F1=0.701 Acc(item/seq)=0.966 0.764 feature_norm=34.35
Iter 43 time=0.21 loss=12637.02 active=152117 precision=0.740 recall=0.690 F1=0.708 Acc(item/seq)=0.966 0.766 feature_norm=34.63
Iter 44 time=0.21 loss=12534.20 active=152117 precision=0.745 recall=0.692 F1=0.712 Acc(item/seq)=0.966 0.766 feature_norm=35.30
Iter 45 time=0.21 loss=12390.89 active=152117 precision=0.739 recall=0.682 F1=0.702 Acc(item/seq)=0.966 0.761 feature_norm=37.15
Iter 46 time=0.21 loss=12277.42 active=152117 precision=0.733 recall=0.689 F1=0.704 Acc(item/seq)=0.966 0.763 feature_norm=37.58
Iter 47 time=0.21 loss=12219.03 active=152117 precision=0.739 recall=0.690 F1=0.706 Acc(item/seq)=0.966 0.767 feature_norm=37.20
Iter 48 time=0.21 loss=12125.64 active=152117 precision=0.743 recall=0.695 F1=0.711 Acc(item/seq)=0.967 0.770 feature_norm=36.99
Iter 49 time=0.21 loss=11970.65 active=152117 precision=0.745 recall=0.697 F1=0.712 Acc(item/seq)=0.967 0.775 feature_norm=36.84
Iter 50 time=0.21 loss=11780.73 active=152117 precision=0.754 recall=0.696 F1=0.718 Acc(item/seq)=0.968 0.777 feature_norm=37.57
Iter 51 time=0.21 loss=11623.83 active=152117 precision=0.740 recall=0.691 F1=0.710 Acc(item/seq)=0.967 0.774 feature_norm=38.21
Iter 52 time=0.21 loss=11549.38 active=152117 precision=0.740 recall=0.690 F1=0.709 Acc(item/seq)=0.967 0.773 feature_norm=38.94
Iter 53 time=0.21 loss=11497.85 active=152117 precision=0.739 recall=0.696 F1=0.713 Acc(item/seq)=0.967 0.772 feature_norm=39.54
Iter 54 time=0.21 loss=11419.64 active=152117 precision=0.733 recall=0.692 F1=0.707 Acc(item/seq)=0.966 0.773 feature_norm=40.40
Iter 55 time=0.21 loss=11280.29 active=152117 precision=0.744 recall=0.707 F1=0.719 Acc(item/seq)=0.967 0.773 feature_norm=41.75
Iter 56 time=0.21 loss=11131.39 active=152117 precision=0.746 recall=0.710 F1=0.722 Acc(item/seq)=0.968 0.773 feature_norm=42.72
Iter 57 time=0.21 loss=11043.40 active=152117 precision=0.751 recall=0.713 F1=0.726 Acc(item/seq)=0.968 0.774 feature_norm=42.92
Iter 58 time=0.21 loss=10954.38 active=152117 precision=0.769 recall=0.713 F1=0.736 Acc(item/seq)=0.969 0.781 feature_norm=43.18
Iter 59 time=0.21 loss=10836.31 active=152117 precision=0.773 recall=0.713 F1=0.736 Acc(item/seq)=0.969 0.781 feature_norm=43.74
Iter 60 time=0.21 loss=10712.24 active=152117 precision=0.779 recall=0.719 F1=0.744 Acc(item/seq)=0.970 0.788 feature_norm=44.41
Iter 61 time=0.21 loss=10602.81 active=152117 precision=0.789 recall=0.709 F1=0.740 Acc(item/seq)=0.970 0.789 feature_norm=44.68
Iter 62 time=0.21 loss=10508.84 active=152117 precision=0.782 recall=0.711 F1=0.739 Acc(item/seq)=0.970 0.787 feature_norm=45.37
Iter 63 time=0.21 loss=10458.88 active=152117 precision=0.783 recall=0.717 F1=0.744 Acc(item/seq)=0.970 0.788 feature_norm=45.51
Iter 64 time=0.21 loss=10420.78 active=152117 precision=0.763 recall=0.711 F1=0.730 Acc(item/seq)=0.969 0.786 feature_norm=45.67
Iter 65 time=0.21 loss=10315.28 active=152117 precision=0.766 recall=0.721 F1=0.735 Acc(item/seq)=0.969 0.786 feature_norm=46.30
Iter 66 time=0.21 loss=10204.10 active=152117 precision=0.769 recall=0.728 F1=0.740 Acc(item/seq)=0.970 0.786 feature_norm=47.10
Iter 67 time=0.21 loss=10134.54 active=152117 precision=0.769 recall=0.716 F1=0.737 Acc(item/seq)=0.970 0.787 feature_norm=47.87
Iter 68 time=0.21 loss=10095.10 active=152117 precision=0.773 recall=0.718 F1=0.741 Acc(item/seq)=0.970 0.787 feature_norm=47.85
Iter 69 time=0.21 loss=10059.52 active=152117 precision=0.773 recall=0.715 F1=0.738 Acc(item/seq)=0.970 0.790 feature_norm=47.81
Iter 70 time=0.21 loss=10012.53 active=152117 precision=0.767 recall=0.712 F1=0.733 Acc(item/seq)=0.970 0.790 feature_norm=47.95
Iter 71 time=0.21 loss=9931.41 active=152117 precision=0.765 recall=0.710 F1=0.731 Acc(item/seq)=0.970 0.791 feature_norm=48.49
Iter 72 time=0.21 loss=9861.45 active=152117 precision=0.763 recall=0.709 F1=0.730 Acc(item/seq)=0.970 0.793 feature_norm=49.07
Iter 73 time=0.21 loss=9803.75 active=152117 precision=0.769 recall=0.723 F1=0.741 Acc(item/seq)=0.970 0.796 feature_norm=49.32
Iter 74 time=0.21 loss=9762.61 active=152117 precision=0.758 recall=0.720 F1=0.733 Acc(item/seq)=0.970 0.793 feature_norm=49.58
Iter 75 time=0.21 loss=9726.47 active=152117 precision=0.761 recall=0.722 F1=0.736 Acc(item/seq)=0.970 0.790 feature_norm=49.80
Iter 76 time=0.21 loss=9628.97 active=152117 precision=0.764 recall=0.722 F1=0.736 Acc(item/seq)=0.970 0.789 feature_norm=50.47
Iter 77 time=0.40 loss=9586.61 active=152117 precision=0.763 recall=0.725 F1=0.740 Acc(item/seq)=0.971 0.791 feature_norm=50.96
Iter 78 time=0.21 loss=9522.00 active=152117 precision=0.767 recall=0.723 F1=0.741 Acc(item/seq)=0.970 0.788 feature_norm=51.34
Iter 79 time=0.21 loss=9479.87 active=152117 precision=0.765 recall=0.720 F1=0.736 Acc(item/seq)=0.970 0.788 feature_norm=51.77
Iter 80 time=0.21 loss=9448.33 active=152117 precision=0.771 recall=0.720 F1=0.739 Acc(item/seq)=0.970 0.789 feature_norm=51.80
Iter 81 time=0.21 loss=9423.55 active=152117 precision=0.769 recall=0.723 F1=0.740 Acc(item/seq)=0.970 0.789 feature_norm=51.83
Iter 82 time=0.21 loss=9367.51 active=152117 precision=0.763 recall=0.720 F1=0.736 Acc(item/seq)=0.970 0.790 feature_norm=52.25
Iter 83 time=0.21 loss=9322.98 active=152117 precision=0.762 recall=0.722 F1=0.737 Acc(item/seq)=0.970 0.793 feature_norm=52.34
Iter 84 time=0.21 loss=9277.51 active=152117 precision=0.765 recall=0.725 F1=0.740 Acc(item/seq)=0.971 0.798 feature_norm=52.47
Iter 85 time=0.21 loss=9229.07 active=152117 precision=0.772 recall=0.726 F1=0.743 Acc(item/seq)=0.971 0.798 feature_norm=52.79
Iter 86 time=0.21 loss=9214.50 active=152117 precision=0.782 recall=0.729 F1=0.751 Acc(item/seq)=0.971 0.799 feature_norm=52.88
Iter 87 time=0.21 loss=9194.55 active=152117 precision=0.785 recall=0.731 F1=0.753 Acc(item/seq)=0.971 0.800 feature_norm=52.82
Iter 88 time=0.21 loss=9182.23 active=152117 precision=0.783 recall=0.728 F1=0.750 Acc(item/seq)=0.971 0.798 feature_norm=52.86
Iter 89 time=0.21 loss=9161.54 active=152117 precision=0.786 recall=0.726 F1=0.750 Acc(item/seq)=0.971 0.795 feature_norm=52.98
Iter 90 time=0.21 loss=9130.06 active=152117 precision=0.788 recall=0.728 F1=0.752 Acc(item/seq)=0.971 0.797 feature_norm=53.17
Iter 91 time=0.21 loss=9063.09 active=152117 precision=0.794 recall=0.738 F1=0.760 Acc(item/seq)=0.972 0.800 feature_norm=53.64
Iter 92 time=0.40 loss=9041.98 active=152117 precision=0.791 recall=0.739 F1=0.759 Acc(item/seq)=0.972 0.801 feature_norm=53.77
Iter 93 time=0.21 loss=9010.27 active=152117 precision=0.783 recall=0.736 F1=0.754 Acc(item/seq)=0.972 0.801 feature_norm=53.97
Iter 94 time=0.21 loss=8984.80 active=152117 precision=0.786 recall=0.740 F1=0.759 Acc(item/seq)=0.972 0.803 feature_norm=54.09
Iter 95 time=0.21 loss=8965.14 active=152117 precision=0.788 recall=0.740 F1=0.760 Acc(item/seq)=0.972 0.803 feature_norm=54.14
Iter 96 time=0.21 loss=8931.00 active=152117 precision=0.786 recall=0.737 F1=0.758 Acc(item/seq)=0.972 0.804 feature_norm=54.26
Iter 97 time=0.21 loss=8923.90 active=152117 precision=0.791 recall=0.737 F1=0.757 Acc(item/seq)=0.971 0.799 feature_norm=54.68
Iter 98 time=0.21 loss=8860.78 active=152117 precision=0.784 recall=0.735 F1=0.755 Acc(item/seq)=0.971 0.802 feature_norm=54.64
Iter 99 time=0.21 loss=8846.13 active=152117 precision=0.788 recall=0.737 F1=0.758 Acc(item/seq)=0.972 0.803 feature_norm=54.68
Iter 100 time=0.21 loss=8832.34 active=152117 precision=0.789 recall=0.737 F1=0.758 Acc(item/seq)=0.972 0.803 feature_norm=54.77
================================================
Label Precision Recall F1 Support
------- ----------- -------- ----- ---------
B-LOC 0.823 0.823 0.823 479
B-MISC 0.806 0.651 0.720 748
B-ORG 0.853 0.620 0.718 686
B-PER 0.752 0.855 0.800 703
I-LOC 0.583 0.547 0.565 64
I-MISC 0.645 0.507 0.568 215
I-ORG 0.806 0.684 0.740 396
I-PER 0.844 0.946 0.892 423
O 0.990 0.998 0.994 33973
------------------------------------------------
L-BFGS terminated with the maximum number of iterations
Total seconds required for training: 21.768
Storing the model
Number of active features: 152117 (152117)
Number of active attributes: 130306 (145602)
Number of active labels: 9 (9)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.058
CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=None, averaging=None, c=None, c1=None, c2=None,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
max_linesearch=None, min_freq=None, model_filename=None,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose='true')
import joblib
import os
OUTPUT_PATH = "models/ner/"
OUTPUT_FILE = "crf_model"
# Create the output directory, including any missing parent directories.
os.makedirs(OUTPUT_PATH, exist_ok=True)
joblib.dump(crf, os.path.join(OUTPUT_PATH, OUTPUT_FILE))
['models/ner/crf_model']
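Let's evaluate the output of our CRF. We'll load the model from the output file above and have it predict labels for the full test set.
As a sanity check, let's look at its predictions for the first test sentence. This output looks very good: the CRF correctly predicts all four locations in the sentence. It only gets the person entity wrong, labelling "Der Kaiser" as MISC instead of PER. That is an odd case anyway, since it is not really a person's name.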
crf = joblib.load(os.path.join(OUTPUT_PATH, OUTPUT_FILE))
y_pred = crf.predict(X_test)
example_sent = test_sents[0]
print("Sentence:", ' '.join(sent2tokens(example_sent)))
print("Predicted:", ' '.join(crf.predict([sent2features(example_sent, word2cluster)])[0]))
print("Correct: ", ' '.join(sent2labels(example_sent)))
Sentence: Dat is in Italië , Spanje of Engeland misschien geen probleem , maar volgens ' Der Kaiser ' in Duitsland wel .
Predicted: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O
Correct: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O
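To spot such differences without reading the label rows by eye, the predicted and correct labels can be compared token by token. The snippet below is a minimal sketch that uses only the sentence and labels printed above; the helper name `find_mismatches` is our own and is not part of `sklearn-crfsuite`.

```python
# Tokens and labels copied from the example output above.
tokens = ("Dat is in Italië , Spanje of Engeland misschien geen probleem , "
          "maar volgens ' Der Kaiser ' in Duitsland wel .").split()
predicted = ("O O O B-LOC O B-LOC O B-LOC O O O O O O O "
             "B-MISC I-MISC O O B-LOC O O").split()
correct = ("O O O B-LOC O B-LOC O B-LOC O O O O O O O "
           "B-PER I-PER O O B-LOC O O").split()


def find_mismatches(tokens, predicted, correct):
    """Return (token, predicted_label, correct_label) for every disagreement."""
    assert len(tokens) == len(predicted) == len(correct)
    return [(t, p, c) for t, p, c in zip(tokens, predicted, correct) if p != c]


mismatches = find_mismatches(tokens, predicted, correct)
for token, pred, gold in mismatches:
    print(f"{token:10} predicted={pred:7} correct={gold}")
```

Only the two tokens of "Der Kaiser" are reported, which confirms that the rest of the sentence is labelled correctly.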
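Now we evaluate on the full test set. We'll print a classification report for all labels except `O`. Because `O` is far more frequent in our data than the entity labels, including it would artificially inflate the average scores, simply because the CRF's `O` predictions are very likely to be correct. We obtain an average F-score of 77% across all entity types (micro average), with particularly good results for `B-LOC` and `B-PER`.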
labels = list(crf.classes_)
labels.remove("O")
y_pred = crf.predict(X_test)
# Sort by entity type first, then by B/I prefix, so B-LOC and I-LOC appear together.
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels))
precision recall f1-score support
B-LOC 0.83 0.83 0.83 774
I-LOC 0.29 0.41 0.34 49
B-MISC 0.84 0.61 0.71 1187
I-MISC 0.59 0.42 0.49 410
B-ORG 0.80 0.69 0.74 882
I-ORG 0.74 0.66 0.70 551
B-PER 0.80 0.90 0.85 1098
I-PER 0.87 0.95 0.91 807
micro avg 0.80 0.74 0.77 5758
macro avg 0.72 0.68 0.70 5758
weighted avg 0.80 0.74 0.76 5758
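The effect of leaving out `O` can be illustrated with a small, invented example. The sketch below computes a micro-averaged F1 on flat label lists in plain Python. The toy labels and the helper `micro_f1` are assumptions made for illustration and do not reproduce the exact implementation behind `flat_classification_report`.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1 restricted to `labels`, computed on flat label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p and t in labels)
    pred_count = sum(1 for p in y_pred if p in labels)
    true_count = sum(1 for t in y_true if t in labels)
    if tp == 0:
        return 0.0
    precision = tp / pred_count
    recall = tp / true_count
    return 2 * precision * recall / (precision + recall)


# Invented toy data: 96 O tokens and 4 entity tokens, of which the tagger finds 2.
y_true = ["O"] * 96 + ["B-LOC", "B-PER", "B-ORG", "B-LOC"]
y_pred = ["O"] * 96 + ["B-LOC", "B-PER", "O", "O"]

entity_labels = {"B-LOC", "B-PER", "B-ORG"}
f1_with_o = micro_f1(y_true, y_pred, entity_labels | {"O"})
f1_without_o = micro_f1(y_true, y_pred, entity_labels)
print(f"with O: {f1_with_o:.3f}, without O: {f1_without_o:.3f}")
# with O: 0.980, without O: 0.667
```

Even though this toy tagger misses half of the entities, the score that includes `O` is close to perfect, which is why the report above is restricted to the entity labels.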
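Now we can also look at the most likely transitions the CRF has identified, and at the top features for each label. We'll do this with the `eli5` library, which helps us explain the predictions of machine learning models.
The top transitions are quite intuitive: the most likely transitions are those within the same entity type (from a B label to the matching I label), and those where a B label follows an O label.
The features make sense as well. For example, if a word does not start with an uppercase letter, it is unlikely to be an entity. Conversely, if a word ends in `ië`, it is likely to be a location, and this is indeed a common suffix for place names in Dutch. Note also how informative the embedding clusters are: for all entity types, word clusters are among the most informative features of the CRF.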
import eli5
eli5.show_weights(crf, top=30)
From \ To | O | B-LOC | I-LOC | B-MISC | I-MISC | B-ORG | I-ORG | B-PER | I-PER |
---|---|---|---|---|---|---|---|---|---|
O | 4.141 | 4.583 | 0.0 | 4.141 | 0.0 | 4.366 | 0.0 | 3.819 | 0.0 |
B-LOC | -0.248 | -0.279 | 7.101 | 0.0 | 0.0 | 0.0 | 0.0 | -0.661 | 0.0 |
I-LOC | -1.062 | -0.235 | 5.967 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
B-MISC | -0.985 | 0.655 | 0.0 | -0.316 | 7.73 | 0.551 | 0.0 | 0.46 | 0.0 |
I-MISC | -1.781 | 0.0 | 0.0 | -0.382 | 7.769 | 1.145 | 0.0 | -0.719 | 0.0 |
B-ORG | -0.261 | 0.0 | 0.0 | -0.809 | 0.0 | 0.0 | 7.803 | 0.106 | 0.0 |
I-ORG | -0.794 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.174 | 0.084 | 0.0 |
B-PER | 0.31 | -0.346 | 0.0 | -0.611 | 0.0 | 0.0 | 0.0 | -1.408 | 8.68 |
I-PER | 0.104 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.804 |
[eli5 top-feature tables, one per label: y=O, y=B-LOC, y=I-LOC, y=B-MISC, y=I-MISC, y=B-ORG, y=I-ORG, y=B-PER, y=I-PER]
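The transition weights can also be ranked programmatically. The sketch below does this for a handful of weights copied from the transition table above; the helper `rank_transitions` is our own. With a trained model, the same dictionary shape should be available as `crf.transition_features_` in `sklearn-crfsuite`, but the snippet does not depend on it.

```python
from collections import Counter

# A handful of (from_label, to_label) -> weight entries copied from the table above.
transitions = {
    ("B-PER", "I-PER"): 8.68,
    ("B-ORG", "I-ORG"): 7.803,
    ("I-MISC", "I-MISC"): 7.769,
    ("B-MISC", "I-MISC"): 7.73,
    ("B-LOC", "I-LOC"): 7.101,
    ("O", "B-LOC"): 4.583,
    ("O", "O"): 4.141,
    ("I-MISC", "O"): -1.781,
    ("B-PER", "B-PER"): -1.408,
    ("I-LOC", "O"): -1.062,
}


def rank_transitions(weights, n=3):
    """Return the n most likely and the n least likely transitions by weight."""
    ranked = Counter(weights).most_common()  # sorted by weight, highest first
    return ranked[:n], ranked[-n:]


most_likely, least_likely = rank_transitions(transitions)
for (src, dst), weight in most_likely + least_likely:
    print(f"{src:7} -> {dst:7} {weight:7.3f}")
```

The highest weights all continue an entity of the same type, while the strongly negative ones include transitions such as `B-PER` directly followed by another `B-PER`.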
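So far we've trained a model with the default parameters. These are unlikely to give us the best performance. We'll therefore search for the best hyperparameter settings automatically, by iteratively training different models and evaluating them. In the end, we'll pick the best model.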
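Here we'll focus on two parameters: `c1` and `c2`. These are the coefficients for L1 and L2 regularization, respectively. Regularization prevents overfitting on the training data by adding a penalty to the loss function. In L1 regularization, this penalty is the sum of the absolute values of the weights; in L2 regularization, it is the sum of the squared weights. L1 regularization performs a kind of feature selection, because it assigns a weight of 0 to irrelevant features. L2 regularization, by contrast, makes the weights of irrelevant features small, but not necessarily zero. L1 regularization is often called the Lasso method, L2 the Ridge method, and a linear combination of the two is called Elastic Net regularization.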
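The difference between the two penalties can be made concrete with a small sketch. The function below follows the description above (`c1` times the sum of absolute weights plus `c2` times the sum of squared weights). The exact scaling inside CRFsuite may differ by constant factors, and the weight vectors are hypothetical.

```python
def elastic_net_penalty(weights, c1, c2):
    """Return c1 * sum(|w|) + c2 * sum(w^2) for a list of feature weights."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return c1 * l1 + c2 * l2


# Hypothetical weight vectors with the same sum of absolute values (4.0).
sparse = [4.0, 0.0, 0.0, 0.0]  # one large weight, the rest exactly zero
spread = [1.0, 1.0, 1.0, 1.0]  # the same total weight spread evenly

l1_sparse = elastic_net_penalty(sparse, c1=1.0, c2=0.0)
l1_spread = elastic_net_penalty(spread, c1=1.0, c2=0.0)
l2_sparse = elastic_net_penalty(sparse, c1=0.0, c2=1.0)
l2_spread = elastic_net_penalty(spread, c1=0.0, c2=1.0)

print(l1_sparse, l1_spread)  # 4.0 4.0: L1 is indifferent between the two
print(l2_sparse, l2_spread)  # 16.0 4.0: L2 penalises the single large weight more
```

L1 charges the same price for both vectors, so it has no objection to solutions in which many weights are exactly zero, whereas L2 prefers many small weights over a few large ones.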
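We define a parameter space for `c1` and `c2`, and use the flat F1 score (weighted over the entity labels) to compare the individual models. We'll rely on three-fold cross-validation to score each of 50 candidates. We use randomized search, which means we don't try out every parameter setting, but let the procedure sample randomly from the distributions we specify in the parameter space. It will do this 50 times (`n_iter`). The process takes a while, but it's worth the wait.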
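To get a feel for what the randomized search draws, the sketch below samples candidate settings with Python's standard library instead of `scipy`. It assumes exponential distributions with means 0.5 for `c1` and 0.05 for `c2`. The helper `sample_candidates` and the fixed seed are our own, and the sampled values will not match those of an actual search.

```python
import random


def sample_candidates(n_iter, c1_scale=0.5, c2_scale=0.05, seed=0):
    """Draw n_iter (c1, c2) pairs from exponential distributions with the given means."""
    rng = random.Random(seed)
    return [
        # expovariate takes the rate, which is the inverse of the mean (scale).
        {"c1": rng.expovariate(1 / c1_scale), "c2": rng.expovariate(1 / c2_scale)}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(n_iter=50)
for params in candidates[:3]:
    print(params)

# Each candidate is trained once per cross-validation fold.
total_fits = len(candidates) * 3
print("total model fits with 3-fold CV:", total_fits)
```

With 50 candidates and three folds, this comes to 150 model fits, which explains why the search takes a while.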
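import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
crf = crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
# Sample c1 (L1) and c2 (L2) from exponential distributions with means 0.5 and 0.05.
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}
# Score candidates by weighted F1 over the entity labels (O is excluded).
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)
# Randomized search: 50 sampled candidates, each scored with 3-fold cross-validation.
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)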
Fitting 3 folds for each of 50 candidates, totalling 150 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 3.2min
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 24.3min finished
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
estimator=CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=True, averaging=None, c=None, c1=None, c2=None,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error...e,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose=False),
fit_params=None, iid='warn', n_iter=50, n_jobs=-1,
param_distributions={'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f9947f04e10>, 'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f9947f04c88>},
pre_dispatch='2*n_jobs', random_state=None, refit=True,
return_train_score='warn',
scoring=make_scorer(flat_f1_score, average=weighted, labels=['B-ORG', 'B-MISC', 'B-PER', 'I-PER', 'B-LOC', 'I-MISC', 'I-ORG', 'I-LOC']),
verbose=1)
Let's take a look at the best hyperparameter settings. Our randomized search suggests a combination of L1 and L2 regularization.
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))
best params: {'c1': 0.08869645933566639, 'c2': 0.005642379370340676}
best CV score: 0.7608794798691931
model size: 1.06M
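To find out what precision, recall and F1 score this translates to, we take the best estimator from the randomized search and evaluate it on the test set. This indeed shows a nice improvement over our initial model. Our micro-averaged F1 score goes up from 77% to 79.1%. Both precision and recall have improved, and we see positive results for all four entity types.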
best_crf = rs.best_estimator_
y_pred = best_crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))
precision recall f1-score support
B-LOC 0.849 0.863 0.856 774
I-LOC 0.359 0.571 0.441 49
B-MISC 0.847 0.622 0.717 1187
I-MISC 0.664 0.415 0.511 410
B-ORG 0.806 0.727 0.764 882
I-ORG 0.772 0.677 0.721 551
B-PER 0.834 0.903 0.867 1098
I-PER 0.892 0.958 0.924 807
micro avg 0.823 0.761 0.791 5758
macro avg 0.753 0.717 0.725 5758
weighted avg 0.820 0.761 0.784 5758
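Conditional Random Fields have lost some of their popularity since the advent of neural network models. Nevertheless, they remain very effective for named entity recognition, particularly when they take word embedding information into account.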