
Hidden Markov Model for Part of Speech Tagging

2017/09/23

Introduction

In this post, we will train a Hidden Markov Model for part-of-speech tagging. The first 10K tagged sentences of the 'news' category in the Brown corpus will be used to train the Hidden Markov Model. A passage from 'alice' will then be tagged using this Hidden Markov Model.

Extract data

import nltk
import re
from nltk.tag.sequential import UnigramTagger
from nltk.corpus import brown
from nltk.tokenize import TreebankWordTokenizer

alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')
tagger = UnigramTagger(brown.tagged_sents(categories='news')[:10000])
tokens = TreebankWordTokenizer().tokenize(alice[:1000])
tags = tagger.tag(tokens)
sents = brown.tagged_sents(categories='news')[:10000]
words = [w[0] for s in sents for w in s]
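
As a quick look at the extracted data, the sketch below prints the number of training sentences and tokens and the unigram tagger's guesses for the first few tokens of 'alice'. This is only a minimal check built on the variables defined above; no particular counts are claimed here.

print(len(sents))    ## number of training sentences (at most 10000)
print(len(words))    ## total number of training tokens
print(tags[:5])      ## the unigram tagger's guesses for the first few Alice tokens
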
Compress the tags

The tags provided by the Brown corpus are too detailed, so they can be compressed to some extent.
Delete the suffix of each generated tag, as follows: all hyphens and everything after them are removed, and all asterisks are removed, while the $ sign is kept as it is.

NN-TL -> NN
BEZ* -> BEZ
FW-* -> FW
:-HL -> :

After this process, tags like "NN-TL" and "NN-JJ" become "NN" without their suffix, and "JJ-TL" becomes "JJ".
Before compressing the tags, the total number of tags is 218; after compressing, the number of tags decreases to 88. Although this loses some of the precision of the tags, it improves the accuracy of the output prediction.

sents_temp = []
for i in range(len(sents)):
    sents_revised_list = []
    for j in range(len(sents[i])):
        content = list(sents[i][j])
        #content[1] = re.sub(r"([\$]*[\-\+]+[\w\$\*]+)+$|\$+$|(?<=\w)\*+$", "", content[1]) ## remove the $ version
        content[1] = re.sub(r"([\-\+]+[\w\$\*]+)+$|(?<=\w)\*+$", "", content[1]) ## keep the $ version
        content[1] = re.sub(r"(?<=\w)\*$", "", content[1])
        content_tuple = tuple(content)
        sents_revised_list.append(content_tuple)
    sents_temp.append(sents_revised_list)
temp = sents
sents = sents_temp
sents_temp = temp
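
As an illustration of the compression rule, the sketch below runs the same regular expressions on a few sample Brown tags (the sample tags are chosen only for illustration):

## show how a few raw Brown tags are compressed by the regex above
for raw_tag in ['NN-TL', 'JJ-TL', 'FW-IN', ':-HL', 'PPSS+BER', 'MD*']:
    compressed = re.sub(r"([\-\+]+[\w\$\*]+)+$|(?<=\w)\*+$", "", raw_tag)
    compressed = re.sub(r"(?<=\w)\*$", "", compressed)
    print(raw_tag, '->', compressed)   ## e.g. NN-TL -> NN, PPSS+BER -> PPSS, MD* -> MD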

Generate a set of tags and a set of words

  • Python maintains sets without duplicate elements. After the sets of tags and words are built, a list of tags and a list of words containing all the elements of each set are generated, and then dictionaries for the tag indices and word indices are built.
  • The tags list and words list are used to look up a tag or a word by its index, while the tags dictionary and words dictionary are used to look up an index by its tag or word.
  • The tags for the start state and the end state are added to the tags list.

sents_list = [t for s in sents for t in s]
tags_set = set()
words_set = set()
for i in range(len(sents_list)):
    tags_set.add(sents_list[i][1])
    words_set.add(sents_list[i][0])
tags_list = list(tags_set)
words_list = list(words_set)

## add the start and end state
tags_list.insert(0, 'START_STATE')
tags_list.insert(len(tags_list), 'END_STATE')

tags_hash = {}
for index, tag in enumerate(tags_list):
    tags_hash[tag] = index
words_hash = {}
for index, word in enumerate(words_list):
    words_hash[word] = index

print("Tags after compressing:")
print(tags_hash)

Tags after compressing:
{'START_STATE': 0, 'NPS': 1, 'WPO': 2, 'PPLS': 3, 'BEN': 4, 'NNS': 5, 'BER': 6, 'VBN': 7, 'CS': 8, 'TO': 9, 'DTX': 10, ')': 11, ',': 12, 'NP': 13, 'VBD': 14, 'FW': 15, 'PP$$': 16, 'JJR': 17, 'WPS': 18, 'OD': 19, 'QLP': 20, 'AT': 21, 'EX': 22, 'CC': 23, 'HVD': 24, 'PPS': 25, 'PP$': 26, 'NR$': 27, '``': 28, 'QL': 29, 'JJS': 30, 'BE': 31, 'HVN': 32, 'PN$': 33, 'ABN': 34, 'NN$': 35, 'AP': 36, 'NNS$': 37, 'RB$': 38, 'VBZ': 39, 'NR': 40, 'PPL': 41, 'NP$': 42, 'DT$': 43, 'BEDZ': 44, 'RBT': 45, 'MD': 46, 'DT': 47, 'NN': 48, 'ABL': 49, 'BEM': 50, 'BED': 51, 'AP$': 52, 'HV': 53, '(': 54, 'WDT': 55, 'DTS': 56, 'RP': 57, 'VBG': 58, 'HVG': 59, 'NPS$': 60, 'BEZ': 61, 'JJT': 62, '.': 63, "''": 64, 'DOZ': 65, 'ABX': 66, 'CD': 67, 'DOD': 68, 'RB': 69, 'DO': 70, 'BEG': 71, 'RBR': 72, "'": 73, '--': 74, 'PPSS': 75, 'WQL': 76, '*': 77, 'IN': 78, 'VB': 79, 'PPO': 80, 'HVZ': 81, 'WP$': 82, 'WRB': 83, 'PN': 84, 'CD$': 85, 'JJ': 86, 'UH': 87, 'DTI': 88, ':': 89, 'END_STATE': 90}
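
Since tags_hash was built by enumerating tags_list (and words_hash from words_list), the two directions are consistent with each other. A minimal check, using 'NN' only as an example tag:

## index -> tag and tag -> index round-trip
assert tags_list[tags_hash['NN']] == 'NN'
assert words_hash[words_list[0]] == 0
print(len(tags_list), len(words_list))   ## number of tags (including START/END) and of distinct words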

Forward Algorithm

Transition Probability Matrix: A

  • The number of rows and the number of columns are both equal to the number of tags (including the start and end states).
  • To reduce the effect of zero entries in this matrix, a small value 1e-10 is added to every element of A, which is then re-normalized (a quick sanity check is sketched after the code below).

import numpy as np
A_len = len(tags_hash)
A = np.zeros((A_len, A_len))

start_state = tags_hash['START_STATE']
end_state = tags_hash['END_STATE']
for i in range(len(sents)):

    ## transition from the start state to the first tag of the sentence
    start_tag_index = tags_hash[sents[i][0][1]]
    A[start_state][start_tag_index] += 1

    ## transitions between consecutive tags inside the sentence
    for j in range(len(sents[i]) - 1):
        row_index = tags_hash[sents[i][j][1]]
        col_index = tags_hash[sents[i][j + 1][1]]
        A[row_index][col_index] += 1

    ## transition from the last tag of the sentence to the end state
    end_tag_index = tags_hash[sents[i][len(sents[i]) - 1][1]]
    A[end_tag_index][end_state] += 1

## the END_STATE row has no outgoing transitions, so give it a tiny mass to avoid dividing by zero
A[A_len - 1] += 1e-10
A_row_sum = np.sum(A, axis = 1)
A_row_sum = A_row_sum.reshape(A.shape[0], 1)
A_normalized = A / A_row_sum
## add 1e-10 to every entry and re-normalize
A_normalized += 1e-10
A_row_sum = np.sum(A_normalized, axis = 1)
A_row_sum = A_row_sum.reshape(A_normalized.shape[0], 1)
A_normalized = A_normalized / A_row_sum
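
As a sanity check on the construction above, every row of A_normalized should now sum to 1. A minimal sketch, using 'AT' only as an example tag:

## each row of A_normalized is a probability distribution over the next tag
assert np.allclose(np.sum(A_normalized, axis=1), 1.0)
## estimated probability that a sentence starts with an article (tag 'AT')
print(A_normalized[tags_hash['START_STATE']][tags_hash['AT']])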

State Observation Likelihood Matrix: B

  • The rows of matrix B correspond to the tags and the columns to the words in our training set.
  • To reduce the effect of zero entries in this matrix, a small value 1e-10 is added to every element of B, which is then re-normalized (a quick check is sketched after the code below).

B_row = len(tags_hash)
B_col = len(words_hash)
B = np.zeros((B_row, B_col))

for i in range(len(sents_list)):
    row_index = tags_hash[sents_list[i][1]]
    col_index = words_hash[sents_list[i][0]]
    B[row_index][col_index] += 1

B_col_sum = np.sum(B, axis = 0)
B_col_sum = B_col_sum.reshape(1, B_col)
B_normalized = B / B_col_sum

## re-normalize
B_normalized += 1e-10
B_col_sum = np.sum(B_normalized, axis = 0)
B_col_sum = B_col_sum.reshape(1, B_col)
B_normalized = B_normalized / B_col_sum
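
Note that B is normalized over its columns here, so each column (one per word) sums to 1. A minimal check, using 'the' only as an example word:

## each column of B_normalized sums to 1 under this per-word normalization
assert np.allclose(np.sum(B_normalized, axis=0), 1.0)
## mass assigned to the 'AT' tag within the column for the word 'the'
print(B_normalized[tags_hash['AT']][words_hash['the']])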

Get a test case

In this post, we use a passage from 'alice' as the test case.

from nltk.tokenize import word_tokenize
test_case = nltk.corpus.gutenberg.raw('carroll-alice.txt')[91:683]
words_test = word_tokenize(test_case)
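
word_tokenize relies on NLTK's Punkt tokenizer models; if they are not yet installed, a one-time download is needed. A minimal sketch (the expected first tokens come from the tagged output shown later in this post):

## nltk.download('punkt')          ## uncomment if word_tokenize raises a LookupError
print(len(words_test))             ## number of tokens in the test passage
print(words_test[:3])              ## ['Alice', 'was', 'beginning']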

Forward Algorithm

The cell below implements the forward algorithm. The matrix 'forward' holds the probabilities of the states at each position of the test sequence. When a test word is not in the training vocabulary (raising a KeyError), a uniform emission probability of 1/num_state is used instead.

num_state = len(tags_hash)
num_obv = len(words_test)
forward = np.zeros((num_obv, num_state))
start_state = tags_hash['START_STATE']
end_state = tags_hash['END_STATE']

try:
    word_index = words_hash[words_test[0]]
    for y in range(num_state):
        forward[0][y] = A_normalized[start_state][y] * B_normalized[y][word_index]

except KeyError:
    for y in range(num_state):
        forward[0][y] = A_normalized[start_state][y] * (1/num_state)

for t in range(1, num_obv):
    try:
        word_index = words_hash[words_test[t]]
        for y in range(num_state):
            for y_pre in range(num_state):
                forward[t][y] += forward[t-1][y_pre] * A_normalized[y_pre][y] * B_normalized[y][word_index]

    except KeyError:
        for y in range(num_state):
            for y_pre in range(num_state):
                forward[t][y] += forward[t-1][y_pre] * A_normalized[y_pre][y] * (1/num_state)
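
The forward matrix also gives the likelihood of the whole observation sequence: in the standard forward algorithm, the probabilities of the last step are summed, weighted by the transition into the end state. This termination step is not part of the decoding below; the sketch is only an illustration built on the variables above:

## termination step: total likelihood of the observed word sequence
sequence_likelihood = 0.0
for y in range(num_state):
    sequence_likelihood += forward[num_obv - 1][y] * A_normalized[y][end_state]
print(sequence_likelihood)   ## may underflow to 0.0 for long sequences without log-space scaling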

Decoding of Forward Algorithm

Given the forward matrix, the state at each step can be predicted by taking the state with the largest probability at that step.

forward_output = []
for t in range(num_obv):
    mx = -1
    tag_id = 0
    for index, tags in enumerate(forward[t]):
        if mx < forward[t][index]:
            mx = forward[t][index]
            tag_id = index
    forward_output.append((words_test[t], tags_list[tag_id]))
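
The same decoding can be written more compactly with numpy's argmax; the sketch below is equivalent to the loop above, since both pick the first index of the maximum in each row:

## decode the forward matrix by taking the most probable state at each step
forward_output_np = [(words_test[t], tags_list[int(np.argmax(forward[t]))]) for t in range(num_obv)]
assert forward_output_np == forward_output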

The result of the forward algorithm is shown below.

Because the tags were compressed, all tags in the result appear without their suffixes.

print("The result of the test case from the forward algorithm is shown below")
print(forward_output)

The result of the test case from the forward algorithm is shown below
[('Alice', 'NP'), ('was', 'BEDZ'), ('beginning', 'NN'), ('to', 'IN'), ('get', 'VB'), ('very', 'QL'), ('tired', 'VBN'), ('of', 'IN'), ('sitting', 'VBG'), ('by', 'IN'), ('her', 'PP$'), ('sister', 'NN'), ('on', 'IN'), ('the', 'AT'), ('bank', 'NN'), (',', ','), ('and', 'CC'), ('of', 'IN'), ('having', 'HVG'), ('nothing', 'PN'), ('to', 'IN'), ('do', 'DO'), (':', ':'), ('once', 'RB'), ('or', 'CC'), ('twice', 'RB'), ('she', 'PPS'), ('had', 'HVD'), ('peeped', 'VBN'), ('into', 'IN'), ('the', 'AT'), ('book', 'NN'), ('her', 'PP$'), ('sister', 'NN'), ('was', 'BEDZ'), ('reading', 'VBG'), (',', ','), ('but', 'CC'), ('it', 'PPS'), ('had', 'HVD'), ('no', 'AT'), ('pictures', 'NNS'), ('or', 'CC'), ('conversations', 'NNS'), ('in', 'IN'), ('it', 'PPO'), (',', ','), ("'and", 'NP'), ('what', 'WDT'), ('is', 'BEZ'), ('the', 'AT'), ('use', 'NN'), ('of', 'IN'), ('a', 'AT'), ('book', 'NN'), (',', ','), ("'", "'"), ('thought', 'VBN'), ('Alice', 'NP'), ("'without", 'NP'), ('pictures', 'NNS'), ('or', 'CC'), ('conversation', 'NN'), ('?', '.'), ("'", "'"), ('So', 'RB'), ('she', 'PPS'), ('was', 'BEDZ'), ('considering', 'VBG'), ('in', 'IN'), ('her', 'PP$'), ('own', 'JJ'), ('mind', 'NN'), ('(', '('), ('as', 'CS'), ('well', 'RB'), ('as', 'CS'), ('she', 'PPS'), ('could', 'MD'), (',', ','), ('for', 'IN'), ('the', 'AT'), ('hot', 'JJ'), ('day', 'NN'), ('made', 'VBD'), ('her', 'PP$'), ('feel', 'NN'), ('very', 'QL'), ('sleepy', 'JJ'), ('and', 'CC'), ('stupid', 'NP'), (')', ')'), (',', ','), ('whether', 'CS'), ('the', 'AT'), ('pleasure', 'NN'), ('of', 'IN'), ('making', 'VBG'), ('a', 'AT'), ('daisy-chain', 'NN'), ('would', 'MD'), ('be', 'BE'), ('worth', 'JJ'), ('the', 'AT'), ('trouble', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('up', 'RP'), ('and', 'CC'), ('picking', 'VBG'), ('the', 'AT'), ('daisies', 'NN'), (',', ','), ('when', 'WRB'), ('suddenly', 'RB'), ('a', 'AT'), ('White', 'JJ'), ('Rabbit', 'NN'), ('with', 'IN'), ('pink', 'JJ'), ('eyes', 'NNS'), ('ran', 'VBD'), ('close', 'RB'), ('by', 'IN'), ('her', 'PP$'), ('.', '.')]

Viterbi Algorithm

  • The implementation of the Viterbi algorithm is given below.
  • The Node class stores a previous node and a next node, so it can serve as a back pointer in the Viterbi algorithm (a small usage sketch follows the class definition).

class Node:
    def __init__(self, state, observ, val, pre_node, next_node):
        self.state = state
        self.observ = observ
        self.val = val
        self.pre_node = pre_node
        self.next_node = next_node
    def __repr__(self):
        return self.state + "-->" + self.observ + "-->" + str(self.val) + '\n'
    def __lt__(self, other):
        return (self.val < other.val)
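
Only pre_node is used as the back pointer in the code below; next_node and __lt__ are not needed for the backtrace. A tiny sketch of how the chain is followed, with toy values and tags taken from the tagged output shown later:

## toy back-pointer chain: c -> b -> a
a = Node('START_STATE', '', 0, None, None)
b = Node('NP', 'Alice', 0.01, a, None)
c = Node('BEDZ', 'was', 0.001, b, None)
print(c.pre_node.state)            ## NP
print(c.pre_node.pre_node.state)   ## START_STATE
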
num_state = len(tags_hash)
num_obv = len(words_test)
#viterbi = [[None for i in range(num_state)] for j in range(num_obv)]
viterbi = []
path = []

start_state = tags_hash['START_STATE']
end_state = tags_hash['END_STATE']

start_node = Node("START_STATE", "", 0, None, None)

## initialization step: transition from the start state to each tag for the first word
try:
    word_index = words_hash[words_test[0]]
    curr_step = []
    for y in range(1, num_state-1):
        val = A_normalized[start_state][y] * B_normalized[y][word_index]
        curr_node = Node(tags_list[y], words_test[0], val, start_node, None)
        #print(curr_node)
        curr_step.append(curr_node)
    viterbi.append(curr_step)

except KeyError:
    ## unknown word: fall back to a uniform emission probability
    curr_step = []
    for y in range(1, num_state-1):
        val = A_normalized[start_state][y] * (1/(num_state-2))
        curr_node = Node(tags_list[y], words_test[0], val, start_node, None)
        curr_step.append(curr_node)
    viterbi.append(curr_step)

## recursion step: for each word, keep the best predecessor for every tag
for t in range(1, num_obv):
    try:
        word_index = words_hash[words_test[t]]
        curr_step = []
        for y in range(1, num_state-1):
            val = -1
            prenode = None
            for y_pre in range(1, num_state-1):
                temp_val = viterbi[t-1][y_pre-1].val * A_normalized[y_pre][y] * B_normalized[y][word_index]
                if val < temp_val:
                    val = temp_val
                    prenode = viterbi[t-1][y_pre-1]
            curr_node = Node(tags_list[y], words_test[t], val, prenode, None)
            curr_step.append(curr_node)
        viterbi.append(curr_step)

    except KeyError:
        curr_step = []
        for y in range(1, num_state-1):
            val = -1
            prenode = None
            for y_pre in range(1, num_state-1):
                temp_val = viterbi[t-1][y_pre-1].val * A_normalized[y_pre][y] * (1/(num_state-2))
                if val < temp_val:
                    val = temp_val
                    prenode = viterbi[t-1][y_pre-1]
            curr_node = Node(tags_list[y], words_test[t], val, prenode, None)
            curr_step.append(curr_node)
        viterbi.append(curr_step)

assert(len(viterbi) == num_obv)

## termination step: pick the best last node via the transition into the end state
val = -1
prenode = None
for y_pre in range(1, num_state-1):
    temp_val = viterbi[num_obv-1][y_pre-1].val * A_normalized[y_pre][end_state]
    if val < temp_val:
        val = temp_val
        prenode = viterbi[num_obv-1][y_pre-1]
end_node = Node("END_STATE", "", val, prenode, None)

The result of the Viterbi algorithm is shown below

Because the tags were compressed, all tags in the result appear without their suffixes.

node = end_node
outlist = []
while True:
    if (node is None):
        break
    outlist.append((node.observ, node.state))
    node = node.pre_node

outlist.reverse()
print("The result of the test case from the Viterbi algorithm is shown below")
print(outlist[1:len(outlist)-1])

The result of the test case from the Viterbi algorithm is shown below
[('Alice', 'NP'), ('was', 'BEDZ'), ('beginning', 'NN'), ('to', 'TO'), ('get', 'VB'), ('very', 'QL'), ('tired', 'VBN'), ('of', 'IN'), ('sitting', 'VBG'), ('by', 'IN'), ('her', 'PP$'), ('sister', 'NN'), ('on', 'IN'), ('the', 'AT'), ('bank', 'NN'), (',', ','), ('and', 'CC'), ('of', 'IN'), ('having', 'HVG'), ('nothing', 'PN'), ('to', 'TO'), ('do', 'DO'), (':', '*'), ('once', 'RB'), ('or', 'CC'), ('twice', 'RB'), ('she', 'PPS'), ('had', 'HVD'), ('peeped', 'VBN'), ('into', 'IN'), ('the', 'AT'), ('book', 'NN'), ('her', 'PP$'), ('sister', 'NN'), ('was', 'BEDZ'), ('reading', 'NN'), (',', ','), ('but', 'CC'), ('it', 'PPS'), ('had', 'HVD'), ('no', 'AT'), ('pictures', 'NNS'), ('or', 'CC'), ('conversations', 'NNS'), ('in', 'IN'), ('it', 'PPO'), (',', ','), ("'and", 'IN'), ('what', 'WDT'), ('is', 'BEZ'), ('the', 'AT'), ('use', 'NN'), ('of', 'IN'), ('a', 'AT'), ('book', 'NN'), (',', ','), ("'", "'"), ('thought', 'VBN'), ('Alice', 'NP'), ("'without", 'CD'), ('pictures', 'NNS'), ('or', 'CC'), ('conversation', 'NN'), ('?', '.'), ("'", "'"), ('So', 'RB'), ('she', 'PPS'), ('was', 'BEDZ'), ('considering', 'VBG'), ('in', 'IN'), ('her', 'PP$'), ('own', 'JJ'), ('mind', 'NN'), ('(', '('), ('as', 'CS'), ('well', 'RB'), ('as', 'CS'), ('she', 'PPS'), ('could', 'MD'), (',', ','), ('for', 'IN'), ('the', 'AT'), ('hot', 'JJ'), ('day', 'NN'), ('made', 'VBD'), ('her', 'PP$'), ('feel', 'NN'), ('very', 'QL'), ('sleepy', 'JJ'), ('and', 'CC'), ('stupid', 'NP'), (')', ')'), (',', ','), ('whether', 'CS'), ('the', 'AT'), ('pleasure', 'NN'), ('of', 'IN'), ('making', 'VBG'), ('a', 'AT'), ('daisy-chain', 'NN'), ('would', 'MD'), ('be', 'BE'), ('worth', 'JJ'), ('the', 'AT'), ('trouble', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('up', 'RP'), ('and', 'CC'), ('picking', 'VBG'), ('the', 'AT'), ('daisies', 'NN'), (',', ','), ('when', 'WRB'), ('suddenly', 'RB'), ('a', 'AT'), ('White', 'JJ'), ('Rabbit', 'NN'), ('with', 'IN'), ('pink', 'JJ'), ('eyes', 'NNS'), ('ran', 'VBD'), ('close', 'RB'), ('by', 'IN'), ('her', 'PPO'), ('.', '.')]
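
Since both decoders ran on the same passage, it is straightforward to measure how often their tags agree. A minimal sketch; no particular agreement rate is claimed:

## position-by-position agreement between the forward decoding and the Viterbi decoding
viterbi_output = outlist[1:len(outlist)-1]
agree = sum(1 for f, v in zip(forward_output, viterbi_output) if f[1] == v[1])
print(agree, '/', len(forward_output), 'tags agree between the two decoders')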