Introduction
In this post, we will train a Hidden Markov Model for part-of-speech tagging. The first 10K tagged sentences of the 'news' category in the Brown corpus will be used to train the Hidden Markov Model. A sentence from 'alice' will then be tagged by this Hidden Markov Model.
Extract data
import nltk
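The rest of this cell is not shown above. The sketch below is one way the training data could be loaded; the variable name sents and the 10K slice mirror how the data is described and used later in this post, so treat it as a reconstruction rather than the original code.
import nltk

## requires the Brown corpus to be downloaded, e.g. nltk.download('brown')
## first 10K tagged sentences of the 'news' category in the Brown corpus
sents = nltk.corpus.brown.tagged_sents(categories='news')[:10000]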
Compress the tags
The tags provided by the Brown corpus are too detailed; they can be compressed to some extent.
Delete the postfixes of the generated tags, as follows. All dashes and everything after them are removed, and all asterisks are removed, while the $ is kept as it is.
NN-TL -> NN
BEZ -> BEZ
FW- -> FW
:-HL -> :
After this process, tags like "NN-TL" and "NN-JJ" become "NN" without their postfixes, and "JJ-TL" becomes "JJ".
Before compressing, the total number of tags is 218; after compressing, the number of tags decreases to 88. Although this loses some of the detail of the tags, it improves the accuracy of the output prediction.
import re

sents_temp = []
for i in range(len(sents)):
    sents_revised_list = []
    for j in range(len(sents[i])):
        content = list(sents[i][j])
        #content[1] = re.sub("([\$]*[\-\+]+[\w\$\*]+)+$|\$+$|(?<=\w)\*+$", "", content[1])  ## remove the $ version
        content[1] = re.sub("([\-\+]+[\w\$\*]+)+$|(?<=\w)\*+$", "", content[1])  ## keep the $ version
        content[1] = re.sub("(?<=\w)\*$", "", content[1])
        content_tuple = tuple(content)
        sents_revised_list.append(content_tuple)
    sents_temp.append(sents_revised_list)
temp = sents
sents = sents_temp
sents_temp = temp
Generate a set of tags and a set of words
- Python can maintain a set without duplicated elements. After the sets of tags and words are built, a list of tags and a list of words containing all the elements of the sets are generated. Then dictionaries for the tag indices and word indices are generated (see the sketch after the next code line).
- The tags list and words list are used to get a tag or a word via its index, and the tags dictionary and words dictionary are used to get the index of a given tag or word.
- The tags of the start state and end state are added to the tags list and tags set.
sents_list = [t for s in sents for t in s]
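The rest of that cell is truncated. The sketch below shows one way the sets, lists, and index dictionaries described above could be built; the names tags_list, tags_hash, and words_hash match the ones used later in the post, but the exact construction is an assumption.
tags_set = set(t for (w, t) in sents_list)      # unique tags after compression
words_set = set(w for (w, t) in sents_list)     # unique words in the training set
tags_list = ['START_STATE'] + list(tags_set) + ['END_STATE']   # start/end states added
words_list = list(words_set)
tags_hash = {tag: index for index, tag in enumerate(tags_list)}       # tag  -> index
words_hash = {word: index for index, word in enumerate(words_list)}   # word -> index
The ordering of the middle entries depends on set iteration, so the exact indices may differ from the output shown below.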
print("Tags after compressing:")
print(tags_hash)
Tags after compressing:
{'START_STATE': 0, 'NPS': 1, 'WPO': 2, 'PPLS': 3, 'BEN': 4, 'NNS': 5, 'BER': 6, 'VBN': 7, 'CS': 8, 'TO': 9, 'DTX': 10, ')': 11, ',': 12, 'NP': 13, 'VBD': 14, 'FW': 15, 'PP$$': 16, 'JJR': 17, 'WPS': 18, 'OD': 19, 'QLP': 20, 'AT': 21, 'EX': 22, 'CC': 23, 'HVD': 24, 'PPS': 25, 'PP$': 26, 'NR$': 27, '``': 28, 'QL': 29, 'JJS': 30, 'BE': 31, 'HVN': 32, 'PN$': 33, 'ABN': 34, 'NN$': 35, 'AP': 36, 'NNS$': 37, 'RB$': 38, 'VBZ': 39, 'NR': 40, 'PPL': 41, 'NP$': 42, 'DT$': 43, 'BEDZ': 44, 'RBT': 45, 'MD': 46, 'DT': 47, 'NN': 48, 'ABL': 49, 'BEM': 50, 'BED': 51, 'AP$': 52, 'HV': 53, '(': 54, 'WDT': 55, 'DTS': 56, 'RP': 57, 'VBG': 58, 'HVG': 59, 'NPS$': 60, 'BEZ': 61, 'JJT': 62, '.': 63, "''": 64, 'DOZ': 65, 'ABX': 66, 'CD': 67, 'DOD': 68, 'RB': 69, 'DO': 70, 'BEG': 71, 'RBR': 72, "'": 73, '--': 74, 'PPSS': 75, 'WQL': 76, '*': 77, 'IN': 78, 'VB': 79, 'PPO': 80, 'HVZ': 81, 'WP$': 82, 'WRB': 83, 'PN': 84, 'CD$': 85, 'JJ': 86, 'UH': 87, 'DTI': 88, ':': 89, 'END_STATE': 90}
Forward Algorithm
Transition Probability Matrix: A
- The number of rows and the number of columns are both equal to the number of tags.
- In order to reduce the effect of zero values in this matrix, a small value of 1e-10 is added to every element of matrix A, which is then re-normalized (a sketch of this step is given below).
import numpy as np
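The rest of that cell is truncated as well. The sketch below shows how the transition counts and the 1e-10 smoothing could be implemented; it assumes each row of A (the previous tag) is normalized over the next tag, which matches how A_normalized is indexed in the forward algorithm later, but the original normalization code is not shown in the post.
A_size = len(tags_hash)
A = np.zeros((A_size, A_size))
for sent in sents:
    prev = tags_hash['START_STATE']
    for (word, tag) in sent:
        cur = tags_hash[tag]
        A[prev][cur] += 1
        prev = cur
    A[prev][tags_hash['END_STATE']] += 1      # close every sentence with the end state
## normalize each row, then add the smoothing value and re-normalize
A_row_sum = np.sum(A, axis=1).reshape(A_size, 1)
A_row_sum[A_row_sum == 0] = 1                 # the END_STATE row has no outgoing transitions
A_normalized = A / A_row_sum
A_normalized += 1e-10
A_normalized = A_normalized / np.sum(A_normalized, axis=1).reshape(A_size, 1)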
State Observation Likelihood Matrix: B
- The rows of matrix B correspond to the generated tags, and the columns correspond to the words in our training set.
- In order to reduce the effect of zero values in this matrix, a small value of 1e-10 is added to every element of matrix B, which is then re-normalized.
B_row = len(tags_hash)
B_col = len(words_hash)
B = np.zeros((B_row, B_col))
for i in range(len(sents_list)):
    row_index = tags_hash[sents_list[i][1]]
    col_index = words_hash[sents_list[i][0]]
    B[row_index][col_index] += 1
B_col_sum = np.sum(B, axis = 0)
B_col_sum = B_col_sum.reshape(1, B_col)
B_normalized = B / B_col_sum
## re-normalized
B_normalized += 1e-10
B_col_sum = np.sum(B_normalized, axis = 0)
B_col_sum = B_col_sum.reshape(1, B_col)
B_normalized = B_normalized / B_col_sum
Get a test case
In this post, we use a sentence from 'alice' as the test case.
from nltk.tokenize import word_tokenize

test_case = nltk.corpus.gutenberg.raw('carroll-alice.txt')[91:683]
words_test = word_tokenize(test_case)
Forward Algorithm
The cell below is the implementation of the forward algorithm. The matrix 'forward' holds the probabilities of the states at each step of the test sequence.
num_state = len(tags_hash)
num_obv = len(words_test)
forward = np.zeros((num_obv, num_state))
start_state = tags_hash['START_STATE']
end_state = tags_hash['END_STATE']
## initialization step: transitions out of the start state
try:
    word_index = words_hash[words_test[0]]
    for y in range(num_state):
        forward[0][y] = A_normalized[start_state][y] * B_normalized[y][word_index]
except KeyError:
    ## unknown word: fall back to a uniform emission probability
    for y in range(num_state):
        forward[0][y] = A_normalized[start_state][y] * (1/num_state)
## recursion step
for t in range(1, num_obv):
    try:
        word_index = words_hash[words_test[t]]
        for y in range(num_state):
            for y_pre in range(num_state):
                forward[t][y] += forward[t-1][y_pre] * A_normalized[y_pre][y] * B_normalized[y][word_index]
    except KeyError:
        ## unknown word: fall back to a uniform emission probability
        for y in range(num_state):
            for y_pre in range(num_state):
                forward[t][y] += forward[t-1][y_pre] * A_normalized[y_pre][y] * (1/num_state)
Decoding of Forward Algorithm
For a given forward matrix, the state at each step can be predicted by finding the state with the largest probability at that step.
forward_output = []
for t in range(num_obv):
    mx = -1
    tag_id = 0
    for index, tags in enumerate(forward[t]):
        if mx < forward[t][index]:
            mx = forward[t][index]
            tag_id = index
    forward_output.append((words_test[t], tags_list[tag_id]))
The result of the forward algorithm is shown below.
Because the tags are compressed, all the tags in the result are without postfixes.
print("The result of the test case from the forward algorithm is shown below")
print(forward_output)
The result of the test case from the forward algorithm is shown below
Viterbi Algorithm
- The implementation of the Viterbi algorithm is given below.
- The Node class stores a previous node and a next node, so it can be used as a back pointer in the Viterbi algorithm (a sketch of a possible definition is given below).
class Node:
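The class body is truncated above. A minimal sketch of what the full definition could look like, consistent with the attributes (observ, state, pre_node) used in the decoding cell below; the constructor signature is an assumption.
class Node:
    def __init__(self, observ, state, pre_node=None, next_node=None):
        self.observ = observ          # the observed word at this step
        self.state = state            # the predicted tag at this step
        self.pre_node = pre_node      # back pointer to the best previous node
        self.next_node = next_node    # optional forward link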
num_state = len(tags_hash)
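Only the first line of the Viterbi cell survives above. The sketch below is a reconstruction of what the full cell could look like: it reuses the same unknown-word fallback as the forward algorithm and builds the end_node that the decoding cell below walks back from. The details are assumptions, not the original code.
num_state = len(tags_hash)
num_obv = len(words_test)
viterbi = np.zeros((num_obv, num_state))
back = [[None] * num_state for _ in range(num_obv)]   # back pointers stored as Node objects
start_state = tags_hash['START_STATE']

def emission(t, y):
    ## same unknown-word fallback as in the forward algorithm
    try:
        return B_normalized[y][words_hash[words_test[t]]]
    except KeyError:
        return 1 / num_state

start_node = Node(None, 'START_STATE')
for y in range(num_state):
    viterbi[0][y] = A_normalized[start_state][y] * emission(0, y)
    back[0][y] = Node(words_test[0], tags_list[y], start_node)
for t in range(1, num_obv):
    for y in range(num_state):
        scores = [viterbi[t-1][y_pre] * A_normalized[y_pre][y] for y_pre in range(num_state)]
        best_pre = int(np.argmax(scores))
        viterbi[t][y] = scores[best_pre] * emission(t, y)
        back[t][y] = Node(words_test[t], tags_list[y], back[t-1][best_pre])
## termination: choose the best final state via the transition into END_STATE
end_scores = [viterbi[num_obv-1][y] * A_normalized[y][tags_hash['END_STATE']]
              for y in range(num_state)]
end_node = Node(None, 'END_STATE', back[num_obv-1][int(np.argmax(end_scores))])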
The result of the Viterbi algorithm is shown below
Because the tags are compressed, all the tags in the result are without postfixes.
node = end_node
outlist = []
while True:
    if (node is None):
        break
    outlist.append((node.observ, node.state))
    node = node.pre_node
outlist.reverse()
print("The result of the test case from the Viterbi algorithm is shown below")
print(outlist[1:len(outlist)-1])
The result of the test case from the Viterbi algorithm is shown below