Introduction
In this post, we will train a Hidden Markov Model for part-of-speech tagging. The first 10K tagged sentences of the 'news' category in the Brown corpus will be used to train the Hidden Markov Model. A sentence from 'alice' will then be tagged by this Hidden Markov Model.
Extract data
import nltk
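The rest of this cell is not shown above. The sketch below is one way the training data could be loaded; the variable name sents and the 10K slice mirror how the data is described and used later in this post, so treat it as a reconstruction rather than the original code.
import nltk

## requires the Brown corpus to be downloaded, e.g. nltk.download('brown')
## first 10K tagged sentences of the 'news' category in the Brown corpus
sents = nltk.corpus.brown.tagged_sents(categories='news')[:10000]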
Compress the tags
The tags provided by the Brown corpus are too detailed; they can be compressed to some extent.
Delete the postfixes of the generated tags, as follows. All dashes and everything after them are removed, and all asterisks are removed, while the $ is kept as it is.
NN-TL -> NN
BEZ -> BEZ
FW- -> FW
:-HL -> :
After this process, tags like "NN-TL" and "NN-JJ" become "NN" without their postfixes, and "JJ-TL" becomes "JJ".
Before compressing, the total number of tags is 218; after compressing, the number of tags decreases to 88. Although this loses some of the detail of the tags, it improves the accuracy of the output prediction.
import re

sents_temp = []
for i in range(len(sents)):
    sents_revised_list = []
    for j in range(len(sents[i])):
        content = list(sents[i][j])
        #content[1] = re.sub("([\$]*[\-\+]+[\w\$\*]+)+$|\$+$|(?<=\w)\*+$", "", content[1])  ## remove the $ version
        content[1] = re.sub("([\-\+]+[\w\$\*]+)+$|(?<=\w)\*+$", "", content[1])  ## keep the $ version
        content[1] = re.sub("(?<=\w)\*$", "", content[1])
        content_tuple = tuple(content)
        sents_revised_list.append(content_tuple)
    sents_temp.append(sents_revised_list)
temp = sents
sents = sents_temp
sents_temp = temp
Generate a set of tags and a set of words
- Python can maintain a set without duplicated elements. After the sets of tags and words are built, a list of tags and a list of words containing all the elements of the sets are generated. Then dictionaries for the tag indices and word indices are generated (see the sketch after the next code line).
- The tags list and words list are used to get a tag or a word via its index, and the tags dictionary and words dictionary are used to get the index of a given tag or word.
- The tags of the start state and end state are added to the tags list and tags set.
sents_list = [t for s in sents for t in s]
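The rest of that cell is truncated. The sketch below shows one way the sets, lists, and index dictionaries described above could be built; the names tags_list, tags_hash, and words_hash match the ones used later in the post, but the exact construction is an assumption.
tags_set = set(t for (w, t) in sents_list)      # unique tags after compression
words_set = set(w for (w, t) in sents_list)     # unique words in the training set
tags_list = ['START_STATE'] + list(tags_set) + ['END_STATE']   # start/end states added
words_list = list(words_set)
tags_hash = {tag: index for index, tag in enumerate(tags_list)}       # tag  -> index
words_hash = {word: index for index, word in enumerate(words_list)}   # word -> index
The ordering of the middle entries depends on set iteration, so the exact indices may differ from the output shown below.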
print("Tags after compressing:")
print(tags_hash)
Tags after compressing:
{'START_STATE': 0, 'NPS': 1, 'WPO': 2, 'PPLS': 3, 'BEN': 4, 'NNS': 5, 'BER': 6, 'VBN': 7, 'CS': 8, 'TO': 9, 'DTX': 10, ')': 11, ',': 12, 'NP': 13, 'VBD': 14, 'FW': 15, 'PP$$': 16, 'JJR': 17, 'WPS': 18, 'OD': 19, 'QLP': 20, 'AT': 21, 'EX': 22, 'CC': 23, 'HVD': 24, 'PPS': 25, 'PP$': 26, 'NR$': 27, '``': 28, 'QL': 29, 'JJS': 30, 'BE': 31, 'HVN': 32, 'PN$': 33, 'ABN': 34, 'NN$': 35, 'AP': 36, 'NNS$': 37, 'RB$': 38, 'VBZ': 39, 'NR': 40, 'PPL': 41, 'NP$': 42, 'DT$': 43, 'BEDZ': 44, 'RBT': 45, 'MD': 46, 'DT': 47, 'NN': 48, 'ABL': 49, 'BEM': 50, 'BED': 51, 'AP$': 52, 'HV': 53, '(': 54, 'WDT': 55, 'DTS': 56, 'RP': 57, 'VBG': 58, 'HVG': 59, 'NPS$': 60, 'BEZ': 61, 'JJT': 62, '.': 63, "''": 64, 'DOZ': 65, 'ABX': 66, 'CD': 67, 'DOD': 68, 'RB': 69, 'DO': 70, 'BEG': 71, 'RBR': 72, "'": 73, '--': 74, 'PPSS': 75, 'WQL': 76, '*': 77, 'IN': 78, 'VB': 79, 'PPO': 80, 'HVZ': 81, 'WP$': 82, 'WRB': 83, 'PN': 84, 'CD$': 85, 'JJ': 86, 'UH': 87, 'DTI': 88, ':': 89, 'END_STATE': 90}
Forward Algorithm
Transition Probability Matrix: A
- The number of rows and the number of columns are both equal to the number of tags.
- In order to reduce the effect of zero values in this matrix, a small value of 1e-10 is added to every element of matrix A, which is then re-normalized (a sketch of this step is given below).
import numpy as np
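The rest of that cell is truncated as well. The sketch below shows how the transition counts and the 1e-10 smoothing could be implemented; it assumes each row of A (the previous tag) is normalized over the next tag, which matches how A_normalized is indexed in the forward algorithm later, but the original normalization code is not shown in the post.
A_size = len(tags_hash)
A = np.zeros((A_size, A_size))
for sent in sents:
    prev = tags_hash['START_STATE']
    for (word, tag) in sent:
        cur = tags_hash[tag]
        A[prev][cur] += 1
        prev = cur
    A[prev][tags_hash['END_STATE']] += 1      # close every sentence with the end state
## normalize each row, then add the smoothing value and re-normalize
A_row_sum = np.sum(A, axis=1).reshape(A_size, 1)
A_row_sum[A_row_sum == 0] = 1                 # the END_STATE row has no outgoing transitions
A_normalized = A / A_row_sum
A_normalized += 1e-10
A_normalized = A_normalized / np.sum(A_normalized, axis=1).reshape(A_size, 1)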
State Observation Likelihood Matrix: B
- The rows of matrix B correspond to the generated tags, and the columns correspond to the words in our training set.
- In order to reduce the effect of zero values in this matrix, a small value of 1e-10 is added to every element of matrix B, which is then re-normalized.
B_row = len(tags_hash)
B_col = len(words_hash)
B = np.zeros((B_row, B_col))
for i in range(len(sents_list)):
    row_index = tags_hash[sents_list[i][1]]
    col_index = words_hash[sents_list[i][0]]
    B[row_index][col_index] += 1
B_col_sum = np.sum(B, axis = 0)
B_col_sum = B_col_sum.reshape(1, B_col)
B_normalized = B / B_col_sum
## re-normalized
B_normalized += 1e-10
B_col_sum = np.sum(B_normalized, axis = 0)
B_col_sum = B_col_sum.reshape(1, B_col)
B_normalized = B_normalized / B_col_sum
Get a test case
In this post, we use a sentence from 'alice' as the test case.
from nltk.tokenize import word_tokenize

test_case = nltk.corpus.gutenberg.raw('carroll-alice.txt')[91:683]
words_test = word_tokenize(test_case)
Forward Algorithm
The cell below is the implementation of the forward algorithm. The matrix 'forward' holds the probabilities of the states at each step of the test sequence.
num_state = len(tags_hash)
num_obv = len(words_test)
forward = np.zeros((num_obv, num_state))
start_state = tags_hash['START_STATE']
end_state = tags_hash['END_STATE']
## initialization step: transitions out of the start state
try:
    word_index = words_hash[words_test[0]]
    for y in range(num_state):
        forward[0][y] = A_normalized[start_state][y] * B_normalized[y][word_index]
except KeyError:
    ## unknown word: fall back to a uniform emission probability
    for y in range(num_state):
        forward[0][y] = A_normalized[start_state][y] * (1/num_state)
## recursion step
for t in range(1, num_obv):
    try:
        word_index = words_hash[words_test[t]]
        for y in range(num_state):
            for y_pre in range(num_state):
                forward[t][y] += forward[t-1][y_pre] * A_normalized[y_pre][y] * B_normalized[y][word_index]
    except KeyError:
        ## unknown word: fall back to a uniform emission probability
        for y in range(num_state):
            for y_pre in range(num_state):
                forward[t][y] += forward[t-1][y_pre] * A_normalized[y_pre][y] * (1/num_state)
Decoding of Forward Algorithm
For a given forward matrix, the state at each step can be predicted by finding the state with the largest probability at that step.
forward_output = []
for t in range(num_obv):
    mx = -1
    tag_id = 0
    for index, tags in enumerate(forward[t]):
        if mx < forward[t][index]:
            mx = forward[t][index]
            tag_id = index
    forward_output.append((words_test[t], tags_list[tag_id]))
The result of the forward algorithm is shown below.
Because the tags are compressed, all the tags in the result are without postfixes.
print("The result of the test case from the forward algorithm is shown below")
print(forward_output)
The result of the test case from the forward algorithm is shown below
Viterbi Algorithm
- The implementation of the Viterbi algorithm is given below.
- The Node class stores a previous node and a next node, so it can be used as a back pointer in the Viterbi algorithm (a sketch of a possible definition is given below).
class Node:
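The class body is truncated above. A minimal sketch of what the full definition could look like, consistent with the attributes (observ, state, pre_node) used in the decoding cell below; the constructor signature is an assumption.
class Node:
    def __init__(self, observ, state, pre_node=None, next_node=None):
        self.observ = observ          # the observed word at this step
        self.state = state            # the predicted tag at this step
        self.pre_node = pre_node      # back pointer to the best previous node
        self.next_node = next_node    # optional forward link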
num_state = len(tags_hash)
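Only the first line of the Viterbi cell survives above. The sketch below is a reconstruction of what the full cell could look like: it reuses the same unknown-word fallback as the forward algorithm and builds the end_node that the decoding cell below walks back from. The details are assumptions, not the original code.
num_state = len(tags_hash)
num_obv = len(words_test)
viterbi = np.zeros((num_obv, num_state))
back = [[None] * num_state for _ in range(num_obv)]   # back pointers stored as Node objects
start_state = tags_hash['START_STATE']

def emission(t, y):
    ## same unknown-word fallback as in the forward algorithm
    try:
        return B_normalized[y][words_hash[words_test[t]]]
    except KeyError:
        return 1 / num_state

start_node = Node(None, 'START_STATE')
for y in range(num_state):
    viterbi[0][y] = A_normalized[start_state][y] * emission(0, y)
    back[0][y] = Node(words_test[0], tags_list[y], start_node)
for t in range(1, num_obv):
    for y in range(num_state):
        scores = [viterbi[t-1][y_pre] * A_normalized[y_pre][y] for y_pre in range(num_state)]
        best_pre = int(np.argmax(scores))
        viterbi[t][y] = scores[best_pre] * emission(t, y)
        back[t][y] = Node(words_test[t], tags_list[y], back[t-1][best_pre])
## termination: choose the best final state via the transition into END_STATE
end_scores = [viterbi[num_obv-1][y] * A_normalized[y][tags_hash['END_STATE']]
              for y in range(num_state)]
end_node = Node(None, 'END_STATE', back[num_obv-1][int(np.argmax(end_scores))])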
The result of the Viterbi algorithm is shown below
Because the tags are compressed, all the tags in the result are without postfixes.
node = end_node
outlist = []
while True:
    if (node is None):
        break
    outlist.append((node.observ, node.state))
    node = node.pre_node
outlist.reverse()
print("The result of the test case from the Viterbi algorithm is shown below")
print(outlist[1:len(outlist)-1])
The result of the test case from the Viterbi algorithm is shown below