Introduction
Word embedding is a mapping from words to vector representations. Methods that represent a word as a vector of numbers derived in some way from counts are called vector semantics. Ideally, the geometry of the vectors captures the semantic and syntactic meaning of the words; for example, words that are similar in meaning should have representations that are close to each other in the vector space.
In this project, a co-occurrence matrix of 10,000 words, built from a Wikipedia corpus of 1.5 billion words, is used to study the two methods learned in class. The (i, j) entry of the matrix is the number of times the ith and jth words occur within five words of each other in the corpus. The data file used in this project was downloaded from the Stanford CS168 course website. The dictionary in this data set is sorted by frequency, so the first row of the co-occurrence matrix corresponds to the most frequent word.
Method
Sparse Vector Representation
Each cell of the co-occurrence matrix holds the raw frequency with which two words co-occur. Raw frequencies can be very skewed. Pointwise mutual information (PMI) is a better measure of association between words, since it tells us how much more often two words occur together than we would expect by chance.
Pointwise mutual information measures how often two events $x$ and $y$ occur together, compared with what we would expect if they were independent:

$$PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)}$$

The PMI between a word $w$ and a context word $c$ is therefore:

$$PMI(w, c) = \log_2 \frac{P(w, c)}{P(w)P(c)}$$
Positive PMI (PPMI) replaces all negative values of PMI with 0, which eliminates the problem of unreliable negative PMI values.
PPMI is biased toward infrequent events: very rare words tend to receive very high PMI values. There are two common solutions to this problem: one is to give rare context words slightly higher probabilities, and the other is Laplace smoothing, where a small constant (for example one) is added to every count before computing PMI.
In this project, PPMI with add-2 smoothing is used to build a model that can find the words most similar to a given query word. The similarity of two words is defined as the cosine similarity between their embeddings.
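As an illustration, here is a minimal sketch of how such a PPMI matrix with add-2 smoothing could be computed from the raw co-occurrence counts; the use of NumPy and the names `M` and `ppmi` are assumptions, not the original implementation.

```python
import numpy as np

def ppmi(M, k=2):
    """Compute a PPMI matrix from a co-occurrence count matrix M,
    applying add-k (here add-2) smoothing to every cell first."""
    M = np.asarray(M, dtype=np.float64) + k       # add-k smoothing of the counts
    total = M.sum()
    p_wc = M / total                              # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)         # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)         # marginal P(c)
    pmi = np.log2(p_wc / (p_w * p_c))             # pointwise mutual information
    return np.maximum(pmi, 0.0)                   # PPMI: clip negative values to 0
```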
Analogy analysis is then performed with this model. In analogy analysis, we are given two words with some relation and a third word, and we must find a fourth word that stands in the same relation to the third word as the second word does to the first.
For example, given 'man', 'woman', and 'king', the fourth word should be 'queen'.
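One common way to formalize this task, consistent with the computation used later in this report, is to return the vocabulary word whose embedding is closest (under the cosine similarity defined in the next subsection) to the offset vector:

$$w_4 = \arg\max_{w \notin \{w_1, w_2, w_3\}} \cos\left(v_w,\ v_{w_3} - v_{w_1} + v_{w_2}\right)$$

For 'man', 'woman', and 'king', this is the word whose embedding is closest to $v_{king} - v_{man} + v_{woman}$.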
Cosine Similarity
Given two word vectors, we can measure their similarity by their inner (dot) product.
The dot product is high when the two vectors have large values in the same dimensions, and it is low (zero) for orthogonal vectors with complementary distributions. However, this metric has a problem: the dot product is larger when the vectors themselves are longer, where the length of a vector $v$ is

$$|v| = \sqrt{\sum_{i=1}^{N} v_i^2}$$
This means that more frequent words will have higher dot products. The solution is to divide the dot product by the lengths of the two vectors, which gives the cosine of the angle between them:

$$\cos(v, w) = \frac{v \cdot w}{|v|\,|w|}$$
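A small sketch of this computation for two embedding vectors, assuming NumPy arrays (this is illustrative, not the original implementation):

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine of the angle between vectors v and w:
    the dot product divided by the product of their lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```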
Dense Vector Representation
Short and dense vectors have a number of potential advantages. Short vectors may be easier to use as features in machine learning. Dense vectors may generalize better than storing explicit counts. In addition, dense vectors may perform better in capturing synonymy than sparse vectors.
Singular value decomposition (SVD) is a classic method for generating dense vectors. It was applied to the task of generating embeddings from term-document matrices in a model called latent semantic analysis (LSA).
SVD factorizes any rectangular matrix $X$ into the product of three matrices $W$, $S$, and $C$:

$$X = W S C$$
In $W$, the rows still correspond to the rows of the original matrix, but each column now represents a dimension in a new latent space, and the column vectors are orthogonal to each other. The columns are ordered by the amount of variance in the data set that each new dimension accounts for.
$S$ is a diagonal matrix of singular values expressing the importance of each dimension. $C$ still represents documents or contexts, but each row now corresponds to one of the new latent dimensions, and the $m$ row vectors are orthogonal to each other.
In this project, SVD is applied to the term-term matrix. Let $M$ denote the $10000 \times 10000$ word co-occurrence matrix. After SVD, only 100 dimensions are kept, by retaining the top 100 singular values; this yields a low-rank approximation of $M$. Only the truncated $W$ matrix is used as the word embeddings.
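A minimal sketch of this truncation step is given below; the rank-100 choice follows the text, while the use of NumPy's SVD and the function name are assumptions about the implementation.

```python
import numpy as np

def svd_embeddings(M, k=100):
    """Factorize M into W, S, C and keep only the top-k singular values;
    the truncated W matrix is used as the word embeddings."""
    W, S, C = np.linalg.svd(M, full_matrices=False)
    return W[:, :k]            # each row is a k-dimensional word vector

# embeddings = svd_embeddings(M)   # 10000 x 100 dense vectors
```

For a matrix of this size, a truncated solver such as `scipy.sparse.linalg.svds` could be used instead to avoid computing the full decomposition.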
These word embeddings are used to compute the similarity of words in the data set, as in the PPMI method. Analogy analysis is then also performed with this method.
Results and Discussion
Results of the Weighted Term Representation: PPMI
After applying add-2 smoothing to the co-occurrence matrix $M$, the PPMI matrix is computed. This PPMI matrix is then used as the word embeddings for this data set.
For a given word, the corresponding vector can be retrieved from the PPMI matrix by the index of that word. We can then compute the cosine similarity between this word and any other word: the higher the cosine similarity, the more relevant the two words are to each other.
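A sketch of this lookup is shown below; the embedding matrix `E` and the `word_to_index` and `index_to_word` mappings are illustrative names, not part of the original code.

```python
import numpy as np

def most_similar(word, E, word_to_index, index_to_word, topn=10):
    """Return the topn words whose embeddings have the highest
    cosine similarity to the embedding of `word`."""
    v = E[word_to_index[word]]                       # look up the query vector by index
    sims = E @ v / (np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-12)
    best = np.argsort(-sims)[:topn]                  # indices of the most similar words
    return [index_to_word[i] for i in best]
```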
For the similar-word experiment, several words are given as inputs, and the PPMI model is used to search for the 10 most similar words to each. The results of this experiment are shown below.
input | output (top 10 most similar words, PPMI) |
---|---|
'water' | 'water', 'liquid', 'oxygen', 'gas', 'fuel', 'fluid', 'soil', 'drainage', 'sand', 'waste' |
'boy' | 'boy', 'girl', 'kid', 'baby', 'man', 'crazy', 'woman', 'dog', 'cat', 'daddy' |
'apple' | 'apple', 'microsoft', 'ibm', 'linux', 'proprietary', 'os', 'compatible', 'processor', 'console', 'sony' |
'fruit' | 'fruit', 'fruits', 'flowers', 'vegetables', 'meat', 'corn', 'wheat', 'flower', 'seeds', 'milk' |
'cat' | 'cat', 'dog', 'rabbit', 'monkey', 'pig', 'rat', 'cow', 'duck', 'cats', 'frog' |
'beautiful' | 'beautiful', 'pretty', 'wonderful', 'attractive', 'lonely', 'nice', 'magnificent', 'sweet', 'funny', 'sad' |
'good' | 'good', 'bad', 'excellent', 'better', 'poor', 'nice', 'sure', 'wrong', 'my', 'wonderful' |
'many' | 'many', 'several', 'numerous', 'some', 'other', 'various', 'including', 'these', 'such', 'few' |
'blue' | 'blue', 'red', 'yellow', 'purple', 'black', 'pink', 'white', 'green', 'orange', 'colored' |
'play' | 'play', 'playing', 'plays', 'played', 'compete', 'perform', 'sing', 'participate', 'players', 'game' |
'try' | 'try', 'trying', 'tried', 'tries', 'attempting', 'attempt', 'able', 'attempts', 'help', 'attempted' |
'walk' | 'walk', 'walking', 'walks', 'walked', 'climb', 'swim', 'ride', 'throw', 'sit', 'wait' |
From the results above, we can see that PPMI does a good job of finding similar words for the given inputs: the words returned are all relevant to the input word. There are some interesting outputs. For 'good', the most similar word other than itself is 'bad'. Although we may think of these two words as opposites rather than as similar, they are used in very similar contexts in the corpus; that is, they are syntactically similar, so PPMI treats them as similar. The word embeddings from the PPMI matrix capture both the syntactic and the semantic properties of a word.
After the similarity experiment, an analogy experiment was also performed using the word embeddings from PPMI. Consider the analogy question "man is to woman as king is to __", where the goal is to fill in the blank.
This question can be solved by finding the word whose embedding is closest, in cosine similarity, to $v_{king} - v_{man} + v_{woman}$.
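A sketch of this offset computation is given below, reusing the illustrative names from the similarity sketch above; excluding the three query words from the candidates is a common convention and an assumption here, not something stated in the original.

```python
import numpy as np

def analogy(a, b, c, E, word_to_index, index_to_word, topn=1):
    """Answer 'a is to b as c is to ?' by finding the word whose embedding
    is closest in cosine similarity to v_c - v_a + v_b."""
    target = E[word_to_index[c]] - E[word_to_index[a]] + E[word_to_index[b]]
    sims = E @ target / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-12)
    for w in (a, b, c):
        sims[word_to_index[w]] = -np.inf             # exclude the three given words
    best = np.argsort(-sims)[:topn]
    return [index_to_word[i] for i in best]
```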
The results for this problem using cosine similarity and PPMI are as follows:

'man to woman' is similar as 'king to prince'
'boy to girl' is similar as 'king to queen'
'beijing to china' is similar as 'paris to france'
'king to queen' is similar as 'ibm to microsoft'
'korea to seoul' is similar as 'thailand to bangkok'
It seems that PPMI does a good job on this problem except for the first analogy. Next, an experiment is conducted using a test file containing more than 5,585 analogy samples to measure analogy accuracy. After a long computation, PPMI reaches only $41.6\%$ accuracy on this analogy test, which is quite low. Besides, the computational cost of working with this $10000 \times 10000$ matrix is very high, which makes the process slow.
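A sketch of how the accuracy over such a test file could be computed is given below; the assumption that each line holds one analogy as four whitespace-separated words is illustrative, since the actual file format is not described here.

```python
def analogy_accuracy(path, E, word_to_index, index_to_word):
    """Fraction of analogy questions 'a : b :: c : d' whose expected answer d
    is the top-ranked word returned by analogy()."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            parts = line.lower().split()
            if len(parts) != 4:
                continue                              # skip headers or malformed lines
            a, b, c, d = parts
            if not all(w in word_to_index for w in (a, b, c, d)):
                continue                              # skip out-of-vocabulary questions
            total += 1
            if analogy(a, b, c, E, word_to_index, index_to_word)[0] == d:
                correct += 1
    return correct / total
```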
Results of Dense Vectors via SVD
In this part, SVD is applied to decompose the co-occurrence matrix in order to obtain short, dense vectors for the word embeddings. The plot of the singular values from this decomposition is shown below.
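A sketch of how such a singular value plot could be produced with matplotlib is shown here; the function name and plotting details are assumptions rather than the original code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_singular_values(M):
    """Plot the singular value spectrum of the co-occurrence matrix M."""
    S = np.linalg.svd(M, compute_uv=False)    # singular values only
    plt.plot(S)
    plt.xlabel('index')
    plt.ylabel('singular value')
    plt.title('Singular values of the co-occurrence matrix')
    plt.show()
```

The most similar words retrieved with the truncated SVD embeddings are listed below.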
input | output (top 10 most similar words, PPMI with SVD) |
---|---|
'water' | 'water', 'liquid', 'oxygen', 'gas', 'fuel', 'fluid', 'soil', 'drainage', 'sand', 'waste' |
'boy' | 'boy', 'girl', 'kid', 'baby', 'man', 'crazy', 'woman', 'dog', 'cat', 'daddy' |
'apple' | 'apple', 'microsoft', 'ibm', 'linux', 'proprietary', 'os', 'compatible', 'processor', 'console', 'sony' |
'fruit' | 'fruit', 'fruits', 'flowers', 'vegetables', 'meat', 'corn', 'wheat', 'flower', 'seeds', 'milk' |
'cat' | 'cat', 'dog', 'rabbit', 'monkey', 'pig', 'rat', 'cow', 'duck', 'cats', 'frog' |
'beautiful' | 'beautiful', 'pretty', 'wonderful', 'attractive', 'lonely', 'nice', 'magnificent', 'sweet', 'funny', 'sad' |
'good' | 'good', 'bad', 'excellent', 'better', 'poor', 'nice', 'sure', 'wrong', 'my', 'wonderful' |
'many' | 'many', 'several', 'numerous', 'some', 'other', 'various', 'including', 'these', 'such', 'few' |
'blue' | 'blue', 'red', 'yellow', 'purple', 'black', 'pink', 'white', 'green', 'orange', 'colored' |
'play' | 'play', 'playing', 'plays', 'played', 'compete', 'perform', 'sing', 'participate', 'players', 'game' |
'try' | 'try', 'trying', 'tried', 'tries', 'attempting', 'attempt', 'able', 'attempts', 'help', 'attempted' |
'walk' | 'walk', 'walking', 'walks', 'walked', 'climb', 'swim', 'ride', 'throw', 'sit', 'wait' |
From these results, we can see that the co-occurrence matrix with SVD also does a good job of finding similar words.
After the similarity experiment, an analogy experiment was also performed using the word embeddings from the truncated matrix.
The results for this problem using cosine similarity and the dense vectors are as follows:

'man to woman' is similar as 'king to queen'
'boy to girl' is similar as 'king to queen'
'beijing to china' is similar as 'paris to france'
'king to queen' is similar as 'ibm to microsoft'
'korea to seoul' is similar as 'thailand to bangkok'

From these results, we can see that the dense vectors perform well on this analogy problem and give the correct answer for the question posed above.
As with the PPMI model, an experiment is conducted using the test file containing more than 5,585 analogy samples to measure analogy accuracy. The accuracy using the dense vectors is $54\%$, which is much better than the sparse PPMI method.
Conclusion
From all of the results of the two methods, we see that the dense vector method gives better results than the sparse PPMI method in both analogy analysis and similar-word search. In addition, the computational efficiency of the dense vectors is better than that of PPMI: short vectors are easier to use as features in machine learning, dense vectors may generalize better than explicit counts, and dense vectors may capture synonymy better than sparse vectors.