In a previous tutorial, we used the Python difflib library to compute the similarity of two sentences (see that tutorial for details).
However, we can also use the Python gensim library to compute their similarity. In this tutorial, we will show you how to do it.
In this example, we will use gensim to load a trained word2vec model to get word embeddings, then calculate the cosine similarity of two sentences.
Load word2vec embeddings file
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('yelp-2013-embedding-200d.txt', binary=False)
For each word in a sentence, we can look up its embedding in the word2vec embeddings file, then combine these word embeddings into a sentence embedding.
Create two sentences
sen_1 = "i love this book"
sen_2 = 'this book is my favorite'
To compare with the Python difflib library, we use the same two sentences.
How to get sentence embeddings?
In this example, we will average the embeddings of the words in a sentence to get the sentence embedding.
Notice: This is a simple method, but not a good one, because each word may contribute differently to the meaning of the sentence, while averaging gives every word the same weight.
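The averaging step can be sketched in a few lines of plain Python. This is only an illustration: the 3-dimensional vectors in toy_vectors are made up, and the helper name sentence_embedding is hypothetical (it is not part of gensim). With a real model, the vectors would come from the word2vec file instead.

```python
# Toy 3-dimensional embeddings, made up for illustration only;
# a real word2vec file would supply the actual vectors.
toy_vectors = {
    "i": [0.1, 0.3, 0.5],
    "love": [0.9, 0.1, 0.2],
    "this": [0.2, 0.2, 0.2],
    "book": [0.4, 0.8, 0.1],
}


def sentence_embedding(sentence, vectors):
    """Average the vectors of the in-vocabulary words of a sentence."""
    # Words without an embedding are skipped, as in the tutorial.
    words = [w for w in sentence.split() if w in vectors]
    if not words:
        raise ValueError("no word of the sentence is in the vocabulary")
    dim = len(vectors[words[0]])
    # Component-wise mean over all kept words.
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


emb = sentence_embedding("i love this book", toy_vectors)
print(emb)
```

Every word gets the same weight here, which is exactly the weakness mentioned in the notice above.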
Calculate the cosine similarity of the two sentences
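# Keep only the words that have an embedding in the model
# (gensim 4.x removed model.vocab; use model.key_to_index there)
sen_1_words = [w for w in sen_1.split() if w in model.vocab]
sen_2_words = [w for w in sen_2.split() if w in model.vocab]
# n_similarity returns the cosine similarity between the two word lists
sim = model.n_similarity(sen_1_words, sen_2_words)
print(sim)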
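First, we split each sentence into a word list, then compute the cosine similarity of the two lists. The similarity is: 0.839574928046
With the Python difflib library, the similarity is 0.75. Since 0.75 < 0.839574928046, gensim rates these two sentences, which express nearly the same meaning, as more similar than difflib does, so it works better than difflib on this example.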
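To make the computation behind the score more concrete, here is a minimal sketch of cosine similarity between two averaged sentence vectors, which is, as far as we understand it, what n_similarity does internally. The toy vectors and the helper names (mean_vector, cosine_similarity, sentence_similarity) are assumptions made for illustration, not gensim's API, so the printed score will not match the one above.

```python
import math

# Toy 3-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "i": [0.1, 0.3, 0.5],
    "love": [0.9, 0.1, 0.2],
    "this": [0.2, 0.2, 0.2],
    "book": [0.4, 0.8, 0.1],
    "is": [0.3, 0.1, 0.4],
    "my": [0.2, 0.4, 0.6],
    "favorite": [0.8, 0.2, 0.3],
}


def mean_vector(vecs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sentence_similarity(words_1, words_2, vectors):
    """Cosine similarity between the mean vectors of two word lists."""
    v1 = mean_vector([vectors[w] for w in words_1])
    v2 = mean_vector([vectors[w] for w in words_2])
    return cosine_similarity(v1, v2)


sen_1_words = [w for w in "i love this book".split() if w in toy_vectors]
sen_2_words = [w for w in "this book is my favorite".split() if w in toy_vectors]
sim = sentence_similarity(sen_1_words, sen_2_words, toy_vectors)
print(sim)
```

Because the score only depends on the angle between the two mean vectors, it ignores word order and sentence length, which is why it can differ a lot from a character-based measure such as difflib's.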
If you want to compute the similarity of two words with gensim, you can read this tutorial.