The Python gensim library can load a word2vec model to read word embeddings and compute word similarity. In this tutorial, we will show NLP beginners how to do it.
Create a word2vec bin or text file
First, you should train a word embeddings file on some text using word2vec. The file comes in two formats: binary or text. You can read this tutorial to learn how:
Best Practice to Create Word Embeddings Using Word2Vec – Word2Vec Tutorial
Install Python gensim
You should install the Python gensim library before you can use it to load a word2vec embeddings file.
Install Python Gensim with Anaconda on Windows 10: A Beginner Guide – Gensim Tutorial
Import library
# -*- coding: utf-8 -*-
import gensim
Load word2vec embeddings file
We first load the word2vec embeddings file; then we can read word embeddings from it and compute similarities.
If your word2vec file is binary, you can load it like this:
model = gensim.models.KeyedVectors.load_word2vec_format('yelp-2013-embedding-200d.bin', binary=True)
If the file is text, you can load it like this:
model = gensim.models.KeyedVectors.load_word2vec_format('yelp-2013-embedding-200d.txt', binary=False)
Here, yelp-2013-embedding-200d.txt is the word2vec embeddings file in text format.
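After loading, it is worth sanity-checking the model. Here is a minimal sketch using standard KeyedVectors attributes (for this example file, the dimension should be 200):
# sanity-check the loaded model
print(model.vector_size)  # embedding dimension, 200 for this file
print(len(model.vocab))   # vocabulary size (gensim 3.x; in gensim 4.x use len(model.key_to_index))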
Get word index in vocabulary
To get the index of a word in the vocabulary, we can use this code:
# get the word vocabulary
vocab = model.vocab
word = vocab['bad']
print(word.index)
Then you will find that the index of the word “bad” in the vocabulary is 216.
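Note: model.vocab is the gensim 3.x API. It was removed in gensim 4.x, where the equivalent lookup is a plain dict named key_to_index:
# gensim 4.x: the vocabulary is a dict mapping word -> index
idx = model.key_to_index['bad']
print(idx)  # 216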
Get word by word index
We can also get a word by its index in the vocabulary. For example:
w = model.index2word[216]
print(w)
We can find that the word at index 216 is “bad”.
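Similarly, index2word was renamed to index_to_key in gensim 4.x:
# gensim 4.x: index2word was renamed to index_to_key
w = model.index_to_key[216]
print(w)  # bad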
Compute the similarity of two words
We can compute the similarity of two words by cosine similarity; here is an example:
sim = model.similarity('love', 'bad')
print("sim = " + str(sim))
From the result, we can find that the cosine similarity of the words “love” and “bad” is:
sim = 0.163886218155
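model.similarity() simply returns the cosine of the angle between the two embedding vectors. If you want to see what it does under the hood, here is a minimal sketch that recomputes it with numpy:
import numpy as np

# recompute cosine similarity by hand; it should match model.similarity('love', 'bad')
v1 = model['love']
v2 = model['bad']
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)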
Get word embeddings
We can get the embedding of a word easily:
vec = model.word_vec('bad')
print(vec)
print(type(vec))
The embedding of the word “bad” is:
[ -2.96425015e-01 -3.69928002e-01 1.06517002e-01 -1.85122997e-01 -1.12859998e-02 -2.23900005e-01 3.68850008e-02 -2.12399997e-02 -1.75759997e-02 3.26476008e-01 5.16830012e-02 -7.16490000e-02 ... -3.25680003e-02 3.51186007e-01 -2.08217993e-01 1.31810000e-02 1.08323999e-01 1.91893995e-01 -2.82000005e-02 2.78019998e-02 2.08480999e-01 -3.19326997e-01 -5.16390018e-02 -7.68799987e-03]
The type of vec is: <class 'numpy.ndarray'>
Of course, you can also get the “bad” word embedding in a simpler way:
vec = model['bad']
print(vec)
The result is also the same.
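If you are on gensim 4.x, note that word_vec() is deprecated there in favor of get_vector(); the model['bad'] indexing form works in both versions:
# gensim 4.x preferred form, equivalent to model['bad']
vec = model.get_vector('bad')
print(vec)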
Notice: if a word is not in the vocabulary, looking it up will raise an error. For example:
vec = model.word_vec('badsdfadafdfawwww')
print(vec)
It will raise: KeyError: "word 'badsdfadafdfawwww' not in vocabulary"
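To avoid this error, you can test membership first; KeyedVectors supports the in operator. A small defensive sketch:
# check whether a word is in the vocabulary before looking it up
word = 'badsdfadafdfawwww'
if word in model:
    print(model[word])
else:
    print(word + ' is not in the vocabulary')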
Get top N similar words of a word
If you want to get the top N most similar words to the word ‘bad’, you can do it like this:
sim_words = model.similar_by_word(word='bad')
print(sim_words)
Similar words are:
[('terrible', 0.6373452544212341), ('horrible', 0.6125461459159851), ('good', 0.5624269843101501), ('either', 0.5428024530410767), ('complain', 0.5027004480361938), ('ok', 0.5009992122650146), ('awful', 0.4978830814361572), ('unspectacular', 0.4900318384170532), ('okay', 0.4786447584629059), ('mediocre', 0.4767637550830841)]
model.similar_by_word() returns the top 10 words by default; if you only want the top 5, you can do this:
sim_words = model.similar_by_word(word='bad', topn=5)
print(sim_words)
Top 5 similar words of “bad” are:
[('terrible', 0.6373452544212341), ('horrible', 0.6125461459159851), ('good', 0.5624269843101501), ('either', 0.5428024530410767), ('complain', 0.5027004480361938)]
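similar_by_word() is a thin wrapper around the more general most_similar(), which also accepts positive and negative word lists for analogy-style queries. A short sketch (the query words are illustrative; results depend on your trained embeddings):
# equivalent to model.similar_by_word('bad', topn=5)
print(model.most_similar(positive=['bad'], topn=5))

# analogy-style query: closest words to vector('terrible') + vector('good') - vector('bad')
print(model.most_similar(positive=['terrible', 'good'], negative=['bad'], topn=3))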