The Python gensim library can load a word2vec model to read word embeddings and compute word similarity. In this tutorial, we will show NLP beginners how to do it.
Create a word2vec bin or text file
First, you should train a word embeddings file on some text using word2vec. The file comes in two formats: binary or text. You can read this tutorial to learn how:
Best Practice to Create Word Embeddings Using Word2Vec – Word2Vec Tutorial
Install Python gensim
You should install the Python gensim library before you can use it to load a word2vec embeddings file.
Install Python Gensim with Anaconda on Windows 10: A Beginner Guide – Gensim Tutorial
Import library
# -*- coding: utf-8 -*-
import gensim
Load word2vec embeddings file
We first load the word2vec embeddings file; then we can read word embeddings from it and compute similarities.
If your word2vec file is binary, you can load it like this:
model = gensim.models.KeyedVectors.load_word2vec_format('yelp-2013-embedding-200d.bin', binary=True)
If the file is text, you can load it like this:
model = gensim.models.KeyedVectors.load_word2vec_format('yelp-2013-embedding-200d.txt', binary=False)
Here, yelp-2013-embedding-200d.txt is the word2vec embeddings file in text format.
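After loading, it is worth sanity-checking the model. Here is a minimal sketch using standard KeyedVectors attributes (for this example file, the dimension should be 200):
# sanity-check the loaded model
print(model.vector_size)  # embedding dimension, 200 for this file
print(len(model.vocab))   # vocabulary size (gensim 3.x; in gensim 4.x use len(model.key_to_index))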
Get word index in vocabulary
To get the index of a word in the vocabulary, we can use this code:
# get the word vocabulary
vocab = model.vocab
word = vocab['bad']
print(word.index)
Then you will find that the index of the word “bad” in the vocabulary is 216.
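Note: model.vocab is the gensim 3.x API. It was removed in gensim 4.x, where the equivalent lookup is a plain dict named key_to_index:
# gensim 4.x: the vocabulary is a dict mapping word -> index
idx = model.key_to_index['bad']
print(idx)  # 216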
Get word by word index
We can also get a word by its index in the vocabulary. For example:
w = model.index2word[216]
print(w)
We can find that the word at index 216 is “bad”.
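Similarly, index2word was renamed to index_to_key in gensim 4.x:
# gensim 4.x: index2word was renamed to index_to_key
w = model.index_to_key[216]
print(w)  # bad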
Compute the similarity of two words
We can compute the similarity of two words by cosine similarity; here is an example:
sim = model.similarity('love', 'bad')
print("sim = " + str(sim))
From the result, we can find that the cosine similarity of the words “love” and “bad” is:
sim = 0.163886218155
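model.similarity() simply returns the cosine of the angle between the two embedding vectors. If you want to see what it does under the hood, here is a minimal sketch that recomputes it with numpy:
import numpy as np

# recompute cosine similarity by hand; it should match model.similarity('love', 'bad')
v1 = model['love']
v2 = model['bad']
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)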
Get word embeddings
We can get the embedding of a word easily:
vec = model.word_vec('bad')
print(vec)
print(type(vec))
The embedding of the word “bad” is:
[ -2.96425015e-01 -3.69928002e-01 1.06517002e-01 -1.85122997e-01 -1.12859998e-02 -2.23900005e-01 3.68850008e-02 -2.12399997e-02 -1.75759997e-02 3.26476008e-01 5.16830012e-02 -7.16490000e-02 ... -3.25680003e-02 3.51186007e-01 -2.08217993e-01 1.31810000e-02 1.08323999e-01 1.91893995e-01 -2.82000005e-02 2.78019998e-02 2.08480999e-01 -3.19326997e-01 -5.16390018e-02 -7.68799987e-03]
The type of vec is: <class 'numpy.ndarray'>
Of course, you can also get the “bad” word embedding in a simpler way:
vec = model['bad']
print(vec)
The result is also the same.
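If you are on gensim 4.x, note that word_vec() is deprecated there in favor of get_vector(); the model['bad'] indexing form works in both versions:
# gensim 4.x preferred form, equivalent to model['bad']
vec = model.get_vector('bad')
print(vec)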
Notice: if a word is not in the vocabulary, looking it up will raise an error. For example:
vec = model.word_vec('badsdfadafdfawwww')
print(vec)
It will raise: KeyError: "word 'badsdfadafdfawwww' not in vocabulary"
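To avoid this error, you can test membership first; KeyedVectors supports the in operator. A small defensive sketch:
# check whether a word is in the vocabulary before looking it up
word = 'badsdfadafdfawwww'
if word in model:
    print(model[word])
else:
    print(word + ' is not in the vocabulary')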
Get top N similar words of a word
If you want to get the top N most similar words to the word ‘bad’, you can do it like this:
sim_words = model.similar_by_word(word='bad')
print(sim_words)
Similar words are:
[('terrible', 0.6373452544212341), ('horrible', 0.6125461459159851), ('good', 0.5624269843101501), ('either', 0.5428024530410767), ('complain', 0.5027004480361938), ('ok', 0.5009992122650146), ('awful', 0.4978830814361572), ('unspectacular', 0.4900318384170532), ('okay', 0.4786447584629059), ('mediocre', 0.4767637550830841)]
model.similar_by_word() returns the top 10 words by default; if you only want the top 5, you can do this:
sim_words = model.similar_by_word(word='bad', topn=5)
print(sim_words)
Top 5 similar words of “bad” are:
[('terrible', 0.6373452544212341), ('horrible', 0.6125461459159851), ('good', 0.5624269843101501), ('either', 0.5428024530410767), ('complain', 0.5027004480361938)]
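similar_by_word() is a thin wrapper around the more general most_similar(), which also accepts positive and negative word lists for analogy-style queries. A short sketch (the query words are illustrative; results depend on your trained embeddings):
# equivalent to model.similar_by_word('bad', topn=5)
print(model.most_similar(positive=['bad'], topn=5))

# analogy-style query: closest words to vector('terrible') + vector('good') - vector('bad')
print(model.most_similar(positive=['terrible', 'good'], negative=['bad'], topn=3))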