Implement Word Level N-grams with Python – NLTK Tutorial

By | September 18, 2019

N-gram models are widely used in NLP. In this tutorial, we will show how to create word-level and sentence-level n-grams with Python. You can use the example code as a starting point for your own NLP research.

What are n-grams?

N-grams can be built at different levels.

At the sentence level, the units are words and punctuation marks. Take this sentence:

“this is a good blog site.”

1-grams (unigrams) can be: this, is, a, good, blog, site, .

2-grams (bigrams) can be: this is, is a, a good, good blog, blog site, site .

3-grams (trigrams) can be: this is a, is a good, a good blog, good blog site, blog site .
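The sentence-level lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name sentence_ngrams is made up for illustration, and the tokens are written out by hand (with the final period as its own token) instead of coming from a tokenizer.

```python
def sentence_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hand-written tokens (an assumption); a tokenizer would normally produce them.
tokens = ["this", "is", "a", "good", "blog", "site", "."]

unigrams = sentence_ngrams(tokens, 1)
bigrams = sentence_ngrams(tokens, 2)
trigrams = sentence_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

Because the period is a separate token, the last bigram comes out as "site ." with a space before the period.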

However, if we apply n-grams at the word level, the units are single characters.

Take the word: this

1-grams: t, h, i, s

2-grams: th, hi, is

3-grams: thi, his

How to get word level n-grams?

Create a Python function to extract word-level n-grams

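def extract_word_ngrams(word, num=3):
    # Normalize: trim surrounding whitespace and lowercase
    word = word.strip().lower()
    # Pad short words with '#' so that at least one n-gram exists
    if len(word) < num:
        word = format(word, '#^' + str(num))
    grams = []
    wlen = len(word)
    for i in range(wlen - num + 1):
        # Slice out the n-gram that starts at position i
        grams.append(word[i:i + num])
    return grams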

In this function, note the case where the word is shorter than num.

For example:

n = 3, word = go

The word should be padded to length 3, which gives go#

To learn how to pad a string to a given length, you can read this tutorial:

Best Practice to Pad Python String up to Specific Length – Python Tutorial
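The padding step can also be tried on its own. The sketch below wraps the same format() call in a small helper; the name pad_word and the fill parameter are made up for illustration and are not part of the tutorial code.

```python
def pad_word(word, num, fill="#"):
    """Pad word with fill characters until it is num characters long."""
    if len(word) >= num:
        return word
    # The format spec "<fill>^<width>" centers the text inside the width.
    # When the padding cannot be split evenly, the extra character goes right.
    return format(word, fill + "^" + str(num))

print(pad_word("go", 3))      # go#
print(pad_word("i", 3))       # #i#
print(pad_word("python", 3))  # python
```

Words that are already long enough are returned unchanged, so only short words and single punctuation marks pick up the # filler.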

How to use this function?

grams = extract_word_ngrams(word='python', num = 3)

With num = 3, the word 'python' is split into these 3-grams:

['pyt', 'yth', 'tho', 'hon']

Extract word-level n-grams from a sentence with Python

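import nltk

def extract_sentence_ngrams(sentence, num=3):
    # word_tokenize needs NLTK's Punkt tokenizer data to be installed
    words = nltk.word_tokenize(sentence)
    grams = []
    for w in words:
        # Collect the n-grams of each word as one sub-list
        w_grams = extract_word_ngrams(w, num)
        grams.append(w_grams)
    return grams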

We first split the sentence into a list of words, then extract the n-grams of each word.

data = 'i like writting.'
grams = extract_sentence_ngrams(data, 3)

The sentence n-grams are:

[['#i#'], ['lik', 'ike'], ['wri', 'rit', 'itt', 'tti', 'tin', 'ing'], ['#.#']]
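The result is a nested list with one sub-list per token. If a flat list or simple frequency counts are more convenient for later processing, something like the following sketch could be used. It is not part of the tutorial functions; the variable names are made up for illustration and the nested list is copied from the output above.

```python
from collections import Counter
from itertools import chain

# Nested result copied from the example output above.
sentence_grams = [['#i#'], ['lik', 'ike'],
                  ['wri', 'rit', 'itt', 'tti', 'tin', 'ing'], ['#.#']]

# Flatten one level: a list of lists becomes a single list of n-grams.
flat = list(chain.from_iterable(sentence_grams))

# Count how often each n-gram occurs.
counts = Counter(flat)

print(flat)
print(counts['ing'])
```

Flattening loses the information about which word each n-gram came from, so keep the nested form if word boundaries matter.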
