Next Word Prediction Using Markov Chain Model

    

Project Description

The "Next Word Prediction Using Markov Model" project is an academic initiative centered around the Markov chain model concept. Developed entirely in Python and utilizing the msvcrt module, this project aims to create a practical next word prediction system. The main objective is to leverage the predictive capabilities of Markov models to anticipate the most likely word to come after a given sequence of words. The implementation in Python, with the support of the msvcrt module, ensures a smooth and efficient execution, making it a valuable exploration of language modeling and prediction within the framework of Markov chains.

Why I chose this Project?

I chose to work on the this project due to the academic assignment context. In our class, we were assigned topics for academic project development, and the class was divided into multiple pairs, each tasked with a unique concept related to the subject. My group was assigned the concept of the Markov chain model. 

As I delved into learning about this concept and conducted research, I recognized the potential of creating a meaningful project based on it. The idea for the next word prediction system emerged during this exploration. It not only aligned perfectly with the assigned concept but also offered an opportunity to apply and extend my understanding of Markov models practically.

In essence, my choice to pursue this project was driven by the academic assignment's framework and the intriguing possibilities that the Markov chain model presented for the development of a valuable language prediction system.

Tech Stack Used:

  • Python
Next Word Prediction

Code Snippets: 

Section 1: Initializing Data Structures

train_data = 'markov_chain.txt'
first_possible_words = {}
second_possible_words = {}
transitions = {}
In this section, we set up the initial variables.
train_data
is the file containing the text data used for training the model. Three dictionaries (
first_possible_words
,
second_possible_words
, and
transitions
) will store information about the frequency and probability of word transitions.

Section 2: Helper Function for Dictionary Expansion

def expandDict(dictionary, key, value):
    if key not in dictionary:
        dictionary[key] = []
    dictionary[key].append(value)
This part of the code defines a helper function named
expandDict
. Its purpose is to make sure that a given dictionary has a specific key and that this key is associated with a list of values. If the key is not already in the dictionary, the function adds it and initializes the associated list. If the key is already present, it appends the provided value to the existing list.

Section 3: Probability Calculation Helper Function

def get_next_probability(given_list):  
    probability_dict = {}
    given_list_length = len(given_list)
    for item in given_list:
        probability_dict[item] = probability_dict.get(item, 0) + 1
    for key, value in probability_dict.items():
        probability_dict[key] = value / given_list_length
    return probability_dict
This function calculates the probability of occurrence for each item in a given list, ensuring that probabilities sum up to 1.

Section 4: Training the Markov Model

def trainMarkovModel():
    for line in open(train_data):
        tokens = line.rstrip().lower().split()
        tokens_length = len(tokens)
        for i in range(tokens_length):
            token = tokens[i]
            if i == 0:
                first_possible_words[token] = first_possible_words.get(token, 0) + 1
            else:
                prev_token = tokens[i - 1]
                if i == tokens_length - 1:
                    expandDict(transitions, (prev_token, token), 'END')
                if i == 1:
                    expandDict(second_possible_words, prev_token, token)
                else:
                    prev_prev_token = tokens[i - 2]
                    expandDict(transitions, (prev_prev_token, prev_token), token)
    
    first_possible_words_total = sum(first_possible_words.values())
    for key, value in first_possible_words.items():
        first_possible_words[key] = value / first_possible_words_total
        
    for prev_word, next_word_list in second_possible_words.items():
        second_possible_words[prev_word] = get_next_probability(next_word_list)
        
    for word_pair, next_word_list in transitions.items():
        transitions[word_pair] = get_next_probability(next_word_list)
    
This function reads the training data, tokenizes it, and populates the dictionaries (
first_possible_words
,
second_possible_words
, and
transitions
) with the necessary information for the Markov model.

Section 5: Next Word Prediction Function

def next_word(tpl):
    #print(transitions)
    if(type(tpl) == str):   #it is first word of string.. return from second word
        d = second_possible_words.get(tpl)
        if (d is not None):
            return list(d.keys())
    if(type(tpl) == tuple): #incoming words are combination of two words.. find next word now based on transitions
        d = transitions.get(tpl)
        if(d == None):
            return []
        return list(d.keys())
    return None #wrong input.. return nothing
This function takes either a single word or a tuple of two words and returns a list of possible next words based on the trained Markov model.

Section 6: Demo Code for User Interaction

trainMarkovModel()  #generate first, second words list and transitions

########## demo code below ################
print("Usage: start typing.. program will suggest words. Press tab to chose the first suggestion or keep typing\n")

import msvcrt   #use of mscvrt to get character from user on real time without pressing enter
c=''
sent=''
last_suggestion=[]
while(c != b'\r'):  #stop when user preses enter
    if(c != b'\t'): #if previous character was tab, then after autocompletion dont wait for user inpput.. just show suggestions
        c=msvcrt.getch()
    else:
        c = b' '
    if(c != b'\t'): #dont print tab etc
        print(str(c.decode('utf-8')), end='', flush=True)
    sent = sent + str(c.decode('utf-8'))  #create word on space
    if(c == b' '):
        tkns = sent.split()
        if(len(tkns) < 2):  #only first space encountered yet
            last_suggestion = next_word(tkns[0].lower())
            print(last_suggestion, end='  ', flush=True)
        else: #send a tuple
            last_suggestion = next_word((tkns[-2].lower(), tkns[-1].lower()))
            print(last_suggestion, end='  ', flush=True)
    if (c == b'\t' and len(last_suggestion) > 0):   #print last suggestion on tab
        print(last_suggestion[0], end='  ', flush=True)
        sent = sent + " " + last_suggestion[0]
 
This section initializes the Markov model using the training function and contains the demo code for user interaction. It continuously takes user input and suggests possible next words in real-time.

Challenges Faced:

During the project, I encountered a significant challenge that required a solution. After running the Python file and attempting to input text in the command panel, I noticed that although I could enter text, the predicted words were not appearing on the panel as expected.

To address this issue, I conducted extensive research and investigations. Eventually, I discovered a solution in the form of the msvcrt module. This module proved to be instrumental in achieving real-time predictions of words on the command panel, resolving the challenge I initially faced.

Public Code Repository:

You can find the source code for this project on GitHub.

References

For those interested in exploring the technologies and concepts discussed in this article, the following resources are recommended: