Create a Superhero Character Chatbot Using Haystack, Qdrant, Llama 3, and Streamlit

M Quamer Nasim

Ever since the rise of chatbots and conversational AI, developers and businesses have been exploring innovative ways to create engaging and interactive experiences for users. One such interesting application is building RAG (Retrieval Augmented Generation)-based chatbots that allow users to have conversations with their favorite superheroes, in their tone—a virtual superhero, if you will. Imagine playing a game where you can choose superhero characters as your allies and engage in conversations with them to solve puzzles, fight villains, or explore new worlds. This blend of RAG-based chatbots with superhero characters not only provides entertainment but also opens up a lot of possibilities for immersive storytelling, fan engagement, and educational interactions.

In this blog post, we will explore how to build a superhero chatbot that can mimic the personalities of popular superheroes and engage in conversations with users. Though the best way to create a superhero chatbot that mimics the personalities of superheroes is to fine-tune a pre-trained language model on a dataset of superhero dialogues, we will take a simpler approach. We will use the RAG-based chatbot model to generate responses in the style of popular superheroes. To build this RAG-based superhero chatbot, we will use Haystack, a framework that makes it easy to build end-to-end RAG-based chatbots. To store and retrieve the superhero dialogues, we will use the Qdrant Vector Database. To generate responses in the RAG-based chatbot, we will be using the Llama 3 model.

Applications in Gaming & Entertainment

The concept of chatting with AI-powered superhero characters has many exciting applications in the gaming and entertainment industries. In video games, interactive dialogues with superheroes can enhance the playing experience, making players feel more involved in the storyline. For instance, gamers could receive real-time advice or mission briefings from their superhero allies, adding depth and personalization to the play.

This approach can also be used to create engaging marketing campaigns in the entertainment industry. Imagine a movie premiere where fans can interact with their favorite characters in real-time, asking questions and receiving responses in the superhero’s or heroine’s unique voice and tone. Additionally, themed attractions and virtual reality experiences can incorporate AI superheroes to provide visitors with a more immersive and interactive adventure. This can also serve an educational purpose, such as teaching kids about science and technology through conversations with characters they like or by encouraging good morals with insights from the superheroes themselves. By using these RAG-based chatbots for superhero characters’ tone mimicry, we can create engaging and interactive experiences for users, enhancing the overall entertainment value of games, movies, and other forms of media.

Problem Statement: Building a Superhero Character AI

As mentioned earlier, instead of fine-tuning a language model on a dataset of superhero dialogues, we will use a RAG-based chatbot to generate responses in the style of popular superheroes. This approach relies on in-context (few-shot) learning: the retrieved dialogues serve as examples that the model uses to shape the tone of its responses.

Before we start building the superhero chatbot, let’s first understand the key components involved in the process:

  • Data Collection: We will collect open-sourced movie scripts of popular superheroes.
  • Data Preprocessing: We will preprocess the data to extract dialogues and create embeddings using FastEmbed.
  • Vector Database: We will store the embeddings in the Qdrant vector database for efficient retrieval.
  • RAG Framework: We will use Haystack to build an RAG-based chatbot that can retrieve and generate responses. Haystack provides a simple and easy-to-use/understand API for building end-to-end RAG-based chatbots.
  • Llama 3 Model: We will use the Llama 3 model as the LLM to generate responses based on the query and context. Llama 3 is a powerful large language model capable of generating high-quality responses; it is responsible for understanding the context and answering in the style of the chosen superhero.
  • User Interface: We will create a simple user interface using Streamlit. The interface will allow us to select a superhero and engage in conversations with the AI-powered character.

Great! Now that we have a clear roadmap, let’s dive into the implementation of the chatbot using the RAG framework.

RAG-Based Superhero Character Chatbot

This blog is divided into 4 parts. Let’s see what we are going to do in each part and understand the WHYs behind it.

  • Part 1: Data Collection and Preprocessing: Here, we will collect the movie scripts from the internet. We will then preprocess the data to make it ready for ingestion.
  • Part 2: Data Ingestion and Indexing: We will then ingest and index the data into the Qdrant Vector Database using FastEmbed.
  • Part 3: RAG-Based Chatbot for Mimicking a Conversation: We’ll use the indexed data to build a character chatbot using the Haystack framework.
  • Part 4: Building a User Interface for the Chatbot: Finally, we’ll build a user interface for the chatbot using Streamlit.

Let’s start.

Part 1: Data Collection and Preprocessing

Before we go ahead, let’s first see the list of superheroes whose dialogues we are going to use to build the character chatbot. I have also noted down each superhero’s real name, which we will use to extract their dialogue from the respective movie scripts, along with the movies in which each superhero appears. All these details are saved in the config.yaml file. Let’s see its contents.

# List of all the superheroes we are interested in
LIST_OF_SUPERHEROES: [
  "Batman",
  "Superman",
  "Wonder Woman",
  "Spiderman",
  "Ironman",
  "Captain America",
  "Black Widow",
  "Hulk",
  "Thor",
  "Deadpool",
  "Star Lord",
  "Thanos",
  "Groot",
  "Rocket",
  "Doctor Strange",
  "Drax",
  "Vision",
  "Jarvis",
]

# Every superhero has a real name and a mask name. This is a list of all the real names of the superheroes
# We will use this to identify the superhero in the dialogues and scripts
SUPERHERO_SYNONYMS:
  Batman: ["Bruce Wayne"]
  Superman: ["Clark Kent"]
  Wonder Woman: ["Diana Prince"]
  Spiderman: ["Peter Parker"]
  Ironman: ["Tony Stark"]
  Captain America: ["Steve Rogers"]
  Black Widow: ["Natasha Romanoff"]
  Hulk: ["Bruce Banner"]
  Thor: ["Thor Odinson"]
  Deadpool: ["Wade Wilson"]
  Star Lord: ["Peter Quill"]
  Thanos: ["Thanos"]
  Groot: ["Groot"]
  Rocket: ["Rocket"]
  Doctor Strange: ["Stephen Strange"]
  Drax: ["Drax"]
  Vision: ["Vision"]
  Jarvis: ["Jarvis"]

# List of all the movies in which the superheroes have appeared
# We will use this to separate the dialogues of each superhero
MOVIES_LIST_OF_SUPERHEROES:
  Batman: [
    "Batman-v-Superman-Dawn-of-Justice.pdf",
    "batman-begins-2005.pdf",
    "the-dark-knight-2008.pdf",
    "the-dark-knight-rises-2012.pdf",
  ]
  Superman: [
    "Batman-v-Superman-Dawn-of-Justice.pdf",
    "Man of Steel.pdf",
  ]
  Wonder Woman: ["wonder-woman-2017.pdf"]
  Spiderman: [
    "Captain America: Civil War.pdf",
    "avengers infinity war.pdf",
    "spider-man-no-way-home-2021.pdf",
  ]
  Ironman: [
    "Avengers: Age of Ultron.txt",
    "Captain America: Civil War.pdf",
    "avengers infinity war.pdf",
    "avengers-endgame.pdf",
    "iron-man-2008.pdf",
    "the-avengers-2012.pdf",
  ]
  Captain America: [
    "Captain America: Civil War.pdf",
    "captain america winter solider.pdf",
    "captain-america-the-first-avenger-20.pdf",
    "avengers infinity war.pdf",
    "avengers-endgame.pdf",
    "the-avengers-2012.pdf",
    "Avengers: Age of Ultron.txt",
  ]
  Black Widow: [
    "avengers infinity war.pdf",
    "avengers-endgame.pdf",
    "the-avengers-2012.pdf",
    "Avengers: Age of Ultron.txt",
  ]
  Hulk: [
    "avengers infinity war.pdf",
    "avengers-endgame.pdf",
    "the-avengers-2012.pdf",
    "Avengers: Age of Ultron.txt",
  ]
  Thor: [
    "Avengers: Age of Ultron.txt",
    "avengers infinity war.pdf",
    "avengers-endgame.pdf",
    "the-avengers-2012.pdf",
    "thor-2011.pdf",
    "thor-ragnorak-2017.pdf",
  ]
  Deadpool: [
    "Deadpool 2.txt",
    "deadpool-2016.pdf",
  ]
  Star Lord: [
    "Guardians of the Galaxy vol 2.pdf",
    "Guardians_of_the_Galaxy_Movie_Transcript.pdf",
  ]
  Thanos: [
    "avengers infinity war.pdf",
    "avengers-endgame.pdf",
  ]
  Groot: [
    "Guardians of the Galaxy vol 2.pdf",
    "Guardians_of_the_Galaxy_Movie_Transcript.pdf",
  ]
  Rocket: [
    "Guardians of the Galaxy vol 2.pdf",
    "Guardians_of_the_Galaxy_Movie_Transcript.pdf",
  ]
  Doctor Strange: [
    "avengers infinity war.pdf",
    "spider-man-no-way-home-2021.pdf",
  ]
  Drax: [
    "Guardians of the Galaxy vol 2.pdf",
    "Guardians_of_the_Galaxy_Movie_Transcript.pdf",
  ]
  Vision: [
    "avengers infinity war.pdf",
    "Avengers: Age of Ultron.txt",
    "Captain America: Civil War.pdf",
  ]
  Jarvis: [
    "Avengers: Age of Ultron.txt",
    "the-avengers-2012.pdf",
    "iron-man-2008.pdf",
  ]

Now that we have seen all the superheroes whose dialogues we are going to use, let’s move on to the next part, where we will preprocess the data to extract the dialogues, along with a small window of surrounding context, from the movie scripts. We need not only the dialogues but also the context in which they are spoken; this will help us build a better chatbot.

Let’s first define some constants like the location of the data files, etc.

root = '.'
data_folder = 'data' # folder where all the data is stored
script_folder = 'scripts' # folder where all the scripts are stored
config_file = 'config.yaml' # file where the configuration is stored

Let’s now load the libraries as well.

import os
import re
import pymupdf # used to load the pdfs
from tqdm.notebook import tqdm
from os.path import join as pjoin

Great! Now, let’s load the configuration file so that we can use the details of the superheroes in our code.

import yaml

# load the configuration
with open(pjoin(root, config_file), 'r') as f:
    config = yaml.safe_load(f)

# used to loop through the scripts
list_of_superheroes = config['LIST_OF_SUPERHEROES']

# very essential for efficient dialogue extraction:
# in some scripts, the superhero's alias and real name are used interchangeably
superhero_synonyms = config['SUPERHERO_SYNONYMS']

# used to get the relevant scripts for a particular superhero
movies_list_of_superheroes = config['MOVIES_LIST_OF_SUPERHEROES']

Here, we have loaded the configuration file and extracted the list of superheroes, their synonyms, and the list of movies for each superhero. This will help us extract the dialogue from the movie scripts.

# used to save the dialogues that are extracted from the scripts
dialogue_folder = 'dialogues'
# with each dialogue, we want to have a context of the previous dialogues so that the model can learn and understand the dialogues better
# we will try to keep the context length small
max_context_length = 100
# all the extracted dialogue chunks will be joined together, with a special token added in between them
# this is done so that we end up with one single txt file per movie for a particular superhero
dialogues_joiner = '\n|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|_/-|\n\n'

# used to load the pdfs of the scripts
data_folder_path = pjoin(root, data_folder)
all_movie_scripts = os.listdir(pjoin(root, data_folder, script_folder))

Here, we have defined some additional constants: the folder where we will save the dialogues, the maximum amount of context we want to keep with each extracted dialogue, and the special token used to join the dialogue chunks so that each superhero ends up with a single txt file per movie.

Now, let’s define some helper functions that will help us extract the dialogues of each character.

def extract_text_from_pdf(pdf_path):
    '''
    This function extracts the text from a pdf file using the pymupdf library
    '''
    # open the pdf file
    pdf = pymupdf.open(pdf_path)
    text = ''
    for page in pdf:
        # extract the text from the page and keep adding it to the text variable
        text += page.get_text()
    return text

def get_all_superhero_names(superhero, superhero_synonyms):
    '''
    With each superhero, we will have a list of names that the superhero can be referred to in the script
    For example, for Batman, the names can be Batman, Bruce Wayne, Bruce, Wayne, Bruce-Wayne, etc.
    '''
    # get the superhero synonym which essentially is the real name of the superhero
    superhero_synonym = superhero_synonyms[superhero][0]
    # get all the possible names of the superhero
    superhero_names = [superhero.upper(), superhero_synonym.upper(), superhero_synonym.replace(' ', '-').upper()]
    superhero_names = superhero_names + [i.upper() for i in superhero_synonym.split()]
    return superhero_names

def split_script_by_superhero_dialogue(script_text, superhero_names):
    '''
    This function splits the script such that we have the split points where the dialogues of the superhero start
    Since we're only interested in the dialogues of the superhero, we will split the script based on the dialogues of the superhero
    '''
    # we will find all the matches of the superhero names in the script
    matches = re.finditer("|".join(superhero_names), script_text)
    # get the split points where the dialogues of the superhero start
    split_points = [match.start() for match in matches][1:] + [len(script_text)]
    # extract the dialogues of the superhero
    extrcated_split_script_text = [script_text[split_points[i]:split_points[i+1]] for i in range(len(split_points) - 1)]
    return extrcated_split_script_text

def remove_extra_charachters_dialogue_from_each_split(extrcated_split_script_text, max_extra_dialogues=3):
    '''
    This function removes the extra characters from the dialogues extracted from the script
    It checks if there are other characters in the dialogues other than the dialogues of the superhero
    This is done by checking if a line has only uppercase characters, spaces, and some special characters
    This means that the line is indicative of a start of a new dialogue
    We only keep the dialogues till the max_extra_dialogues and remove the rest
    '''
    # pattern to check if a line has only uppercase characters, spaces, and some special characters
    pattern = re.compile(r'^[A-Z\s\'().,-]+$', re.MULTILINE)
    extrcated_split_script_text_filtered = []

    for idx in range(len(extrcated_split_script_text)):
        # find all the matches of the pattern in the dialogue
        matches = re.finditer(pattern, extrcated_split_script_text[idx])
        # get the indices of the matches
        indices = [match.start() for match in matches]
        # if there are more than max_extra_dialogues, we only keep the dialogues till the max_extra_dialogues
        if len(indices) >= 1:
            max_indices = len(extrcated_split_script_text[idx]) if len(indices) == 1 else indices[:max_extra_dialogues][-1]
            extrcated_split_script_text_filtered.append(extrcated_split_script_text[idx][:max_indices])

    return extrcated_split_script_text_filtered

def combine_dialogue_with_context(script_text, extrcated_split_script_text_filtered, max_context_length):
    '''
    Combine the dialogues with the context of the previous dialogues.
    This is very essential for the model to learn the dialogues better. A dialogue without context is of no use.
    Dialogues when combined with the context of the previous dialogues can help the model understand the dialogues better.
    '''
    dialogue_with_context_all = []
    # loop through all the dialogues
    for idx in range(len(extrcated_split_script_text_filtered)):
        # for each dialogue, get the index of the start of the dialogue in the script
        dialogue_idx = script_text.find(extrcated_split_script_text_filtered[idx])
        # add the context of the previous dialogues to the current dialogue and append it to the list
        # (clamp at 0 so dialogues near the start of the script don't slice with a negative index)
        context_start = max(0, dialogue_idx - max_context_length)
        dialogue_with_context = script_text[context_start:dialogue_idx] + extrcated_split_script_text_filtered[idx]
        dialogue_with_context_all.append(dialogue_with_context)

    return dialogue_with_context_all

Now, let’s extract the dialogue of each character from their movies.

# loop through all the superheroes
for superhero in tqdm(list_of_superheroes):
    superhero_script = []
    # loop through all the scripts of the superhero
    for script in movies_list_of_superheroes[superhero]:
        superhero_dialogue_save_path = pjoin(data_folder_path, dialogue_folder, superhero)
        save_script_name = ".".join(script.split('.')[:-1]) + '.txt'
        script_path = pjoin(data_folder_path, script_folder, script)
        os.makedirs(superhero_dialogue_save_path, exist_ok=True)

        # extract the text from the pdf
        script_text = extract_text_from_pdf(script_path)
        # get all the names of the superhero
        superhero_names = get_all_superhero_names(superhero, superhero_synonyms)
        # split the script based on the dialogues of the superhero
        extrcated_split_script_text = split_script_by_superhero_dialogue(script_text, superhero_names)
        # remove the extra characters from the dialogues
        extrcated_split_script_text_filtered = remove_extra_charachters_dialogue_from_each_split(extrcated_split_script_text, max_extra_dialogues=3)
        # combine the dialogues with the context of the previous dialogues
        dialogues_with_context = combine_dialogue_with_context(script_text, extrcated_split_script_text_filtered, max_context_length)
        # join the dialogues with the context
        dialogues_with_context_combined = f"{dialogues_joiner}".join(dialogues_with_context)
        # save the dialogues with the context to a txt file
        with open(pjoin(superhero_dialogue_save_path, save_script_name), 'w') as f:
            f.write(dialogues_with_context_combined)

The above code does the following:

  • Extract the text from the PDF file using the PyMuPDF library.
  • Splits the script based on the superhero’s dialogues. We are only interested in the dialogues of the superhero.
  • Removes other characters’ dialogue from each split. A line consisting only of uppercase letters, spaces, and a few special characters marks the start of another speaker’s dialogue, and each split is truncated after at most 3 such speaker headings.
  • Combines the dialogues with the context of the previous dialogues. This is essential for the model to learn the dialogues better. A dialogue without context is of no use. Dialogues, when combined with the context of the previous dialogues, can help the model understand the dialogues better.
  • Saves the dialogues with the context to a txt file.
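
As a quick check that these steps produced what we expect, you can peek at one of the saved files. The sketch below is only an example; the Batman/Dark Knight path follows the naming scheme used in the loop above (the movie PDF name with a .txt extension), so substitute any superhero and movie from your config.

# Optional sanity check: peek at one of the files the loop above just wrote
# ('Batman' / 'the-dark-knight-2008.txt' is just an example pair from the config)
sample_file = pjoin(data_folder_path, dialogue_folder, 'Batman', 'the-dark-knight-2008.txt')
with open(sample_file) as f:
    chunks = f.read().split(dialogues_joiner)
print(f'{len(chunks)} dialogue chunks extracted')
print(chunks[0][:300])  # first chunk: ~100 characters of context followed by the dialogue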

So, with these preprocessing steps, we have successfully converted the movie scripts into dialogues with context, only for the superheroes we are interested in. This will help us build a better chatbot, since the quality of the data strongly influences the quality of the results. Currently, each extracted chunk has three main parts: the context, the superhero’s dialogue, and some extra characters’ dialogues.

Ideally, we would keep only the context and the superhero’s dialogue, removing the other characters’ lines and the screen directions mixed into the extracted chunks. For this blog, I have kept the chunks as they are: context, dialogue, and extra characters’ dialogues, but you can always modify the code to strip the extra material. One possible post-extraction step is to use an LLM to keep just the superhero’s dialogue and its context and drop everything else, which would further improve the quality of the data. I have provided the code for this below, but I have not used it in this blog, since filtering the dialogues with an LLM is time- and resource-intensive. Feel free to give it a try.

import torch
import transformers

def load_model_pipeline(model_id, batch_size):
    '''
    Load the model pipeline with the model id and the batch size
    '''
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
        batch_size=batch_size,
    )

    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_flash_sdp(False)
    return pipeline

def extract_dialogue_from_llm(pipeline, messages):
    '''
    Extract the dialogues only from the model based on the messages
    '''
    pipeline.tokenizer.pad_token_id = pipeline.tokenizer.eos_token_id
    pipeline.tokenizer.padding_side = 'left'

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = pipeline(
        messages,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=1,
        top_p=1,
    )

    return outputs

def create_batch(extrcated_split_script_text, superhero, superhero_names, batch_size):
    '''
    To make the extraction of the dialogues faster, we will create a batch of messages
    Messages are the prompts that have information about the system and the user
    '''
    messages_batch = []
    # pre-compute the quoted list of names so it can be dropped into the prompts
    superhero_names_str = ", ".join([f"'{i}'" for i in superhero_names])
    for extracted_text in tqdm(extrcated_split_script_text[:batch_size]):
        system_prompt = f"You are a movie dialogue separator. From the context you are given, separate the dialogue and provide the dialogue of a character. You are only allowed to give the final dialogue, nothing else. Don't say anything else, just list the dialogue. Always start with the NAME of the character followed by a colon and then the dialogue. The extracted dialogue should always be in a single line. Make sure that you extract all the dialogues of the asked character; they can be present in multiple lines. These are the identifiers for the character whose dialogues you need to extract: {superhero_names_str}. The identifiers are always in capital letters."
        user_prompt = f"Extract only the dialogues of {superhero.upper()} - Synonyms of {superhero.upper()} are {superhero_names_str}. Now extract the dialogue based on the synonyms given from the following text\n\n\n\n {extracted_text} \n\n\n\n\n Make sure you only extract the dialogue of {superhero_names_str}. The dialogues start only after the name of the character in capital letters."

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        messages_batch.append(messages)
    return messages_batch

batch_size = 8
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = load_model_pipeline(model_id, batch_size)
messages_batch = create_batch(extrcated_split_script_text, superhero, superhero_names, batch_size)
extrcated_dialogue = extract_dialogue_from_llm(pipeline, messages_batch)
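
If you do run this optional filtering, the pipeline returns chat-style outputs rather than plain strings. The exact structure depends on your transformers version; with recent versions that accept message lists, pulling out just the assistant replies looks roughly like the sketch below (not part of the main flow, so adapt it to what your version actually returns).

# Collect only the assistant replies from the batched pipeline output.
# Assumes a transformers version that returns the full chat under "generated_text"
# with one list of candidates per input conversation.
cleaned_dialogues = []
for output in extrcated_dialogue:
    assistant_reply = output[0]["generated_text"][-1]["content"]
    cleaned_dialogues.append(assistant_reply.strip())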

Now that we have successfully preprocessed the data, let’s move on to the next part, where we will ingest and index the data into the Qdrant Vector Database using FastEmbed.

Part 2: Data Ingestion and Indexing

In this section, we will create a separate vector store for each superhero. We’ll start by loading all the dialogues available for each superhero, create embeddings for each dialogue, and then store these embeddings along with the corresponding documents in the vector store. Later, at query time, we can embed a question and retrieve the most similar dialogue chunks for the chosen superhero. Here, we will use local on-disk storage for the embeddings. The Qdrant-Haystack integration provides other storage options as well, such as in-memory storage and Qdrant Cloud. You can also connect to a Qdrant instance running in a Docker container; in that case, you pass the URL of the Qdrant server when instantiating the document store.
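
For reference, this is roughly how those storage modes are selected when instantiating the document store. The keyword arguments mirror the underlying Qdrant client, so double-check them against the version of the integration you have installed; the index name and paths below are just the values used in this post.

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Local on-disk storage (what we use in this post): embeddings are persisted to a folder
local_store = QdrantDocumentStore(path="../QDRANT_VECTOR_DATABASE", index="Thanos", embedding_dim=384)

# In-memory storage: nothing is persisted, handy for quick experiments
memory_store = QdrantDocumentStore(location=":memory:", index="Thanos", embedding_dim=384)

# Remote Qdrant (Docker container or Qdrant Cloud): pass the server URL (and an API key for the cloud)
remote_store = QdrantDocumentStore(url="http://localhost:6333", index="Thanos", embedding_dim=384)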

Great! Let’s now go ahead and create the vector store for each superhero.

Let’s first define some constants.

root = '..'
data_folder = 'data'
script_folder = 'scripts'
dialogue_folder = 'dialogues'
config_file = 'config.yaml'
embed_dim = 384 # Embedding dimension for the document embedder
vector_store_name = 'QDRANT_VECTOR_DATABASE' # location of the vector store
vector_store_path = pjoin(root, vector_store_name)
embedding_model = 'BAAI/bge-small-en-v1.5' # Embedding model for the document embedder

with open(pjoin(root, config_file), 'r') as f:
    config = yaml.safe_load(f)

dialogues_joiner = config['DIALOGUES_JOINER']
list_of_superheroes = config['LIST_OF_SUPERHEROES']

Great. Now, let’s go ahead with embedding generation and indexing.

from haystack_integrations.components.embedders.fastembed import FastembedDocumentEmbedder
from haystack import Document
import yaml
import os
from os.path import join as pjoin
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

def load_all_dialogues(movies_available, superhero_dialogue_path, dialogues_joiner):
    '''
    This function loads all the dialogues from all the movies available for a particular superhero.
    '''
    dialogues = []
    movies_as_meta_data = []
    for movie in movies_available:
        with open(pjoin(superhero_dialogue_path, movie)) as file:
            data = file.read()

        temp = data.split(dialogues_joiner)
        movies_as_meta_data.extend([movie] * len(temp))
        dialogues.extend(temp)

    return dialogues, movies_as_meta_data

def load_embedding_model(embedding_model):
    '''
    Load the embedding model for the document embedder.
    '''
    document_embedder = FastembedDocumentEmbedder(model=embedding_model)
    document_embedder.warm_up()
    return document_embedder

def index_documents(superhero, dialogues, movies_as_meta_data, vector_store_path, document_embedder, embed_dim):
    '''
    This function indexes the documents in the Qdrant Document Store.
    '''
    # Create Haystack Documents
    documents = [
        Document(
            content=dialogue,
            meta={
                'name': superhero,
                'movie': movie[:-4]
            }
        )
        for dialogue, movie in zip(dialogues, movies_as_meta_data) if len(dialogue) < 2000
    ]

    # Instantiate Qdrant Document Store
    document_store = QdrantDocumentStore(
        path=vector_store_path,
        index=superhero,
        embedding_dim=embed_dim,
    )

    # Get embeddings for the documents
    documents_with_embeddings = document_embedder.run(documents)["documents"]

    # Write documents to the document store with embeddings and save it in the local directory
    document_store.write_documents(documents_with_embeddings)

    return document_store

for superhero in list_of_superheroes:
    print(f'Generating vector index for {superhero}')
    movies_available = os.listdir(pjoin(root, data_folder, dialogue_folder, superhero))
    superhero_dialogue_path = pjoin(root, data_folder, dialogue_folder, superhero)

    dialogues, movies_as_meta_data = load_all_dialogues(movies_available, superhero_dialogue_path, dialogues_joiner)
    document_embedder = load_embedding_model(embedding_model)
    index_documents(superhero, dialogues, movies_as_meta_data, vector_store_path, document_embedder, embed_dim)

Great. With this, we have successfully created the vector store for each superhero. Indexing can take a while, since an embedding has to be computed for every dialogue chunk. Once the embeddings are stored, we can move on to the next part, where we will build the RAG-based chatbot on top of these vector stores.
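
If you want to double-check that the documents actually landed in a store before building the chatbot, you can reopen one index and count its documents, roughly as sketched below ('Thanos' is just an example index; restart the kernel first if Qdrant complains that the local storage folder is still locked by the indexing run above).

# Quick verification: reopen one superhero's index and count the stored documents
document_store = QdrantDocumentStore(
    path=vector_store_path,
    index='Thanos',
    embedding_dim=embed_dim,
)
print(document_store.count_documents())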

Part 3: RAG-Based Chatbot for Mimicking a Conversation

Now that we have saved the vector stores of each superhero, we can go ahead and build the RAG-based chatbot. In this section, we will build the chatbot using the Haystack library. Haystack is a framework for building end-to-end search and RAG pipelines that lets us build powerful search and QA systems, and it is very intuitive and easy to use at the implementation level. Like any other RAG-based application, ours will have the following components:

  • Embeddings for the query: We will be using the FastEmbed model to create the query embeddings.
  • Retriever: This retrieves the relevant documents from the document store based on the query embeddings.
  • prompt_builder: This is used to build the prompt template for the RAG model.
  • Generator: This generates the answer to the query based on the prompt template.

We start off by defining some constants, similar to the previous parts.

from os.path import join as pjoin
import yaml

root = '..'
data_folder = 'data'
script_folder = 'scripts'
dialogue_folder = 'dialogues'
config_file = 'config.yaml'
embed_dim = 384
vector_store_name = 'QDRANT_VECTOR_DATABASE'
vector_store_path = pjoin(root, vector_store_name)
embedding_model = 'BAAI/bge-small-en-v1.5'
llm_model = "meta-llama/Meta-Llama-3-8B-Instruct"
max_new_tokens = 250
number_of_documents_to_retrieve = 5
superhero = 'Thanos'
config_path = pjoin(root, config_file)

Before we dive into the details of building the RAG application using Haystack, let’s first define a few functions that will help ease the process.

from haystack import Pipeline
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack_integrations.components.embedders.fastembed import FastembedTextEmbedder
from haystack.components.builders import PromptBuilder
from haystack.components.others import Multiplexer

def get_superhero_names(superhero, config_path):
    '''
    This function returns a list of superhero names and their synonyms
    '''
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    superhero_synonym = config['SUPERHERO_SYNONYMS'][superhero][0]
    superhero_names = [superhero.upper(), superhero_synonym.upper(), superhero_synonym.replace(' ', '-').upper()]
    superhero_names = list(set(superhero_names))
    return superhero_names

def load_document_store(vector_store_path, superhero, embed_dim):
    '''
    This function loads the QdrantDocumentStore that will be used by retriever to fetch the context documents
    '''
    document_store = QdrantDocumentStore(
        path=vector_store_path,
        index=superhero,
        embedding_dim=embed_dim,
    )
    return document_store

def build_prompt():
    '''
    This function builds the prompt template that will be used by the PromptBuilder
    '''
    prompt = """
You are a helpful AI assistant that mimics the tone of the specified character based on provided context documents.
Use the context to capture and replicate the character's tone accurately.

You will be given a set of CONTEXT documents, which you should use to understand and replicate the character's tone in your response.
The context should primarily inform the tone rather than the content of your answer.
You may answer questions without the context if it is not necessary, but always ensure your tone matches that of the character.

Respond without prefacing with phrases like "Based on the context..." or "I think...".

If the context is not necessary to answer the question, you may ignore it.
You may also use your own knowledge of the character's tone to answer the question in the same tone.

When you start your response, ensure that it is clear that you are answering the question.
Do not say things like "Please respond with the tone of...". Just directly answer the question in the character's tone.

******************************************
-----------CONTEXT STARTS HERE------------
{% for doc in documents %}
{{ doc.content }}
------------------------------------------
{% endfor %}
-----------CONTEXT ENDS HERE------------
******************************************
Copy the tone of these characters' dialogue: {{ superhero_names }}, and answer the following question:
******************************************
Question: {{ query }}
******************************************
Answer:
"""

    # Define the prompt builder
    prompt_builder = PromptBuilder(template=prompt)
    return prompt_builder

def build_rag_pipeline(embedder, retriever, prompt_builder, generator):
    '''
    This is the main function that builds the RAG pipeline
    We start by creating a Pipeline object and adding the components to it
    We then connect the components to each other. This connection essentially defines the flow of data between the components
    '''
    rag = Pipeline()

    rag.add_component(instance=Multiplexer(str), name="multiplexer")

    rag.add_component("embedder", embedder)
    rag.add_component("retriever", retriever)
    rag.add_component("prompt", prompt_builder)
    rag.add_component("llm", generator)

    rag.connect("multiplexer.value", "embedder.text")
    rag.connect("multiplexer.value", "prompt.query")

    rag.connect("embedder.embedding", "retriever.query_embedding")
    rag.connect("retriever.documents", "prompt.documents")
    rag.connect("prompt.prompt", "llm")

    return rag

Now that we have defined all the utility functions, let’s move on to the next step.

superhero_names = get_superhero_names(superhero, config_path)
document_store = load_document_store(vector_store_path, superhero, embed_dim)
prompt_builder = build_prompt()

generator = HuggingFaceTGIGenerator(model=llm_model, generation_kwargs={"max_new_tokens": max_new_tokens})
retriever = QdrantEmbeddingRetriever(document_store=document_store, top_k=number_of_documents_to_retrieve)
embedder = FastembedTextEmbedder(model = embedding_model)

rag = build_rag_pipeline(embedder, retriever, prompt_builder, generator)

Here, we have built the RAG pipeline using the Haystack library. We first started by getting the superhero names and then loading the document store. We then built the prompt template and, finally, the RAG pipeline using the FastEmbed model to create the embeddings for the query. The QdrantEmbeddingRetriever is used to retrieve the relevant documents from the document store based on the query embeddings. The HuggingFaceTGIGenerator is used to generate the answer to the query based on the prompt template. Finally, we built the RAG pipeline using the embedder, retriever, prompt_builder, and generator.

One interesting thing about the Haystack library is that it allows us to visualize the RAG pipeline. We can visualize the pipeline using the `pipeline.show()` method. This will show the pipeline in a graphical format. This visualization is very helpful in understanding the flow of the pipeline and how the different components are connected to each other. The RAG pipeline we just made is shown below.

Our Haystack RAG Pipeline
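
`pipeline.show()` renders the diagram inline in a notebook. If you are running the code as a plain script instead, the same diagram can be written to an image file; a small sketch, assuming Haystack 2.x (note that the diagram is generated via an online Mermaid service, so it needs internet access):

from pathlib import Path

# Save the pipeline diagram to disk instead of rendering it inline
rag.draw(Path("rag_pipeline.png"))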

Great! We can now see how our RAG pipeline is designed. Now that we have built the RAG pipeline, we can go ahead and test the chatbot. We can start asking questions about the superheroes. The chatbot will retrieve the relevant information from the document store and generate the answer to the query.

question = 'Who are you?'

pipeline_input = {
    "multiplexer": {
        "value": question,
    },
    "prompt": {
        "superhero_names": superhero_names
    }
}

result = rag.run(pipeline_input)
response = result['llm']['replies'][0]
print(response)

Let’s check the response:

Well, it does sound like Thanos. Great, now we have built a character chatbot.

Part 4: Building a User Interface for the Chatbot

In this section, we will build a user interface for the character chatbot using the Streamlit library. Streamlit is a powerful library that allows us to build interactive web applications using simple Python scripts. We will build a simple web application that lets the user pick a superhero from a drop-down list and chat with the AI-powered character.

Let’s first create a helper function that will be used to generate the response in the Streamlit application. The process is exactly the same as in the previous section.

from os.path import join as pjoin
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack_integrations.components.embedders.fastembed import FastembedTextEmbedder

from utils import get_superhero_names, load_document_store, build_prompt, build_rag_pipeline

root = '.'
data_folder = 'data'
script_folder = 'scripts'
dialogue_folder = 'dialogues'
config_file = 'config.yaml'
embed_dim = 384
vector_store_name = 'QDRANT_VECTOR_DATABASE'
vector_store_path = pjoin(root, vector_store_name)
embedding_model = 'BAAI/bge-small-en-v1.5'
llm_model = "meta-llama/Meta-Llama-3-8B-Instruct"
max_new_tokens = 250
number_of_documents_to_retrieve = 5

config_path = pjoin(root, config_file)

def get_response(question, superhero):
    superhero_names = get_superhero_names(superhero, config_path)
    document_store = load_document_store(vector_store_path, superhero, embed_dim)
    prompt_builder = build_prompt()

    generator = HuggingFaceTGIGenerator(model=llm_model, generation_kwargs={"max_new_tokens": max_new_tokens})
    retriever = QdrantEmbeddingRetriever(document_store=document_store, top_k=number_of_documents_to_retrieve)
    embedder = FastembedTextEmbedder(model=embedding_model)

    rag = build_rag_pipeline(embedder, retriever, prompt_builder, generator)

    pipeline_input = {
        "multiplexer": {
            "value": question,
        },
        "prompt": {
            "superhero_names": superhero_names
        }
    }

    result = rag.run(pipeline_input)
    response = result['llm']['replies'][0]

    return response
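
One practical note: `get_response` rebuilds the whole pipeline (and reloads the FastEmbed model) for every single question, which makes each reply slower than it needs to be. A possible refactor, sketched below, caches one pipeline per selected superhero with Streamlit's `st.cache_resource`; `max_entries=1` keeps only the most recent pipeline alive, which also avoids holding several clients open against the same local Qdrant storage folder. This is a hypothetical optimization, not part of the original code.

import streamlit as st

@st.cache_resource(max_entries=1)
def build_pipeline_for(superhero):
    # Build the RAG pipeline once for the currently selected superhero and reuse it across reruns
    superhero_names = get_superhero_names(superhero, config_path)
    document_store = load_document_store(vector_store_path, superhero, embed_dim)
    prompt_builder = build_prompt()
    generator = HuggingFaceTGIGenerator(model=llm_model, generation_kwargs={"max_new_tokens": max_new_tokens})
    retriever = QdrantEmbeddingRetriever(document_store=document_store, top_k=number_of_documents_to_retrieve)
    embedder = FastembedTextEmbedder(model=embedding_model)
    return build_rag_pipeline(embedder, retriever, prompt_builder, generator), superhero_names

def get_response_cached(question, superhero):
    rag, superhero_names = build_pipeline_for(superhero)
    result = rag.run({
        "multiplexer": {"value": question},
        "prompt": {"superhero_names": superhero_names},
    })
    return result['llm']['replies'][0]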

Now, we will be writing some code in Streamlit to create a nice-looking user interface.

import streamlit as st
from streamlit_chat import message

from os.path import join as pjoin
import yaml
root = '.'
data_folder = 'data'
script_folder = 'scripts'
dialogue_folder = 'dialogues'
config_file = 'config.yaml'

with open(pjoin(root, config_file), 'r') as f:
    config = yaml.safe_load(f)
superheroes = config['LIST_OF_SUPERHEROES']

from query import get_response

def api_calling(question, superhero):
    response = get_response(question, superhero)
    return response

# Define superhero names
st.title("Chat with your favorite superhero!")

# Sidebar for superhero selection
st.sidebar.title("Superhero Selector")
selected_hero = st.sidebar.selectbox("Choose a superhero", superheroes)
# Display the selected superhero
st.sidebar.write(f"You selected: {selected_hero}")

# Initialize selected superhero in session state
if 'selected_hero' not in st.session_state:
    st.session_state['selected_hero'] = superheroes[0]  # Set default superhero

if 'user_input' not in st.session_state:
    st.session_state['user_input'] = []

if 'response' not in st.session_state:
    st.session_state['response'] = []

def get_text():
    input_text = st.text_input("write here", key="input")
    return input_text

user_input = get_text()

if user_input:
    output = api_calling(user_input, selected_hero)
    output = output.lstrip("\n")

    # Store the question and the generated response
    st.session_state.user_input.append(user_input)
    st.session_state.response.append(output)

    # Reset the local variable (note: this does not clear the text input widget itself)
    user_input = ""

message_history = st.empty()

if st.session_state['user_input']:
    for i in range(len(st.session_state['user_input']) - 1, -1, -1):
        # Display the chatbot's response
        message(st.session_state['response'][i],
                key=str(i), avatar_style="icons")
        # Display the user's question
        message(st.session_state['user_input'][i],
                avatar_style="miniavs", is_user=True,
                key=str(i) + 'data_by_user')

# Check if superhero selection has changed
if st.session_state['selected_hero'] != selected_hero:
    # Update selected superhero in session state
    st.session_state['selected_hero'] = selected_hero
    st.session_state['response'] = []
    st.session_state['user_input'] = []
    # Rerun the app to start fresh
    st.rerun()

# Update selected superhero in session state
st.session_state['selected_hero'] = selected_hero

This is how our UI looks. On the left side, we have the option to select the superhero with whom we would like to chat.

Streamlit UI for our RAG App

You can select any superhero you want from the left sidebar as shown below:

Now, let’s check the results for Deadpool.

Query

Tell me about yourself.

Response

Hey, so you want to know about me? Well, I'm Wade Wilson, the Merc with a Mouth, the Regenerating Degenerate, the guy who's been through more crap than a Kardashian's Instagram feed. I've got a mouth that won't stop, a sword that won't quit, and a heart that's been broken more times than a piñata at a kindergartener's birthday party.
I used to be a special forces operative, but then I got diagnosed with terminal cancer. And let me tell you, it's a real party-pooper. But then I got "cured" by a bunch of crazy scientists who experimented on me with a serum that gave me my regenerative powers. It's like I'm a human version of the Terminator, minus the whole "I'll be back" thing.
Now, I'm a mercenary, a vigilante, and a general all-around troublemaker. I've got a penchant for breaking the

Let’s check the results for Thanos.

Query

Who are the Avengers?

Response

The Avengers. A group of insignificant, fleeting beings, no different from the countless others I have wiped from existence. They are a nuisance, a distraction from the true purpose of the universe. And yet, they dare to challenge me, to think they can stop me from achieving my goal. But I will crush them, just as I have crushed all who have dared to stand in my way. The Avengers are nothing but a fleeting moment in the grand tapestry of time, and I will erase them from existence just as I have erased countless others before them.

Now, let’s chat with Batman.

Query

What do you think about Superman?

Response

I think Superman is an alien. And a trespasser and it doesn't belong on this planet. Perhaps you share my view?

Now, let’s chat with Ironman.

Query

What do you think you are?

Response

I think I know who I am. I'm a genius, billionaire, playboy philanthropist. I'm a superhero, a hero, a legend. I'm the one who's going to change the world, who's going to make a difference. I'm the one who's going to save the day, who's going to be the one to make it all better. And if you don't like it, well, that's your problem. I'm Tony Stark, and I'm not going anywhere.

Well, all of these do sound like the characters themselves!

Conclusion

We have successfully built a superhero character chatbot using the Haystack framework. We have also built a user interface using Streamlit. This chatbot can be used to interact with the superheroes and get responses from them.

Though this started as a fun project, it has the potential to be expanded into a full-fledged chatbot that can engage with users on a variety of topics. Building a chatbot for superhero characters is a great way to engage with fans and provide them with a unique experience. The chatbot can be further improved by adding more superheroes and dialogues to the dataset.

I hope you enjoyed reading this blog and learned something new. If you have any questions or feedback, feel free to leave a comment below. Thank you for reading!

Code: GitHub

The code for this blog can be found at: https://github.com/quamernasim/Superhero-Character-Based-On-RAG-AI-Using-Haystack-And-Qdrant
