Create an LLM-powered Discord Bot

llm
ml
nlp
Author

Hamza

Published

November 7, 2023

In this beginner-friendly blog post we are going to build an LLM-powered Discord bot using llama.cpp. Here are the steps we are going to follow:

1. Download an LLM from Hugging Face
2. Set up a REST API to serve this model
3. Create a Discord Bot application
4. Use the REST API to operate the bot

This is a beginner-friendly tutorial that assumes only basic knowledge of Python and Linux. I will be working on Ubuntu 20.04 installed on WSL2, but any other Linux distribution should work too.

1. Prerequisites:

1.1 Tools:

Let’s first update apt to get the latest metadata for Ubuntu packages. Open your Linux/WSL2 terminal and type the following command:

sudo apt-get update

Now let’s install some packages that we’re going to need.

sudo apt-get install python3-virtualenv python3-pip curl jq 
  • virtualenv is a tool to create isolated Python environments. It creates a folder which contains all the necessary executables to use the packages that a Python project needs. It’s better to create a virtual environment for each project so that even if we mess up our installations, the mess won’t spread to the rest of the system.
  • pip is the standard package manager for Python. It allows you to install and manage packages that aren’t part of the Python standard library.
  • curl (short for “Client URL”) is a command-line tool that enables data transfer over various network protocols. It communicates with a web or application server by specifying a relevant URL and the data that need to be sent or received.
  • jq is a lightweight and flexible command-line JSON processor that we will use to parse and format the replies from our API.

Now that we have pip installed, let’s install more tools that we’re going to need:

pip install discord.py python-dotenv

  • discord.py is a Python library that exhaustively implements Discord’s APIs in an efficient and Pythonic way. This includes utilizing Python’s implementation of async IO.
  • python-dotenv reads key-value pairs from a .env file and can set them as environment variables. It helps in the development of applications following the twelve-factor principles.

1.2 The Model:

We are going to use llama.cpp, so you should download a model that is compatible with it. You can browse Hugging Face and check a model’s card to see whether it is compatible with llama.cpp; an easy rule of thumb is to pick models distributed in the GGUF format.

I am going to use Llama-2-7B-Chat-GGUF by TheBloke, available here.
In the model card we can see that there are many versions of the model; the difference between them is the quantisation method, and for each version you can see its size and how much RAM it needs to run. I downloaded the llama-2-7b-chat.Q5_K_M.gguf version.

To download a model, go to the Files and versions tab, right-click on the model file you want, and choose Copy Link.
Now you can download it with the wget command in the terminal like this:

wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf

If you don’t have wget already installed, you can install it by simply typing:

sudo apt-get install wget

If you want better-quality replies and have the hardware for it, you can download larger models (e.g. 13B and higher).
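As an aside, if you prefer to script the download, the huggingface_hub Python package can fetch the same file; here is a minimal sketch (you would first pip install huggingface_hub):

from huggingface_hub import hf_hub_download

# Downloads the quantized model into the local Hugging Face cache
# and returns the path of the downloaded file.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q5_K_M.gguf",
)
print(path)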

1.3 Setting up the project files:

Now let’s create a directory for our project.
We can do that with the following command:

mkdir discord-llm-bot

Now let’s access our newly created folder using:

cd discord-llm-bot

We create a virtual environment that we will call env:

virtualenv env

Once created, we need to activate our environment

source env/bin/activate

You will now see (env) at the beginning of your terminal prompt, which means that your virtual environment is active.
We will be using the Python bindings for llama.cpp. On the Python bindings page, go down to the Web Server section and use the following command to install them:

pip install "llama-cpp-python[server]"

2. Using the Model:

Now let’s load our downloaded model. To do so we use the following command:

python3 -m llama_cpp.server --model ~/llama-2-7b-chat.Q5_K_M.gguf

Here I am using the llama-2-7b-chat.Q5_K_M.gguf model, which I downloaded to my home folder, hence the ~ in the path. The command above must point to the .gguf file of the model you downloaded. After running that command you should see output in the terminal similar to this:

INFO:     Started server process [3414]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
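Before opening a browser, you can sanity-check from a second terminal that the server is reachable; a minimal sketch using only the Python standard library (it simply fetches the docs page the server exposes):

import urllib.request

# A 200 status code means the llama-cpp-python server is up and serving.
with urllib.request.urlopen("http://localhost:8000/docs") as resp:
    print(resp.status)  # expect 200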

Let’s now navigate to http://localhost:8000/docs to see the OpenAPI schemas.
Since we are building a chatbot, what interests us is the POST request for creating a chat completion.
Click on /v1/chat/completions and copy the code from the Example Value field.

The code is:

{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "What is the capital of France?",
      "role": "user"
    }
  ]
}

Go back to your terminal and, in a new tab (the first one is still busy running your model, so leave it alone), create a new file in your project directory and name it payload.json.
I use neovim; you can use your preferred text editor.

nvim payload.json

Paste the code you copied from the docs into the newly created JSON file. There are two message objects in the code we copied: one where the role is system and another where the role is user.
We won’t need the system one, since we are not aiming to modify the model’s behavior, so let’s delete it.
The code should now look like this:

{
  "messages": [
    {
      "content": "What is the capital of France?",
      "role": "user"
    }
  ]
}

For the user message, we need to change the query string to match the prompt format supported by our model; we can do this by adding ### Response: at the end of the content field.

The code should look like this:

{
  "messages": [
    {
      "content": "What is the capital of France? ### Response: ",
      "role": "user"
    }
  ]
}

We are good to go! Save and close your file (ESC then :wq if you are using vim/neovim ;)).
Now let’s test our model. Write the following command in the terminal:

curl http://localhost:8000/v1/chat/completions -d @payload.json -H "Content-Type: application/json"

It may take a while to produce a result, but then it should output a response like this:

{"id":"chatcmpl-e219bd31-4515-4f1c-bc57-ea0d4e4f4907","object":"chat.completion","created":1699287163,"model":"/home/llama-2-7b-chat.Q5_K_M.gguf","choices":[{"index":0,"message":{"role":"assistant","content":" The capital of France is Paris."},"finish_reason":"stop"}],"usage":{"prompt_tokens":35,"completion_tokens":7,"total_tokens":42}}

We can see in the content field that our model got the right answer: " The capital of France is Paris."
We can see this more clearly if we format the output of our model using jq, by piping the curl request into it like this:

curl http://localhost:8000/v1/chat/completions -d @payload.json -H "Content-Type: application/json" | jq

You will get a better formatted and more readable result like this one:

{
  "id": "chatcmpl-a326fe62-8e64-4d12-b21d-84eecb18fe71",
  "object": "chat.completion",
  "created": 1699287412,
  "model": "/home/llama-2-7b-chat.Q5_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 35,
    "completion_tokens": 7,
    "total_tokens": 42
  }
}
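If you only want the text of the reply, jq can extract it directly; the -r flag prints the raw string instead of a quoted JSON value:

curl http://localhost:8000/v1/chat/completions -d @payload.json -H "Content-Type: application/json" | jq -r '.choices[0].message.content'

This is exactly the field our Python script will pull out later.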

Now let’s ask it for something more complex. Let’s go back and edit our prompt in the payload.json file:

{
    "messages": [
        {
            "content": "Tell me what quantum physics is about? ### Response: ",
            "role": "user"
        }
    ]
}

Save the file and run the same curl command again (use the up arrow key to retrieve previous commands in the terminal):

curl http://localhost:8000/v1/chat/completions -d @payload.json -H "Content-Type: application/json" | jq

The output is:

{
  "id": "chatcmpl-cadcbf3f-e72b-4934-b1c5-5e761547e9ef",
  "object": "chat.completion",
  "created": 1699289318,
  "model": "/home/llama-2-7b-chat.Q5_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Quantum physics, also known as quantum mechanics, is a branch of physics"
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 16,
    "total_tokens": 53
  }
}

The reply our model gave (found in the content field) is short; the model stopped before giving the whole answer. We can find out why by looking at the finish_reason field: here the reason is length.
This is due to the maximum completion length, which is set to 16 tokens by default.
We can change this in the body of the query by editing payload.json like this:

{
    "max_tokens":512,
    "messages": [
        {
            "content": "Tell me what quantum physics is about? ### Response: ",
            "role": "user"
        }
    ]
}

Save and run the curl command again; now we get a longer (and much slower) response:

{
  "id": "chatcmpl-17562ec9-6e71-4b0b-8057-b63db37b165f",
  "object": "chat.completion",
  "created": 1699290401,
  "model": "/home/llama-2-7b-chat.Q5_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Quantum physics, also known as quantum mechanics, is a branch of physics that explores the behavior of matter and energy at the smallest scales, at which the classical laws of physics no longer apply. At these scales, the principles of quantum mechanics govern the behavior of particles and systems, leading to phenomena such as superposition, entanglement, and wave-particle duality.\n\nQuantum physics is based on the idea that particles, such as electrons and photons, can exist in multiple states simultaneously, a concept known as superposition. This means that a quantum particle can be in more than one place at the same time, or have more than one set of properties, such as spin or energy, at the same time.\n\nAnother fundamental aspect of quantum physics is entanglement, which occurs when two or more particles become connected in such a way that their properties are correlated, regardless of the distance between them. This means that if something happens to one particle, it will instantly affect the other entangled particles, regardless of how far apart they are.\n\nOne of the most counterintuitive aspects of quantum physics is wave-particle duality. According to this principle, particles can exhibit both wave-like and particle-like behavior depending on how they are observed. This means that a quantum particle can display properties of a wave, such as diffraction and interference, or properties of a particle, such as having definite position and momentum.\n\nQuantum physics has many practical applications in technology, including transistors, lasers, and computer chips. It also has the potential to revolutionize fields such as medicine, energy production, and materials science. However, quantum physics is still an area of ongoing research and debate, with many unanswered questions and unresolved paradoxes.\n\nIn summary, quantum physics is a branch of physics that explores the behavior of matter and energy at the smallest scales, where the classical laws of physics no longer apply. It is based on principles such as superposition, entanglement, and wave-particle duality, which lead to counterintuitive phenomena and have many practical applications in technology and other fields."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 453,
    "total_tokens": 490
  }
}

2.1 chat.py

Writing curl requests by hand each time is tedious; let’s write a script to make life easier for us.

For our script we will use the aiohttp library; its quickstart example is enough for what we want to do.
Let’s copy the imports and the first example into a new script file in our project directory that we will call chat.py:

nvim chat.py

After pasting the imports and the first example, our file looks like this:

import aiohttp
import asyncio
async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get') as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())

We must modify a few things to make this work. First, we create the following variables:

API_URL = 'http://localhost:8000/v1/chat/completions'
payload = {
    "max_tokens":512,
    "messages": [
        {
            "content": "Tell me what quantum physics is about? ### Response: ",
            "role": "user"
        }
    ]
}
headers = {"Content-Type":"application/json"}

Then we need to use a POST request instead of a GET request and we also need to specify our API_URL, headers, and payload like this:

async with session.post(API_URL, data = payload, headers = headers) as resp:

Now, in order for our payload to be parsed correctly, we need to serialize it to a string for the data argument, so we import json and use data = json.dumps(payload) to do so.
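As an aside, aiohttp can also do this serialization for us: passing json=payload to session.post() sends the dict as JSON and sets the Content-Type header automatically. We’ll keep the explicit json.dumps in this tutorial so every step stays visible.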

Our chat.py should look like this:

import json
import aiohttp
import asyncio
API_URL = 'http://localhost:8000/v1/chat/completions'
payload = {
    "max_tokens":512,
    "messages": [
        {
            "content": "What is the Capital of Japan? ### Response: ",
            "role": "user"
        }
    ]
}
headers = {"Content-Type":"application/json"}

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.post(API_URL, data = json.dumps(payload), headers = headers) as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())

(I changed the prompt, because the one about quantum physics takes a long time to answer.) Let’s save the script and run it:

python3 chat.py

This gives us the following result:

{"id":"chatcmpl-f619a4ee-8509-4029-8b31-20b65db84ffd","object":"chat.completion","created":1699293282,"model":"/home/llama-2-7b-chat.Q5_K_M.gguf","choices":[{"index":0,"message":{"role":"assistant","content":" The capital of Japan is Tokyo."},"finish_reason":"stop"}],"usage":{"prompt_tokens":35,"completion_tokens":7,"total_tokens":42}}

Let’s modify our script to print just the reply we care about, and also so that we don’t have to edit the script each time we want to use a different prompt.

Let’s first see what the script would look like and then explain our changes.

import json
import aiohttp
import asyncio

API_URL = 'http://localhost:8000/v1/chat/completions'
headers = {"Content-Type":"application/json"}

async def main():
    prompt = input("User: ")
    payload = {
        "max_tokens":512,
        "messages": [
            {
                "content": f"{prompt} ### response: ",
                "role": "user"
                }
            ]
        }
    async with aiohttp.ClientSession() as session:
        async with session.post(API_URL, data = json.dumps(payload), headers = headers) as resp:
            reply = await resp.json()
            reply_content = reply["choices"][0]["message"]["content"]
            print(reply_content)

asyncio.run(main())
  • First, we moved the payload variable into the main() function.
  • Then we created a new variable prompt that takes the user’s input and uses it as the prompt; you can see that this variable is used in the content field of the payload (line 14).
  • We replaced the print statements inside the session (line 21) with a variable called reply, which holds the whole JSON reply.
  • We then take only what matters to us (line 22), the text in the content field, and put it in the reply_content variable.

Now let’s save chat.py and run it again with python3 chat.py.
I gave it the prompt Give me 3 ancient African cities and the output was:
 Sure! Here are three ancient African cities that were important centers of trade, culture, and civilization:

1. Memphis, Egypt - Founded around 2925 BCE by the pharaoh Narmer, Memphis was the capital of ancient Egypt and one of the most important cities in the ancient world. It was located on the west bank of the Nile River and was known for its impressive architecture, including the Great Sphinx and the Pyramids of Giza.
2. Axum, Ethiopia - Axum was a major center of trade and commerce in ancient Africa, located in what is now modern-day Ethiopia. Founded around 100 CE, it was known for its advanced agriculture, architecture, and engineering. The city was also an important hub for the spread of Christianity throughout East Africa.
3. Timbuktu, Mali - Located in what is now modern-day Mali, Timbuktu was a major center of trade and learning in West Africa during the medieval period. Founded around 1100 CE, it was known for its universities and libraries, which were renowned throughout the Islamic world. The city was also an important center for the production of books and manuscripts.
These three cities were all major centers of trade, culture, and civilization in their respective regions during ancient times, and they played important roles in shaping the history of Africa and the wider world.

2.2 Having a conversation:

When you run chat.py you will notice that the conversation ends after the first response, which is not ideal for a chatbot. Let’s modify chat.py to support longer conversations.

import json
import aiohttp
import asyncio

API_URL = 'http://localhost:8000/v1/chat/completions'
headers = {"Content-Type":"application/json"}

async def main():
    while True:
        prompt = input("User: ")
        payload = {
            "max_tokens":512,
            "messages": [
                {
                    "content": f"{prompt} ### response: ",
                    "role": "user"
                }
            ]
        }
        async with aiohttp.ClientSession() as session:
            async with session.post(API_URL, data = json.dumps(payload), headers = headers) as resp:
                reply = await resp.json()
                reply_content = reply["choices"][0]["message"]["content"]
                print(f"Bot: {reply_content}")

asyncio.run(main())

We wrapped our code in a while loop so that we can maintain a conversation for as long as we like, and we modified the print statement of the reply to show that it’s the bot talking.
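As a side note, the aiohttp documentation recommends reusing a single ClientSession (and its connection pool) rather than opening a new one per request. Here is a minimal sketch of the same loop with the session hoisted out, assuming the same imports, API_URL, and headers as above; either version works for our purposes:

async def main():
    # One session (and connection pool) for the whole chat,
    # instead of a fresh one per message.
    async with aiohttp.ClientSession() as session:
        while True:
            prompt = input("User: ")
            payload = {
                "max_tokens": 512,
                "messages": [
                    {"content": f"{prompt} ### response: ", "role": "user"}
                ]
            }
            async with session.post(API_URL, data=json.dumps(payload), headers=headers) as resp:
                reply = await resp.json()
                print(f"Bot: {reply['choices'][0]['message']['content']}")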
Now run chat.py again and you can chat for as long as you want. Use CTRL + C to terminate the program. Here is an example of a chat session I had with the bot:

User: What is the largest city in Africa?
Bot:  The largest city in Africa is Lagos, Nigeria. With a population of over 21 million people, Lagos is not only the largest city in Africa but also one of the fastest-growing cities in the world. It is located in the southwestern part of Nigeria and is known for its vibrant culture, diverse economy, and rich history.
User: What is the second largest one?
Bot:  The second largest planet in our solar system is Jupiter.

As you can see, we can now chat with our bot, but unfortunately it can’t keep track of the context of the conversation. Let’s fix that.

2.3 Handling context:

Our model is not following the context of the conversation. This is because the chat completions API we are using is stateless: the whole conversation has to be passed in the messages field of every request. You can read more about this here.
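For reference, here is roughly what a two-turn payload looks like with the standard role alternation, where the assistant’s reply is included as its own message (our script below will take a slightly different shortcut):

{
  "max_tokens": 512,
  "messages": [
    { "content": "What is the largest city in Africa? ### response: ", "role": "user" },
    { "content": " The largest city in Africa is Lagos, Nigeria.", "role": "assistant" },
    { "content": "What is the second largest one? ### response: ", "role": "user" }
  ]
}

Now let’s modify our chat.py script: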

import json
import aiohttp
import asyncio

API_URL = 'http://localhost:8000/v1/chat/completions'
headers = {"Content-Type":"application/json"}

payload = {
    "max_tokens":512,
    "messages": []
}
async def main():
    while True:
        prompt = input("User: ")
        msg = {
            "content": f"{prompt} ### response: ",
            "role": "user"
        }
        payload["messages"].append(msg)
        async with aiohttp.ClientSession() as session:
            async with session.post(API_URL, data = json.dumps(payload), headers = headers) as resp:
                reply = await resp.json()
                reply_content = reply["choices"][0]["message"]["content"]
                print(f"Bot: {reply_content}")
        msg_idx = payload["messages"].index(msg)
        payload["messages"][msg_idx]["content"] += reply_content


asyncio.run(main())

First, we took the payload out of the while loop (lines 8 to 11); it now has an empty list in the messages field that we are going to fill with our conversation as it progresses.
We then created a msg variable (line 15) which contains the prompt the user just typed. This msg is then appended to the payload’s messages field (line 19). The rest of the code is similar and produces a response, i.e. reply_content. Now we need to take this response and append it to its corresponding prompt in the payload’s messages field. We do that by first getting the index of the message we just sent (line 25), then adding the response to that prompt (line 26).

Now save and run the script again. We can now see that our model can keep up with us:

User: What is the largest city in Africa?
Bot:  The largest city in Africa is Lagos, Nigeria. It has a population of over 20 million people, according to estimates in 2020.
User: What is the second largest one?
Bot:  The second largest city in Africa is Cairo, Egypt. It has a population of approximately 20 million people, according to estimates in 2020.

Now we are good to go. Let’s put everything together and create a Discord Bot.

3. Putting everything together:

3.1 Creating a Discord App:

First, go to the Discord Developer Portal at https://discord.com/developers/applications, log in, and click New Application to create your bot application. Then, in the General Information page, give your bot a name and, if you desire, a description; you can also upload a profile picture for it.

  • Click Save Changes

Now, in the menu on the left, go to the Bot section:

  • Click on Reset Token button (confirm the message)
  • Click Copy to copy it

Now go back to your project directory and create a new file called .env, then paste your token into it with TOKEN= at the beginning.
Your .env file should look like this:

TOKEN=<HERE IS WHERE YOU WILL PASTE YOUR TOKEN>

Save and close this file.
Now, back on the Bot configuration page, you need to change two things:

  • Toggle off the Public Bot button.
  • Toggle on the MESSAGE CONTENT INTENT button.

[screenshot: Bot settings]

Now go to the OAuth2 tab; we need to generate a URL that will let us invite our bot to our server.

  • Go to URL Generator and tick the bot box, as shown in the following screenshot:

[screenshot: URL Generator with the bot scope ticked]

  • Now tick the following permissions:
      • Send Messages
      • Create Public Threads
      • Create Private Threads
      • Send Messages in Threads
      • Send TTS Messages
      • Read Message History
      • Use External Emojis
      • Use External Stickers
      • Add Reactions

[screenshot: bot permissions]

Now scroll to the bottom, copy the generated URL, and open it in a new browser tab.

If this is your first time using Discord, you must create a server to add your bot to; you can do that by following this tutorial.
Once you have a server in your account:

  • Open the generated URL copied previously in a new tab.
  • Select your server.

[screenshot: server selection]

  • Click Continue.
  • Authorize the permissions.
  • Prove your humanity! (easier said than done :( )

Now you are in the server with your bot!
[screenshot: the server with your bot added]

Our Discord bot is created; now all we have left to do is to ‘connect’ it to our model.

3.2 LLM Discord Bot:

Now we need to write a script that connects our Discord bot to our local LLM.

Let’s go back to the terminal and create a copy of chat.py named bot.py, since we will need most of the code we wrote previously.

cp chat.py bot.py
Now open bot.py.

For the code of our bot we will use an example from the discord.py repository, which can be found at this link: https://github.com/Rapptz/discord.py/blob/master/examples/reply.py

  • Copy all the code in reply.py and paste it at the end of your bot.py file.
  • Move import discord to the top.
  • Remove the original asyncio.run(main()).
  • Remove the comment from the example.
  • Move all the code in the while loop under await message.reply('Hello!', mention_author=True) and indent it properly.
  • Move await message.reply('Hello!', mention_author=True) under print(f"Bot: {reply_content}") and indent it properly.
  • Delete print(f"Bot: {reply_content}").
  • Edit await message.reply('Hello!', mention_author=True) to await message.reply(reply_content, mention_author=True).
  • Delete the async def main function.
  • In if message.content.startswith('!hello'):, replace !hello with !bot.
  • Replace prompt = input("User: ") with stripped_msg = str(message.content).replace('!bot','').strip().
  • Now, in the msg variable, change {prompt} to {stripped_msg}.

We are almost done; we still need to load the token we put in the .env file earlier. To do so, add the following code to the top of the file:

import os
from dotenv import load_dotenv
load_dotenv()
TOKEN = os.getenv("TOKEN")

and at the end of the file edit client.run('token') to client.run(TOKEN).
Your bot.py file should look like this:

import json
import aiohttp
import asyncio
import discord
import os

from dotenv import load_dotenv 
load_dotenv()
TOKEN = os.getenv("TOKEN")

API_URL = 'http://localhost:8000/v1/chat/completions'
headers = {"Content-Type":"application/json"}

payload = {
    "max_tokens":512,
    "messages": []
}

class MyClient(discord.Client):
    async def on_ready(self):
        print(f'Logged in as {self.user} (ID: {self.user.id})')
        print('------')

    async def on_message(self, message):
        # we do not want the bot to reply to itself
        if message.author.id == self.user.id:
            return

        if message.content.startswith('!bot'):
            stripped_msg = str(message.content).replace('!bot','').strip()
            msg = {
                "content": f"{stripped_msg} ### response: ",
                "role": "user"
            }
            payload["messages"].append(msg)
            async with aiohttp.ClientSession() as session:
                async with session.post(API_URL, data = json.dumps(payload), headers = headers) as resp:
                    reply = await resp.json()
                    reply_content = reply["choices"][0]["message"]["content"]
                    await message.reply(reply_content, mention_author=True)
            msg_idx = payload["messages"].index(msg)
            payload["messages"][msg_idx]["content"] += reply_content 


intents = discord.Intents.default()
intents.message_content = True

client = MyClient(intents=intents)
client.run(TOKEN)

Save it and run it with python bot.py. You should see a message like this in the terminal:

2023-11-07 17:10:33 INFO     discord.client logging in using static token
2023-11-07 17:10:34 INFO     discord.gateway Shard ID None has connected to Gateway (Session ID: ff89fsdf2342502081fe39).
Logged in as llm-bot#8912 (ID: 1173453534535)
------

Go back to your Discord client and you will see that the bot is now online.
To interact with the bot, start your message with !bot; that’s the flag that prompts the bot to answer.

Congratulations, you now have a working LLM Discord Bot!
[screenshot: the bot replying in Discord]

You can find the code used in this tutorial in my GitHub Repository.
If you have any questions, feel free to Contact Me.