Question Answering through Retrieval Augmented Generation

Jettro Coenradie
8 min read · Jun 25, 2023

I have been in the search business for many years. The first blog post of my own about Elasticsearch that I can find dates back more than ten years: in January 2013, I wrote Learning about ElasticSearch. From the start, I tried to transform a user's question into an answer. The question came in the form of one or a few words; the answer came in the form of ten results. Users had to click the links and find what they were looking for, sometimes helped by highlighting. We also had to overcome typos and alternative words (synonyms). Users of the search service wanted a Google-like experience.

As Google used to work on keywords only, we all got used to entering a few words, probably with the help of auto-complete. But times have changed. Google started recognizing questions and tried to answer them using content from the pages it found. Recently, let us say at the end of 2022, something changed: we were introduced to ChatGPT, a service provided by the company OpenAI. With ChatGPT, you can interact with a bot as you would chat with a person. ChatGPT is a chat interface on top of a Large Language Model (LLM). It generates answers to your questions, at least if you ask the right things.

ChatGPT and large language models in general have some areas for improvement when it comes to question answering. A model is trained on data from the past. It can generate words based on your prompt but is unfamiliar with your specific content. You can provide context to the LLM, but the amount of content, or number of tokens, is limited and depends on the model you choose.

After reading this blog, you will understand the general concept of a question-answering system and how to use Large Language Models to build a question-answering service. You will be able to generate personalized responses to any question your users may have, utilizing the information at your disposal. I use Weaviate to demonstrate multiple solutions: one uses locally running models, and one uses the OpenAI integration of Weaviate.

Types of question-answering systems

Q&A systems come in different flavors. In open systems, you provide the context from which to obtain the answer; in closed systems, the system has its own content to obtain the answer from. The mechanism to obtain the answer can differ as well. With extractive question answering, the answer is found literally in the provided context (or, for a closed system, in the content known to the system). Generative question answering uses the knowledge from the context or the model's own content to generate a new answer.

Retrieval Augmented Generation

The previous section described open and closed Q&A systems. Large Language Models (like ChatGPT) can do both: you can ask closed questions or provide your own context. This is important; LLMs are usually trained on data from one or two years ago. They are unfamiliar with the latest news and cannot access your data. To make this process easier, retrieval augmented generation (RAG) is the new kid on the block. The image below shows how it works.

Interaction between Retrieval Augmented Generation (RAG) and Large Language Models (LLM)

The image shows how a user asks a question to the RAG. The RAG passes a parsed version of the question to the content system. The content system returns relevant items to the RAG. Then, the RAG passes the question, together with the relevant items, to the Large Language Model (LLM), which returns the answer to the RAG, and the RAG returns it to the user.
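In code, this flow boils down to a few steps. The sketch below is a minimal, hypothetical illustration of that loop; the retriever and llm objects and their methods are placeholders, not actual code from this post.

def answer_question(question: str, retriever, llm) -> str:
    # 1. The content system returns relevant items for the question
    documents = retriever.search(question)
    # 2. The retrieved items become the context for the prompt
    context = "\n\n".join(doc["text"] for doc in documents)
    # 3. The LLM generates an answer from the question plus the context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.complete(prompt)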

The experiments

For this blog, I do three different experiments. In the first experiment, I use the Weaviate question-answering module with transformers. Next, I use the question-answering module with OpenAI. Finally, I use the generative OpenAI module and a custom prompt.

The dataset I use is the open dataset from the Dutch Government containing the “Vraag antwoord combinaties” (question-answer combinations). Please refer to the blog I wrote for my employer about importing the content into Weaviate and OpenSearch.

Setting up the project

Before I describe the experiments, let me point you to the source code. You can find the repository through the link below. The readme helps you set up the project if you want to run it yourself. For this blog, refer to the file streamlit_rag.py.

The GUI makes use of Streamlit. You run the Streamlit application using the following command.

streamlit run streamlit_rag.py
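To give an idea of what the page looks like in code, the following is a minimal Streamlit sketch; the function ask_weaviate is a hypothetical placeholder for one of the query functions discussed below, not the actual contents of streamlit_rag.py.

import streamlit as st

st.title("Question answering with Weaviate")
question = st.text_input("Ask your question")
if question:
    # hypothetical: calls one of the Weaviate query functions from this post
    answer = ask_weaviate(question)
    st.write(answer)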

If you prefer to run it from PyCharm, you can create a runner like the following image. That way, you can also use the debugger.

Run configuration for Streamlit in PyCharm

Experiment 1: Using the Weaviate QnA module with a custom model

This experiment uses Docker to run a local instance of Weaviate. You can find a docker-compose file named docker-compose-weaviate.yml in the folder infra. The file defines three Docker containers: one for the Weaviate instance, one for the transformer module, and one for the qna module. The qna module is slightly different; you must build the image yourself. Refer to the Dockerfile mdeberta.Dockerfile.

The main reason for using different Docker images is to support the Dutch language. The qna module uses the model timpal0l/mdeberta-v3-base-squad2 from Hugging Face.

The next code block shows how to build the Docker image and how to run the complete setup.

docker build -f mdeberta.Dockerfile -t mdeberta-qna-transformers .
docker compose -f docker-compose-weaviate.yml up -d
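For reference, custom Weaviate qna-transformers images are typically built by downloading a model on top of the custom base image. The following is a sketch of what mdeberta.Dockerfile roughly contains; check the repository for the actual file.

FROM semitechnologies/qna-transformers:custom
RUN MODEL_NAME=timpal0l/mdeberta-v3-base-squad2 ./download.py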

The sample does not work without content. You can use the script run_weaviate_qna.py to load it. The time it takes depends on your machine; on my four-year-old MacBook, it took around 15 minutes.
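A sketch of what loading the content could look like with the Weaviate Python (v3) client follows; load_vacs is a hypothetical loader for the dataset, and run_weaviate_qna.py in the repository is the real implementation.

import weaviate

client = weaviate.Client("http://localhost:8080")
client.batch.configure(batch_size=100)

with client.batch as batch:
    for vac in load_vacs():  # hypothetical: yields dicts with the dataset fields
        batch.add_data_object(
            data_object={"question": vac["question"], "antwoord": vac["antwoord"]},
            class_name="RijksoverheidVac",
        )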

The next image is an overview of the solution space for this experiment.

RAG implementation using Weaviate qna module with a custom transformer

The next code block shows the code to send a query to Weaviate. The get part specifies what to return in the response. The with_ask part specifies the question and the fields to search for the answer in.

def q_and_a_transformer(client: WeaviateClient, question: str):
    ask = {
        "question": question,
        "properties": ["question", "antwoord"],
        "rerank": True
    }

    return (
        client.client.query
        .get("RijksoverheidVac", [
            "question",
            "antwoord",
            "_additional {answer {hasAnswer certainty property result startPosition endPosition} }"
        ])
        .with_ask(content=ask)
        .with_limit(1)
        .do()
    )
Screen with the question and the found answer from Weaviate
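To inspect what comes back programmatically, you can dig into the GraphQL response. A sketch, assuming the standard layout of a Weaviate Get response:

result = q_and_a_transformer(client, "Hoeveel alcohol mag ik gebruiken als ik nog moet autorijden")
vac = result["data"]["Get"]["RijksoverheidVac"][0]
answer = vac["_additional"]["answer"]
if answer["hasAnswer"]:
    # the extracted answer plus the model's confidence in it
    print(f"{answer['result']} (certainty: {answer['certainty']:.2f})")
else:
    print("No answer extracted from the best matching document.")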

The answer is extracted from the best matching document. The extracted answer might sound odd; the best matching question-answer combination might not be what you expect. It helps to look at the question and answer that were used. The next image shows the screen with the matched question-and-answer combination.

Screen with the found question-and-answer to extract the answer from

We do not get an answer to the question “Hoeveel alcohol mag ik gebruiken als ik nog moet autorijden” (“How much alcohol may I consume if I still have to drive”). There is a matching document that does contain the answer. Let us go through the other experiments to find one.

Experiment 2: Using the Weaviate OpenAI QnA module

This experiment uses a generative model: it uses the provided context to generate a new piece of text. The setup is different; the biggest difference is that we now use Weaviate Cloud Services. You can create a sandbox environment for free. It runs for 14 days, enough time to do some experimentation. You can use the script run_langchain_ro_vac.py to insert the content. When inserting content, you must configure the modules you want to use. You can find the schema in the folder config_files, in the file rovac_weaviate_schema.json.
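To give an idea of the module configuration, the following is a minimal sketch of a class definition that enables the OpenAI modules; it is an assumed example, not the actual contents of rovac_weaviate_schema.json.

schema_class = {
    "class": "RijksoverheidVac",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "qna-openai": {},
        "generative-openai": {},
    },
    "properties": [
        {"name": "question", "dataType": ["text"]},
        {"name": "antwoord", "dataType": ["text"]},
    ],
}

# client.client is the same wrapped weaviate.Client used in the query functions
client.client.schema.create_class(schema_class)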

The next image shows an overview of the solution, followed by a code block with the Python query.

RAG implementation using Weaviate qna module with OpenAI
def q_and_a_openai(client: WeaviateClient, question: str):
    ask = {
        "question": question,
        "properties": ["question", "antwoord"]
    }

    return (
        client.client.query
        .get("RijksoverheidVac", [
            "question",
            "antwoord",
            "_additional {answer {hasAnswer property result startPosition endPosition} }"
        ])
        .with_ask(content=ask)
        .with_limit(1)
        .do()
    )

With the new qna-openai module, the result for the question “Hoeveel alcohol mag ik gebruiken als ik nog moet autorijden” improves.

Als je nog moet autorijden, mag je niet meer dan 0,5 promille alcohol in je bloed hebben. (“If you still have to drive, you may not have more than 0.5 per mille of alcohol in your blood.”)

Experiment 3: Using the Weaviate OpenAI Generative module

We use the generative-openai module from Weaviate in this third and final experiment. As the name suggests, it is a generative model. With this module, we can provide our own prompt to interact with OpenAI.

RAG implementation using Weaviate generative module with OpenAI
def generative_openai(client: WeaviateClient, query: str = "enter your query"):
    prompt = f"""Use the following pieces of context to answer the question at the end. If you don't know the
    answer, just say that you don't know, don't try to make up an answer.

    {{antwoord}}

    Question: {query}
    Answer in Dutch:"""

    return (
        client.client.query
        .get("RijksoverheidVac", ["question", "antwoord"])
        .with_generate(single_prompt=prompt)
        .with_near_text({
            "concepts": [query]
        })
        .with_limit(5)
        .do()
    )
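The generated answer comes back in the _additional part of each returned object. A sketch of reading it, assuming the standard layout of Weaviate's generative search results:

result = generative_openai(client, "Hoeveel alcohol mag ik gebruiken als ik nog moet autorijden")
for vac in result["data"]["Get"]["RijksoverheidVac"]:
    generated = vac["_additional"]["generate"]
    if generated["error"] is None:
        print(generated["singleResult"])  # the answer generated for this document
        break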

Will the result for the question “Hoeveel alcohol mag ik gebruiken als ik nog moet autorijden” improve? In my opinion, it does.

Als beginnend bestuurder mag u niet rijden met meer dan 0,2 promille alcohol in uw bloed. Als ervaren bestuurder mag u niet rijden met meer dan 0,5 promille alcohol in uw bloed. (“As a novice driver, you may not drive with more than 0.2 per mille of alcohol in your blood. As an experienced driver, you may not drive with more than 0.5 per mille of alcohol in your blood.”)

This answer is the most complete answer of the three experiments.

Some takeaways

It helps a lot to ask the right question. There is a risk that the wrong document matches and the generated answer is not what you are looking for. A nice example is the difference between the following two sentences: “Hoeveel alcohol mag ik gebruiken als ik nog moet autorijden” and “Hoeveel mag ik drinken als ik moet rijden” (“How much may I drink if I have to drive”). The second sentence ranks a document about drinking and riding a bicycle higher.

In most of the sentences I tried, the setup with the generative OpenAI module gives the best result, and you can still customize the prompt to your liking. The disadvantages are performance (latency) and cost; both are a lot higher than with the other setups.

As a final remark, I found it easy to experiment with the three different mechanisms to implement a question-answering solution with Weaviate. And I liked trying out Streamlit to create user interfaces.

Contact me if you have questions or need help with your own question-answering systems. You can visit my employer's website to learn more about what I am doing these days.


Started out as a software developer, learned a lot about search, and am now learning about machine learning. I work as a Software Architect and Search Relevance Engineer.