QA in Python
13 Oct 2021 11 min to read

Question Answering (QA) System in Python – Introduction to NLP & a Practical Code Example

Question Answering (QA) is a branch of the Natural Language Understanding (NLU) field (which falls under the NLP umbrella). It aims to implement systems that, given a question in natural language, can extract relevant information from the provided data and present it in the form of a natural language answer.

For example, after being asked, “how warm is it going to be today?” your Siri can extract raw information about today’s temperature from a weather service. Then, instead of showing it to you as is, it processes the data and presents it to you in proper English (or in any other supported language).

Historically, one of the first implementations of a QA system was the program BASEBALL (1961), created at MIT Lincoln Laboratory. It was able to answer questions about baseball league scores, statistics etc., using a rule-based language model for “decoding” the question and generating natural text, together with access to a baseball database for finding the actual answers.

As fun and retro as the example above may seem, it is hard to imagine it being more valuable than simply having the baseball data in a spreadsheet. Today, however, most of the data that we produce as a society is not structured in a single table like baseball game scores. Instead, it is unstructured: images, audio recordings, social media behavioural data and (most importantly for this article) natural text. That is why the focus of today’s QA field has shifted from generating answers in natural language (we have big language models like GPT-3 and BERT for that now) towards extracting factual information from unstructured data.

Different types of QA

Every QA system can be categorized based on two criteria:

  • domain
  • type of answers

Domain criterion

Single domain systems

These are systems that are fine-tuned for answering questions from one specific domain. Think about a program that answers patients’ questions about heart diseases, or another one that mines a company’s internal data for an executive officer. This kind of system has an advantage when it comes to ambiguity in natural language. For example, the word “heart” in “heart disease” will always mean an actual human organ rather than a duck heart in British cooking recipes.

Open-domain systems

These are a natural extension of single domain QA systems. Instead of focusing only on one narrow area of expertise, they are designed to answer more general questions. Think voice assistants or a model trained on all the Wikipedia articles.

Answer type criterion

Yes/No answers

This is the most straightforward instance of a QA system. It basically boils down to the classification of text based on question and context data.

Extractive question answering

In this approach, instead of creating a novel natural language answer, the system simply finds and returns a fragment of the analyzed text that contains the answer. These kinds of systems are robust against errors in text generation (they skip generation altogether). On the other hand, they can struggle if the answer is not stated in the text directly but only implied between the lines.
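
To make the idea of "finding a fragment" more concrete, here is a minimal sketch of one common formulation: the model assigns every token a start score and an end score, and the answer is the span with the highest combined score. The function name `best_span`, the toy tokens and all the scores below are assumptions made up for illustration; they do not come from a real model.

```python
def best_span(start_scores, end_scores, max_answer_len=10):
    """Return (start, end) token indices maximizing start_scores[s] + end_scores[e]."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # The end index may not precede the start, and the span length is capped.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy context and made-up scores (assumptions for illustration only).
tokens = ["persian", "is", "spoken", "in", "iran", "and", "afghanistan"]
start_scores = [0.1, 0.0, 0.2, 0.3, 2.5, 0.1, 1.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 1.0, 0.3, 2.8]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # iran and afghanistan
```

Because the answer is always copied from the context, such a system cannot produce text that is not there, which is both its strength and its limitation.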

Generative question answering

This is the most complex type of QA system: for every question, it generates a novel answer in natural language. Unfortunately, it requires much more computing power and engineering time than the extractive approach.


For this tutorial, we’ll implement a simple QA system that works on open-domain data and yields yes/no answers, using 🤗 transformers and PyTorch in Python. However, instead of training the model from scratch, we’ll fine-tune an already existing model (DistilBERT) for our task.

We’ll use the BoolQ dataset. It contains 12,697 examples of yes/no questions, and each example is a triplet of a question, an answer and a context (the passage of text on which the system bases its answer).

Since we are working with yes/no questions, our goal is to train a model that performs better than just picking an answer at random – this is why we must aim at >50% accuracy.
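
Random guessing only yields 50% when "yes" and "no" are equally frequent, so a stricter reference point is the accuracy of always answering with the most common label. The sketch below shows how such a majority baseline could be computed. The helper `majority_baseline_accuracy` and the toy labels are hypothetical and not taken from BoolQ.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical toy labels (True = "yes"), not taken from BoolQ.
toy_labels = [True, True, True, False, True, False, True, False]
baseline = majority_baseline_accuracy(toy_labels)
print(f"majority baseline: {baseline:.3f}")  # majority baseline: 0.625
```

Once the dataset is loaded, the same function could be applied to the `answer` column of a split to see which number the fine-tuned model really has to beat.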


For starters, we need to install the required Python packages, that is, PyTorch, scikit-learn, transformers and datasets. Note that these commands may not work for your setup. If you have any problems with them, please refer to the PyTorch and Hugging Face installation guides.

pip install torch
pip install datasets transformers
pip install scikit-learn

After the necessary installations, we can open our script, Jupyter notebook or Colab and start with the essential imports.

from datasets import load_dataset, load_metric
from transformers import DistilBertTokenizerFast
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import Trainer, TrainingArguments

Additionally, we need to specify a checkpoint of the model we want to fine-tune. In our case, it’ll be DistilBERT, which is a smaller, faster and lighter, yet still high performing version of the original BERT model.

checkpoint = "distilbert-base-uncased"

Data Preparation

First of all, we need to download our data. Fortunately, we can do it directly from the 🤗 hub by simply executing:

dataset = load_dataset("boolq")

where “boolq” is the name of our dataset. Printing `dataset` shows:

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 9427
    })
    validation: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 3270
    })
})

Above, we can see the structure of this particular dataset. It consists of two subsets: `train` and `validation`. Below we can see a single example:

{'answer': True,
'passage': 'Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), 
is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. 
It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan 
(officially known as Tajiki since the Soviet era), and some other regions which historically 
were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, 
a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.',
'question': 'do iran and afghanistan speak the same language'}

To begin data processing, we need to create a text tokenizer. Luckily, 🤗 transformers comes equipped with a number of pretrained tokenizers, which we can use out of the box:

tokenizer = DistilBertTokenizerFast.from_pretrained(checkpoint)

Since we have already defined `tokenizer`, we can now define a function that will perform the actual preprocessing. This function must tokenize and encode the input with the tokenizer as well as prepare the `labels` field.

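def tokenize_function(example):
    # Encode question and passage as a pair; truncate inputs that are too long for the model
    encoded = tokenizer(example["question"], example["passage"], truncation=True)
    # With batched=True, "answer" is a list of booleans; convert them to 0/1 labels
    encoded["labels"] = [int(a) for a in example["answer"]]
    return encoded

tokenized_datasets = dataset.map(tokenize_function, batched=True)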
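
With `batched=True`, the mapped function receives a dictionary of lists (one list per column) rather than a single example, which is why the labels are built with a list comprehension. The sketch below imitates that call with a hand-made batch. `fake_tokenizer` is a hypothetical stand-in that only splits on whitespace, so the snippet runs without downloading anything; the real tokenizer returns token ids instead.

```python
def fake_tokenizer(questions, passages):
    """Hypothetical stand-in for the real tokenizer: splits on whitespace."""
    return {"input_ids": [(q + " " + p).split() for q, p in zip(questions, passages)]}


def tokenize_batch(batch):
    encoded = fake_tokenizer(batch["question"], batch["passage"])
    # bool -> int: True becomes 1 ("yes"), False becomes 0 ("no")
    encoded["labels"] = [int(a) for a in batch["answer"]]
    return encoded


# A batch as a batched map call would pass it: one list per column.
batch = {
    "question": ["is the sky blue", "do fish fly"],
    "passage": ["The sky appears blue.", "Most fish swim."],
    "answer": [True, False],
}
encoded = tokenize_batch(batch)
print(encoded["labels"])  # [1, 0]
```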

Additionally, we need to define a data collator, which builds batches out of examples of different lengths by padding them to a common length.

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
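
Conceptually, padding a batch means extending every sequence to the length of the longest one in that batch and recording which positions hold real tokens. The following is a simplified pure-Python sketch of that idea, not the library's implementation. The helper `pad_batch`, the padding id `0` and the token ids are assumptions chosen for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths.
padded = pad_batch([[101, 2003, 102], [101, 2079, 3869, 4875, 102]])
print(padded["input_ids"][0])       # [101, 2003, 102, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than to one global maximum keeps short batches short, which saves computation.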

Model definition and training

Using pre-trained models with 🤗transformers is really easy. With only one line of code, we can download the weights of the model we want to fine-tune.

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Our next step is to define training arguments:

args = TrainingArguments("roberta-booql",

Note that the parameters above are not just an example. You may want to increase the batch size and the number of epochs if you have access to a more powerful machine than Colab’s standard K80, which I used.
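
To get a feel for how batch size and number of epochs affect the length of training, the sketch below counts optimizer steps for the 9427 training examples. The batch sizes and the epoch count are assumed values for illustration, not the settings used in this article, and the calculation assumes a single device with one update per batch.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one update per batch (no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


train_size = 9427  # number of rows in the BoolQ train split
for batch_size in (8, 16, 32):  # assumed values, not the article's settings
    print(batch_size, total_training_steps(train_size, batch_size, num_epochs=3))
```

Doubling the batch size roughly halves the number of steps per epoch, at the cost of more memory per step.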

The last thing we have to do before training is to put all the objects we defined earlier together into an instance of the `Trainer` class:

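trainer = Trainer(model, args, train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"], data_collator=data_collator)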

Now we can begin training by executing:

trainer.train()

After training (it took me ~20 min to complete), we can evaluate our model. To do that, we’ll generate predictions for the validation subset:

predictions = trainer.predict(tokenized_datasets["validation"])
y_pred = predictions.predictions.argmax(-1)
labels = predictions.label_ids
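
The model outputs one raw score (logit) per class for every example, and `argmax(-1)` picks the class with the highest score. The sketch below reproduces that step in plain Python on made-up logits and also shows how a softmax would turn the scores into probabilities. The values are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three examples; column 0 = "no", column 1 = "yes"
logits = [[0.2, 1.3], [2.0, -1.0], [-0.5, 0.4]]
probs = [softmax(row) for row in logits]
toy_pred = [row.index(max(row)) for row in logits]  # per-row argmax
print(toy_pred)  # [1, 0, 1]
```

Since softmax preserves the ordering of the scores, taking the argmax of the logits or of the probabilities gives the same predicted class.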

Now we’ll load the `accuracy` metric:

metric = load_metric("accuracy")

And finally, we have our performance:

metric.compute(predictions=y_pred, references=predictions.label_ids)
{'accuracy': 0.7327217125382263}
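
Accuracy itself is simply the fraction of predictions that match the reference labels. If the metric loader is not available in your environment, the same number could be computed by hand, as in this minimal sketch. The function `accuracy` and the toy lists are hypothetical; with the real data you would pass `y_pred` and `labels` instead.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)


# Toy values for illustration only.
toy_predictions = [1, 0, 1, 1]
toy_references = [1, 0, 0, 1]
print(accuracy(toy_predictions, toy_references))  # 0.75
```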

Not bad. An accuracy of 73% certainly leaves room for improvement, but for natural language understanding after 20 minutes of training, it’s a good start.



The model that we trained in this tutorial may not be the next big thing that redefines our views on AI (outside of trivia nights), but it certainly demonstrates the shift in perspective brought by big transformer-based models like BERT or GPT-3. Not so long ago, the only feasible way to add any QA functionality to your system was to tirelessly build a rule-based program that would work only for a set of predefined questions.

Today, if you have data, you can quickly build a solution that leverages the most recent advancements in NLU with few or no rule-based methods. The most obvious examples of today’s QA systems are the voice assistants developed by almost all tech giants, like Google, Apple and Amazon, which implement open-domain solutions with generated answers. Single-domain systems performing extractive QA are also gaining some traction, but of course, their use cases are much more niche. Additionally, as high-performance language models are becoming more accessible outside of “big tech”, we can expect many more instances of QA systems in our everyday life.

