AI Outperforms Humans in Question Answering: Demystifying Hype and Reality for IT Leaders


During the past few months, you may have come across news and media coverage such as the following:

CNET — AI beats humans in Stanford reading comprehension test

Bloomberg — Alibaba’s AI Outguns Humans in Reading Test

Dailymail — Alibaba’s AI outperforms humans in one of the toughest reading comprehension tests ever created in a remarkable world first

Washington Post — AI models beat humans at reading comprehension, but they’ve still got a ways to go

Wired — AI Beat Humans at reading! Maybe not

The Verge — No, machines can’t read better than humans

Many IT leaders wondered how these results would affect their near- and long-term decision making, and what they could learn from these studies.

What is it all about?

Automated Question Answering (AQA) and machine comprehension (MC) have recently gained strong momentum thanks to advances in Deep Learning, which has become an essential tool for NLP (Natural Language Processing) and NLU (Natural Language Understanding). Intelligent personal assistants like Apple’s Siri and Google Assistant are becoming an indispensable part of the user experience, with natural language interfaces enabling users to get answers to their questions and delegate various tasks to AI-powered software.

To support the development of state-of-the-art Machine Learning (ML) models for AQA and MC, a number of large datasets were created. These include Stanford’s SQuAD for automated question answering, MS MARCO for real-world question answering, TriviaQA for complex compositional answers and multi-sentence reasoning, the CNN/Daily Mail and Children’s Book Test datasets for cloze-style reading comprehension, and many more. Until very recently, however, the existing models failed to outperform human benchmarks in reading comprehension and question answering.

Then, at the beginning of 2018, we witnessed a dramatic breakthrough: ML models independently developed by Microsoft and Alibaba managed, by a small margin, to beat human performance on the SQuAD dataset developed by Stanford NLP Lab.

Within a month, others followed suit. A model from the Joint Laboratory of HIT and iFLYTEK Research, along with a model developed jointly by Microsoft Research Asia and NUDT, reached new heights in AQA. Although the discussed models haven’t yet managed to beat human performance on all metrics (humans still lead on the F1 metric), the pace of innovation in AQA is stunning.

SQuAD Leaderboard as of March 25, 2018

How does Automated Question Answering work?

Technically speaking, AQA models predict the best answer for a query (Q), given a passage (P) or a set of passages that contain the answer to that query. The task of an AQA model is to predict the best candidate answers by studying the passage and query interactively and evaluating various contextual relationships between them. With the exponential growth of Big Data and web documents, open-domain AQA that answers questions based on a large collection of documents is becoming the de facto standard in the field. This replaces earlier models, which used closed-domain question answering through custom-built ontologies dealing with a narrow segment of knowledge. Recently, a number of AI and ML technologies have powered the rapid advances in the open-domain AQA, the most important among them being RNNs and attention-based neural networks.


AQA models use various memory-based neural frameworks like RNNs (Recurrent Neural Networks) and their variants, such as LSTMs. RNNs are networks designed to deal with sequential information, such as sentences, where inputs are tightly coupled and/or depend on each other. By storing various parts of the sequence in the network’s memory, RNNs can model the contextual relationship between words, phrases, and sentences to enable better translation, information retrieval, and machine comprehension. RNNs and their variants have powered the ‘encoder-interaction-pointer’ framework underlying most contemporary AQA models. In this framework, word sequences of both query (question) and context (passage) are projected into distributed representations and encoded by recurrent neural mechanisms. The attention mechanism is then used to model the complex interaction between the query and the context. A pointer network may then be employed to predict the boundary of the answer.
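As a rough illustration of the last step in this framework, the sketch below picks an answer boundary from per-token start and end scores. The function name `best_span`, the `max_len` cap, and the hand-written scores are assumptions made purely for illustration. In a real system a pointer network would compute these scores from the encoded passage and query, and none of the winning systems is claimed to use exactly this search.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end, score) for the highest-scoring answer span.

    A span's score is start_scores[start] * end_scores[end], subject to
    start <= end and a span length of at most max_len tokens.
    """
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        # Only consider end positions at or after the start position.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s * end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Hypothetical passage and scores for the question "What was released in 2016?"
passage = "The Stanford Question Answering Dataset was released in 2016".split()
start = [0.70, 0.10, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02]
end = [0.02, 0.03, 0.05, 0.10, 0.70, 0.02, 0.02, 0.02, 0.04]

start_idx, end_idx, score = best_span(start, end)
answer = " ".join(passage[start_idx:end_idx + 1])
print(answer)
```

With these made-up scores the selected span covers tokens 0 through 4. The `start <= end` constraint is what makes the output a contiguous span of the passage rather than two unrelated positions.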

Attention-Based Neural Networks

To infer relationships from words that form a sequence, an ML model must convert words and their characters into embeddings, which are vectors of real numbers. These vectors capture lexical proximity between words and phrases in the multi-dimensional language space. Compressing a whole sentence into a single fixed-length vector, however, works poorly for long sentences. An alternative approach to natural language representation was proposed by Bahdanau, Cho, and Bengio (2016) for machine translation. In this approach, each time the model generates a word in a translation, it soft-searches for (attends to) a set of positions in the source sentence where the most relevant information is concentrated.

This idea is roughly analogous to how humans scan a text to find an answer. Human attention, essentially, focuses on certain parts of the input sentence and context one at a time. For example, when we read a text, we focus on the relevant paragraph and then on the relevant sentence, continuously refining the results of our search.

More formally, after searching for the most relevant places in the text, the model predicts a target word based on the context vectors associated with these source positions and all previously generated target words. In this way, there is no need to encode a whole input sentence into a single fixed-length vector. Instead, the model just encodes the input sentence into a sequence of vectors and then selects a subset of these vectors adaptively. This allows the model to better handle long sentences, and even passages, while retrieving better contextual information that can be relevant to the query.
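The selection step described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: it scores each source position with a simple dot product, whereas Bahdanau et al. score positions with a small feed-forward network, and the two-dimensional vectors and the names `softmax` and `attend` are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return (weights, context) for one decoding step.

    Scoring by dot product is an assumption made for brevity.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of the encoder states.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Three hypothetical source positions, each encoded as a 2-d vector.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # hypothetical decoder state

weights, context = attend(query, encoder_states)
print(weights)
print(context)
```

The first and third positions align with the query and receive most of the weight, so the context vector is dominated by them. The input is kept as a sequence of vectors, and a different weighting can be computed for every output step.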

Overview of the SQuAD Dataset and the Task

Three winning AQA models were trained on the Stanford Question Answering Dataset (SQuAD) proposed by Rajpurkar et al. (2016) in the Stanford NLP Lab. The dataset is a high-quality collection of data consisting of 100,000+ questions posed by crowd-workers on a set of randomly collected Wikipedia articles, where the answer to each question is a word or a text span from the corresponding reading passage.

Human and machine performance on the dataset is assessed by two metrics: Exact Match (EM) and F1 score. EM measures the percentage of predictions that match one of the ground truth answers exactly. F1 measures the average token-level overlap between the prediction and the ground truth answer, taking the best score across the available ground truth answers for each question. An F1 of 100% therefore means that every prediction coincides with a ground truth answer, while a partially correct span still earns partial credit. As of March 25, 2018, the benchmark human performance on the dataset was 82.304 for EM and 91.221 for F1.
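To make the two metrics concrete, here is a minimal sketch of how EM and F1 can be computed for a single question. The normalization steps (lowercasing, stripping punctuation and English articles) follow the approach commonly used for SQuAD-style evaluation, but the helper names and the example strings are assumptions for illustration, and this is not the official scoring script.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1(prediction, truth):
    """Token-level F1, treating both answers as bags of tokens."""
    pred = normalize(prediction).split()
    gold = normalize(truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_over_truths(metric, prediction, truths):
    """Score against every ground truth answer and keep the best result."""
    return max(metric(prediction, t) for t in truths)


# Hypothetical ground truth answers for one question.
truths = ["Stanford NLP Lab", "the Stanford NLP Lab"]

em_full = best_over_truths(exact_match, "The Stanford NLP lab.", truths)
em_partial = best_over_truths(exact_match, "Stanford", truths)
f1_partial = best_over_truths(f1, "Stanford", truths)
print(em_full, em_partial, f1_partial)
```

The partial answer "Stanford" scores zero on EM but still earns partial credit on F1, which is why the two numbers on the leaderboard differ. Dataset-level scores are the averages of these per-question values.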

One of the main benefits of the SQuAD is that it is large enough for data-intensive model training. Also, unlike many other datasets which are semi-synthetic and do not share the same characteristics as explicit reading comprehension questions, SQuAD captures a variety of question types that can be posed. Additional distinctive features of the dataset that might have affected the architecture of the winning models include:

  • SQuAD questions do not require commonsense reasoning or reasoning across multiple sentences.
  • SQuAD involves a span constraint that limits the scope of the answer to a single word or phrase in the passage. This is beneficial because span-based answers are easier to evaluate than free-form answers.
  • Crowd-workers were encouraged to ask questions in their own words, without copying word phrases from the paragraph, to allow for the syntactic diversity of questions regarding paragraph sentences.
  • The authors of SQuAD hypothesized that model performance will worsen with increasing complexity of answer types and with the growing syntactic divergence between the question and the sentence containing the answer.

In general, the SQuAD questions require competing models to account for a number of complex comprehension tasks and contexts, such as the difficulty of questions asked in terms of the type of reasoning required to answer them and the degree of syntactic divergence between the question and answer sentences. The proposed models thus need to account for lexical variations, such as synonymy and world knowledge, and syntactic variations. All three models managed to successfully cope with these challenges.

In a longer version of this article (AI Outperforms Humans in Question Answering: Review of three winning SQuAD systems), we looked at three models that beat humans on the SQuAD dataset and explored the ML technologies that enabled their improved performance in AQA and machine comprehension. Based on a detailed comparison of these models, we evaluated how well they address real-world consumer needs in the context of online self-service support and digital personal assistance.

We found that the field of AQA has matured enough to deal with factual question answering. However, further improvements are needed to address consumer needs for descriptive answers, how-to guides, troubleshooting, and other types of requests that require complex reasoning rather than simple single-word or span answers.

Strengths and Limitations of the SQuAD Winning Systems

All three models discussed in the aforementioned article dramatically improve state-of-the-art AQA by increasing the rate of correct answers, enhancing understanding of the context, and managing long-term contextual memory. The machine comprehension models for the single-word and span answers represented by these models have matured enough to be used in commercial products. However, use cases addressed in the SQuAD dataset and the winning models do not cover all possible scenarios and requirements of the automated question answering in the context of consumer support, online search, and digital personal assistance. These limitations include:

  1. All three winning systems are “ensemble” systems, not single-model systems. Their focus was on answer accuracy rather than on real-time performance or cost of deployment, both of which remain unaddressed.
  2. All three models are adapted to a situation where the answer always falls within one sentence. In the real world, however, the answer can span more than one sentence. Dealing with such cases is much more difficult since the existing models for sentence ranking currently underperform “answer span” ranking by a significant margin. This indicates that the exact span information is, in fact, critical in selecting the correct answer sentence. This limitation was explicitly mentioned in the R-Net+ authors’ directions for future research, and it constitutes a tangible problem for multi-sentence reasoning in AQA.
  3. Many questions in the SQuAD are quite diverse in terms of reasoning required and syntactical divergence between the question and the answer. For example, the dataset involves question-answer pairs with lexical variation (synonymy and world knowledge), syntactic variation from the passage sentences, and partial multi-sentence reasoning. Still, most questions in the dataset are factual questions, which actually constitute a small part of questions asked by consumers and Internet users. In particular, when it comes to support services, people tend to ask descriptive questions and questions asking for the steps to solve a problem. Such types of questions are relevant for a number of use cases, such as:
  • Support deflection: When a user is opening a ticket, AQA software must be able to render solutions from manuals, user guides, discussion forums, and existing tickets.
  • Guided troubleshooting: When a user describes a problem, the system then guides the user through trying various steps one-by-one, with the user entering findings after each step.

Answering such questions requires deep world knowledge, multi-sentence reasoning, and other approaches that are not currently addressed in the SQuAD dataset and described models. To create even better AQA models, we need a combination of what these winning systems have shown, along with automated question/answer generation and models where answers are constructed from several sentences and/or paragraphs.


What has improved is one tool in the toolkit of production-ready, natural language understanding systems.

Building a question answering system solely using deep learning techniques is still out of the question.

Do not expect question answering systems to be 100% accurate. Even on the presumably simpler SQuAD dataset, humans scored only just above 82% on Exact Match.

