What makes a great Natural Language Understanding Engine? For starters, the ability to reliably identify the correct user intent in a given input. At the same time, the machine should make very few “false positive” mistakes – the error of incorrectly detecting an intent when none was expressed at all.
How to compare NLU engines?
But reliability and accuracy are not the only things that make a good NLU engine: training an NLU can take a lot of time and effort. Consequently, the fewer examples needed to train the machine the better, so the few-shot learning ability of the NLU should also be considered.
A method to assess and compare NLU engines is to test a trained model on new inputs it has not seen before. A suitable approach is to construct a hold-out test set of utterances through random selection, where the correct intent label is known for each utterance.
More training = higher performance
To evaluate few-shot learning ability, the NLU may be trained on only a handful of example sentences. The fewer sentences the machine has to train on, the worse one can expect it to perform. The question is whether performance degrades gracefully and remains useful in practice, or drops off like a cliff.
To conduct a benchmark test without human biases in the data set, we at Cognigy use an independent data set compiled by researchers at Heriot-Watt University. It contains more than 10,000 utterances around home automation. Details on the research are published in the paper “Benchmarking Natural Language Understanding Services for Building Conversational Agents” (2019). The data is available on GitHub.
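For readers who want to explore the corpus themselves, a minimal sketch of loading it with pandas might look as follows. The file name, delimiter and column names (`intent`, `answer`) are assumptions about the published export and may need to be adjusted to the version you download.

```python
import pandas as pd

# Assumption: the Heriot-Watt corpus is available as a single CSV with
# (at least) an intent label column and an utterance text column.
CSV_PATH = "NLU-Data-Home-Domain-Annotated-All.csv"  # hypothetical local copy

# Adjust sep= if the export uses a different delimiter.
df = pd.read_csv(CSV_PATH, sep=";")

# Keep only the columns needed for intent classification.
# Column names are assumptions; adjust to the actual header row.
df = df[["intent", "answer"]].dropna().rename(columns={"answer": "utterance"})

print(f"{len(df)} utterances across {df['intent'].nunique()} intents")
```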
Creating an unbiased benchmark
In our test we used the Heriot-Watt data to compare the NLU platforms Microsoft LUIS, Google Dialogflow and IBM Watson against Cognigy NLU.
In detail, for each of the 64 different intents we randomly picked 10 example sentences and used them to train the NLU. We then tested on 1,076 examples that were not in the training set. The process is visualized in the graphic.
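As an illustration, here is a minimal sketch of that per-intent split, assuming the corpus has been loaded into (utterance, intent) pairs as above. It is not the exact tooling used for the benchmark, just the sampling logic.

```python
import random
from collections import defaultdict

def few_shot_split(examples, n_train_per_intent=10, seed=42):
    """Randomly pick n training sentences per intent; the rest become test data.

    `examples` is an iterable of (utterance, intent) pairs.
    """
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for utterance, intent in examples:
        by_intent[intent].append(utterance)

    train, test = [], []
    for intent, utterances in by_intent.items():
        rng.shuffle(utterances)
        train += [(u, intent) for u in utterances[:n_train_per_intent]]
        test += [(u, intent) for u in utterances[n_train_per_intent:]]
    return train, test

# 10 training sentences per intent, everything else held out for testing:
# train_set, test_set = few_shot_split(corpus, n_train_per_intent=10)
```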
To compare results for different numbers of training sentences, we constructed a second scenario with 30 training sentences per intent.
All raw data is available on GitHub and can be used to replicate the measurements.
What are the results?
Here are the results (all tests performed August 2020):
What do the numbers mean?
An accuracy score of .751 means that roughly 75% of test sentences were matched to the correct intent. Since the data set is purposely designed to challenge state-of-the-art NLU engines, 75% is a good result. There are many overlapping and challenging intents, and most of the time the NLU understands the correct topic, such as music, but fails to distinguish whether the user wants to turn it off or on, skip a song, and so on. This is one of the reasons Cognigy introduced Intent Hierarchies, where intents can be grouped by semantic topic to resolve such hierarchical recognition challenges.
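For clarity, the accuracy figure is simply the share of held-out test sentences whose predicted intent matches the labelled intent. A minimal sketch, assuming a hypothetical `predict_intent` callable that wraps whichever NLU engine is under test:

```python
def accuracy(test_set, predict_intent):
    """Fraction of test utterances whose predicted intent equals the gold label.

    `test_set` is a list of (utterance, gold_intent) pairs;
    `predict_intent` is a hypothetical callable wrapping the NLU under test.
    """
    correct = sum(1 for utterance, gold in test_set
                  if predict_intent(utterance) == gold)
    return correct / len(test_set)

# An accuracy of 0.751 means roughly 75% of the 1,076 test sentences
# were matched to the correct intent.
```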
We repeated the process with about 30 example sentences per intent and a total of 5,518 test sentences.
Unsurprisingly, intent recognition improves with more training data. However, in a real-life scenario one would have to write almost three times as many example sentences to surpass the accuracy already achieved with 10 training sentences per intent.
Zooming into the data
Let’s take a look at the details and focus on one specific example from the data set.
The example sentence is: “are there any tornado warnings today”. The true intent that all engines should recognize is “weather_query” - a user asking for the weather forecast.
Here are individual results for the NLU engines mentioned above:
The table depicts the recognized intent and score for the example sentence. As before, the score reflects the engine’s confidence in its prediction. Dialogflow, LUIS and Watson predict the wrong intent, and they do so with relatively modest confidence, correctly indicating the uncertainty of the model. In this particular example, Cognigy recognizes the intent correctly and with relatively high confidence.
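Confidence scores like these are typically what guards against the false positives mentioned at the start: below a chosen threshold, a bot should fall back rather than act on the predicted intent. Here is a minimal sketch of such a reject rule; the threshold value and the prediction format are illustrative assumptions, not Cognigy defaults.

```python
CONFIDENCE_THRESHOLD = 0.7  # illustrative value; tune per project

def resolve_intent(prediction):
    """Fall back to 'no intent' when the engine is not confident enough.

    `prediction` is assumed to look like {"intent": "weather_query", "score": 0.83}.
    """
    if prediction["score"] < CONFIDENCE_THRESHOLD:
        return None  # treat as "no intent found" rather than risk a false positive
    return prediction["intent"]
```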
A look behind the scenes
Although the nature of machine learning algorithms and the randomness ingrained in them should make us cautious, we might venture an interpretation of the results in this case: LUIS has not much to go on and likely associates “today” with a calendar query. Watson and Dialogflow, in contrast, interpret the phrase along the lines of “any traffic jam warnings today for me?”, which seems sensible enough but ignores the reference to “tornado”, a word that does not appear in the training sentences and would therefore have to be familiar to the NLU from elsewhere.
What drives Cognigy's performance?
The Cognigy NLU, in contrast, does pick up on “tornado”. Not only does it capture the association with weather (a capability without which the results from the other NLU vendors could not be explained); it also weighs the importance of this word against competing signals for other intents in context, arriving at the correct outcome in this instance.
Note the emphasis on “in this instance”: the technology is still far from a deep concept of a tornado, weather and the like. Its non-linear workings are the product of a neural network, however, and such networks are becoming more and more able to capture elements of the rich meaning encoded in our language.
Curious about seeing Cognigy NLU in action? Start a free trial and follow our onboarding tutorials to explore Cognigy's leading-edge technology yourself.