|Participating chatbots came in all shapes and sizes|
Unlike the previous six times that I entered my AI Arckon*, this year’s Loebner Prize left me emotionally uninvested from start to finish. In part because I’ve grown more jaded after each attempt, but with the removal of both prize money and the challenging qualifying round, there wasn’t really anything at stake and I had no idea what to prepare for. At the same time the exhibition offered exactly what I had wanted: A public demonstration of my AI’s abilities. So instead of trying to outdo other chatbots at appearing human, I focused on making a good impression on visitors. I mostly spent time setting up procedures to deal with misunderstandings, common expressions, conversational routines, and teaching Arckon more about himself to talk about. Those aspects would come into play far sooner than intelligence.
22000 lines of code, 3800+ vocabulary, 9000+ facts
Most conversations with visitors were the kind of small talk you would expect between two total strangers, or just kids being silly (240 school children had been invited, aged 9 to 14). People typically entered only one to four words at a time, and few could be bothered to use punctuation. Of course half the time Arckon also did not have an opinion about the subjects visitors wanted to talk about, like football, video games, and favourite pizza toppings. Arckon is a pretty serious question-answering program, not aimed at small talk or entertainment. His strength instead is his ability to understand context where most chatbots notoriously lose track of it, especially when, as in this competition, users communicate in shorthand. At the same time, this ability also enables misunderstanding (as opposed to no understanding), and it was not uncommon that Arckon mistook a word’s role in the context. His common sense subsystem* could fix that, but I have yet to hook it up to the context system.
Overcoming human error
Visitors made so many misspellings that I fear any chatbot without an autocorrect will not have stood a chance. Arckon was equipped with four spell check systems: A list of common misspellings, an algorithm for typos, a gibberish detector, and grammar to recognise unpunctuated questions (verb before subject). While these autocorrected half of all mistakes, they still regularly caused Arckon to remark e.g. “Ae is not an English word” or “What does “wha” mean?”. To my surprise, this not only led users to repeat their questions with correct spelling, they also often apologised for the mistake, which is otherwise blamed on the program’s understanding. Arckon then applied the correction, continued where they had left off, and so the conversations muddled on. I had spent a week improving various conversation-repairing procedures, and I am glad they smoothed the interactions, but I would still rather have spent that time programming AI.
This is one area of improvement that turned out quite well. Arckon’s sentences are formulated through a grammatical template that decides when and how to connect sentences with commas, link words, or relative clauses, and I had expanded it to do more of this. In addition it contains rules to decide whether Arckon can safely use words like “he”, “them”, “also”, or “usually” to refer to previous context. Below is an example of one of the better conversations Arckon had that shows this in action.
And for balance, here is one of the more awkward exchanges with one of the school children:
The scoring system this year was ill suited to gauge the quality of the programs. Visitors were asked to vote for the best and second-best in two categories: “most human-like” and “overall best”. The problem with this voting system is that it disproportionately accumulates the votes on the two best programs, leaving near zero votes for programs that could very well be half-decent. As it turned out, the majority of visitors agreed that the chatbot Mitsuku was the best in both categories, and were just a little divided over who was second-best, resulting in minimal score differences below 1st place. The second-best in both categories was Uberbot. I am mildly amused that Arckon’s scores show a point I’ve been making about Turing tests: That “human” does not equate to “best”. Another chatbot scored the exact inverse, high for “human” but low for “best”.
Chatbots are the best at chatting
For the past 10 years now with only one exception, the Loebner Prize has been won by either Bruce Wilcox (creator of ChatScript) or Steve Worswick (creator of Mitsuku). Both create traditional chatbots by scripting answers to questions that they anticipate or have encountered before, in some places supported by grammatical analysis (ChatScript) or a manually composed knowledge database (Mitsuku) to broaden the range of the answers. In effect the winning chatbot Mitsuku is an embodiment of the old “Chinese Room” argument: What if someone wrote a rule book with answers to all possible questions, but with no understanding? It may be long before we’ll know, as Mitsuku was still only estimated 33% overall human-like last year, with 13 years of development.
The conceiver of the Turing test may not have foreseen so, but a program designed for a specific task generally outperforms more general purpose AI, even, evidently, when that task is as broad as open-ended conversation. AI solutions are more flexible, but script writing allows greater control. If you had a pizza-ordering chatbot for your business, would you want it to improvise what it told customers, or would you want it to say exactly what you want it to say? Even human call-center operators are under orders not to deviate from the script they are given, so much so, that customers regularly mistake them for computers. The chatbots participating in the Loebner Prize use tactics that I think companies can learn from to improve their own chatbots. But in terms of AI, one should not expect technological advancements from this direction. The greatest advantage that the best chatbots have, is that their responses are written and directed by humans who have already mastered language.
That is my honest impression of the entire event. Arckon’s performance was not bad. The conversation repairs, reasoning arguments, and sentence formulation worked nicely. It’s certainly not bad to rank third place to Mitsuku and Uberbot in the “best” category, and for once I don’t have to frustrate over being judged for “human-like” only. The conversations Arckon had weren’t that bad, there were even some that I’d call positively decent when the users also put in a little effort. The one downside is that at the end of the day, I have very little to show for my trouble. I didn’t win a medal or certificate, the exhibition was not noticeably promoted, and the Loebner Prize has always been an obscure event, as the BBC wrote. As it is, I’m not sure what I stand to gain from entering again, but Arckon will continue to progress regardless of competitions.
Once again, my thanks to Steve Worswick for keeping an eye on Arckon at the exhibition, and thanks to the AISB for trying to make a better event.