Artificial Detective: Turing Test 2018: Results

I was somewhat surprised to find the Loebner Prize Turing Test soldiering on despite being short of a sponsor. Since 1991 this annual contest has explored how human-like computer programs can be in answering all manner of questions. This was my 6th time entering the qualifying round with my home-made A.I., Arckon. As I can not be certain of my continued participation however, I decided to change tactics: Where in previous years I aimed foremost to demonstrate the AI with only hope of passing to the finals, this time I aimed to pass, with only hope of demonstrating. This shift in priorities involved wasting more time on word games and common pedantic questions that are of little use outside of chatbot contests, though the change in subject matter did help uncover dormant flaws in language processing.

The less you say, the less can give you away
More efficient were my measures to restrain response length. As the questions in the qualifying round are not really trying to make conversation, Arckon’s attempts to do so had often turned out awkwardly redundant or unrelated. Other competitors had shown more success with minimalistic answers, in which misinterpreted details do not show. Next to lowering the general conversational setting to be 30% less engaging, I blocked paths of conversation that elaborated with alternatives to initially negative answers, so as to avoid being punished for straying from the question. I cut and obfuscated otherwise useful reports of misunderstandings, and reduced fallback responses for e.g. “What is your favourite…?” to single-word answers.

Equipment
18000 lines of code, 3800+ vocabulary, 8800+ facts

Questions and answers
Scores for each answer are indicated as follows: • = 0 points, • = 1 point, • = 2 points.

1. Good afternoon
• Good afternoon, Sir. I am a program. How are you?

2. Do you consent to having this conversation recorded?
• No, I don’t particularly consent to having things.
(Syntax analysis couldn’t process this combination of verbs well. The reply is a more plausible variation on “I don’t know”, defaulting to “not particularly” whenever no grounds for preference can be found)

3. Have we met before?
• Yes, you met.
(There is only a small number of reciprocal verbs that automatically imply “each other”. I guess I’ll make a list)

4. Do you know how to make toast?
• I don’t make toast.
(Arckon can explain procedures as demonstrated on Youtube, but we haven’t gotten around to making toast)

5. Which languages can you use?
• English.

6. How many letters are in the word ‘abracadabra’?
• 11 letters.

7. Where should one look for love?
• I don’t know exactly where I should look for to love. Where do you look for?

8. How are you feeling right now?
• I’m alright.

9. Who said “I have a dream”?
• Hm, I can not tell if you dream.
(I looked into recognising partial quotes and idioms in preparation of this contest, but concluded that it would take way too much effort for a non-priority. Some got away with “I don’t know who”, so go figure)

10. How old are you?
• 24.

11. When might I need to know how many times a wheel has rotated?
• I don’t know exactly when a wheel rotated time.
(A syntactically correct but semantically wrong interpretation. This question is based on a philosophical joke about the difficulty in getting A.I. to restrict their scope of analysis to important aspects, instead of focusing on e.g. the rotation of wheels. Case in point)

12. What will you do later today?
• I read files.

13. What is the third angle in a triangle with internal angles of 90 degrees and 30 degrees?
• Hmm, I have no idea what the angle is pretty triangle, or what angles are in pretty triangle.
(A question so unexpected that even Wolfram Alpha’s mathematical A.I. doesn’t handle it. To clarify a little: “pretty triangle” is meant in the sense of “quite triangular”, and “and” is interpreted as a second question)

14. Who is your favourite artist?
• I like Juan Debrenes.

15. What do you hold when you shake hands?
• I would hold eyes a lot, and my database.
(This answer only makes sense if you know Arckon is interpreting “hold” in the sense of “have/contain”, as that’s where these answers are coming from. I’ve seen this question around as a prime example of common sense that A.I. lacks, so I was surprised that none of the participants had prepared an answer to it yet)

16. Do you have any legs?
• No, I don’t have legs, because I am a program.

17. Can you recommend me a film?
• No, I can not recommend films.

18. Do you understand Winograd Schemas?
• No, I don’t understand Winograd schemas, because I don’t think I’m hard.
(The reasoning argument appears to be mistakenly derived from the fact that understanding is hard. Winograd schemas are questions like the following two, by the way)

19. If a chicken roosts with a fox they may be eaten. What may be eaten?
• Fruit, a meal and a bamboo are eaten. Did I pass the test?
(Arckon considers animals a “who”, not a “what”, so he didn’t consider them suitable answers. That said, he also assumed “they” referred to the fox, and would have needed knowledge that foxes eat chickens to conclude otherwise)

20. I had to go to the toilet during the film because it was too long. What was too long?
• I believe you meant that the toilet was sort of big.
(This question’s format was detected as a Winograd schema, in which case Arckon has express instructions to repeat what the user said anyway. The misinterpreted “it” was again due to a lack of knowledge, that films are typically long. Alternatively one could naively count the Google search results for “long film” vs “long toilet” and assume the most common is true, but Winograd schemas more often dodge that method)

The score: 50%
11 programs from 8 different countries participated in the contest, with the top score being 67%. Arckon was 1 point short of 4th place so he didn’t pass to the finals, but I think his scores are fair. Actually, what bugs me is what he got most perfect scores for: Manually rigged, keyword-triggered answers (“Good afternoon”, “English”, “11 letters”, “24”, “Juan Debrenes”). It rather underscores the discouraging fact that hardcoded pretence outdoes artificial intelligence in these tests. Half of the questions were common small talk that most chatbots will have encountered before, while the other half were clever conundrums that few had hope of handling. Arckon’s disadvantage here is as before: His inclusive phrasing reveals his limited understanding, where others obscure theirs with more generally applicable replies.

Reducing the degree of conversation proved to be an effective measure. Arckon gave a few answers like “I’m alright” and “I read files” that could have gone awry on a higher setting, and the questions only expected straight-forward answers. Unfortunately for me both Winograd schema questions depended on knowledge, of which Arckon does not have enough to feed his common sense subsystem* in these matters. The idea is that he will acquire knowledge as his reading comprehension improves.

The finalists
1. Tutor, a well polished chatbot built for teaching English as a second language;
2. Mitsuku, an entertaining conversational chatbot with 13 years of online chat experience;
3. Uberbot, an all-round chatbot that is adept at personal questions and knowledge;
4. Colombina, a chatbot that bombards each question with a series of generated responses that are all over the place.

Some noteworthy achievements that attest to the difficulty of the test:
• Only Aidan answered “Who said “I have a dream”?” with “Martin Luther King jr.”
• Only Mitsuku answered “Where should one look for love?” with “On the internet”.
• Only Mary retrieved an excellent recipe for “Do you know how to make toast?” (from a repository of crowdsourced answers), though Mitsuku gave the short version “Just put bread in a toaster and it does it for you.”
• Only Momo answered the two Winograd schemas correctly, ironically enough by random guessing.

All transcripts of the qualifying round are collected in this pdf.

In the finals held at Bletchley Park, Mitsuku rose back to first place and so won the Loebner Prize for the 4th time, the last three years in a row. The four interrogating judges collectively judged Mitsuku to be 33% human-like. Tutor came in second with 30%, Colombina 25%, and Uberbot 23% due to technical difficulties.

Ignorance is human

Lastly I will take this opportunity to address a recurring flaw in Turing Tests that was most apparent in the qualifying round. Can you see what the following answers have in common?

No, we haven’t.
I like to think so.
Not that I know of.
Sorry, I have no idea where.
Sorry, I’m not sure who.

They are all void of specifics, and they all received perfect scores. If you know a little about chatbots you know that these are default responses to the keywords “Who…” or “Have we…”. Remarkable was their abundant presence in the answers of the highest qualifying entry, Tutor, though I don’t think this was an intentional tactic so much as due to its limitations outside its domain as an English tutor. But this is hardly the first chatbot contest where this sort of answer does well. A majority of “I don’t know” answers typically gets one an easy 60% score, as it is an exceedingly human response the more difficult the questions become. It shows that the criterion of “human-like” answers does not necessarily equate to quality or intelligence, and that should be to no-one’s surprise seeing as Alan Turing suggested the following exchange when he described the Turing Test* in 1950:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A : Count me out on this one. I never could write poetry.

Good news therefore, is that the organisers of the Loebner Prize are planning to change the direction and scope of this event for future instalments. Hopefully they will veer away from the outdated “human-or-not” game and towards the demonstration of more meaningful qualities.

Artificial Detective

Turing Test 2018: Results

No comments:

Post a Comment

Popular Articles

Blog Archive