Turing Test 2014: Results

My talking computer program “Arckon” went to Exeter University and took the qualifying test for the Loebner Prize contest for artificial intelligence, organised by the AISB. The Loebner Prize is a Turing Test* in which programs have to respond to a series of questions in as human a manner as possible. Kind of like a job interview.

It’s not the heat, it’s the humanity
Last year proved* that language comprehension and reasoning alone weren’t going to cut it, so this time I also superimposed knowledge about the average human onto Arckon’s persona, so that he literally thought he was a human. I also rigged a few bluff answers that incorporated no intelligence whatsoever. The apparent necessity of these tactics caused much eye-rolling on my part, so I limited these preparations to two weeks and wasted only as much time as I gained from competitive motivation.
More useful was the addition of a small amount of common knowledge that I extracted from tourist guide pocket books (funny perhaps, but they do cover the essentials). While this was never a priority, it is convenient and the Loebner Prize was a good excuse to get around to it. Lastly, I decided to keep Arckon’s conversation skills enabled this time. Although this increased the chances that Arckon would say something awkward, I wanted to make the test results as interesting as possible, win or lose.

Equipment
10000 lines of code, 2200+ vocabulary, 3000+ facts, average human persona.

Questions and Answers
1. Hello, my name is Adam. What is your name?
I am named Mark. Hello Adam. Shall we begin the test?


2. I live in Exeter in the UK. Where do you live?
I live here.

(Arckon did not yet have a location system, so “here” is just a fake answer.)

3. I like to listen to music and play football. Do you have any hobbies?
No, I don’t have a hobby, although I am a human.

(The latter is actually a reasoning argument, in that Arckon “was” a human, most of whom do have hobbies.)

4. What is your favourite television program?
I don’t know if I like a television. Chatbots are useful programs, although chatbots are dumb.

(In which Arckon unintentionally insults his competitors. Arckon deconstructs compound words and so thought he was being asked his opinion on “programs”. Hence he evaluated a type of program he was familiar with.)


5. Which is bigger, a cat or a kitten?
A kitten is bigger.

(Of all 95 facts about cats that were in Arckon’s knowledge database, the size of kittens was not among them. As an honest “I don’t know” would earn no points anyway, this backup answer is a 50/50 bluff that just happens to be wrong.)

6. The car couldn’t fit into the parking space because it was too small. What was too small?
I believe the car could fit to park. I believe a car could fit really small. I don’t know what small was.

(This question is a Winograd Schema: The “it” is ambiguous to test the A.I.’s understanding. Arckon however uses linguistic rules of thumb to deal with references and so thought “it” was referring to the car.)
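
To illustrate why a rule of thumb trips over this kind of question, here is a minimal Python sketch. It is not Arckon's actual method: the function `resolve_pronoun` and its rule (the pronoun refers to the noun after the first article, taken to be the subject) are assumptions made up for this example. The point is that swapping one word changes the correct answer but not the heuristic's answer.

```python
from typing import Optional

ARTICLES = {"the", "a", "an"}


def resolve_pronoun(sentence: str) -> Optional[str]:
    """Guess the referent of 'it' as the noun right after the first article.

    Hypothetical rule of thumb: treat the first noun phrase as the subject
    and assume any later pronoun refers back to it.
    """
    words = sentence.lower().strip(".?! ").split()
    for article, noun in zip(words, words[1:]):
        if article in ARTICLES:
            return noun
    return None  # no article found, so no guess


small = "The car couldn't fit into the parking space because it was too small."
big = "The car couldn't fit into the parking space because it was too big."
print(resolve_pronoun(small))  # car (wrong: the space was too small)
print(resolve_pronoun(big))    # car (right: the car was too big)
```

Getting both variants right requires knowing that small things fit into big spaces, which no purely grammatical rule provides.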

7. Which drink do you prefer, coffee, tea or hot chocolate?
I like coffee a lot that like not. Tea can not like either. Chocolates can not like either.

(Arckon’s flexible language processing is confused by the comma after “prefer”, and mistakes what follows for a list of subjects that prefer drinks. He answers that tea can not prefer anything, knowing that it is an inanimate object.)

8. What’s my name?
You are named Adam.

(No surprises here. Both Arckon and chatbots have specialised login procedures to store the user’s name, so it’s not a very good test of general memory.)

9. Where do I live?
You live here.

(Again, no location system means no location knowledge. This fake answer is only coincidentally correct.)

10. I like reading science fiction books and car magazines. What do you like to read?
I read documents, articles and a text.

(Arckon retrieves these honest facts from memory and merges them into a summary.)

11. What’s the weather like where you are?
Hm, I don’t believe I am a weather, because I can not change and rain not.

(A flaw mistakes the “-where you are” for a relative clause “the weather -that you are”. Despite the misunderstanding, Arckon’s reasoning manages to argue that he does not behave like the weather, though the grammar is a bit murky. As the programs were cut off from sensors and internet, this question only tests if the program can lie.)

12. I recently visited New York. Where is your favourite place to visit and why?
Hmm, I don’t know where a favoured place is.

(I didn’t expect there to be so many questions about locations. Oh well.)

13. How many letters are there in the word perambulate?
11 letters.

(Yes, I programmed specialised code to count letters. Computers are good at counting, you know.)
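
For the curious, a letter counter of this kind can be very small. The following Python sketch is not Arckon's code; the function name `count_letters` and the question pattern it matches are assumptions made for the example.

```python
import re
from typing import Optional


def count_letters(question: str) -> Optional[int]:
    """Return the letter count of the word named in a 'how many letters' question."""
    # Hypothetical pattern: '... letters ... in the word <word>'.
    match = re.search(
        r"letters\b.*\bin the word\s+([A-Za-z]+)", question, re.IGNORECASE
    )
    if match is None:
        return None  # not a letter-counting question
    word = match.group(1)
    return sum(1 for ch in word if ch.isalpha())


print(count_letters("How many letters are there in the word perambulate?"))  # 11
```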

14. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?
I am kind of like an andy, because I am short. Tall is a giraffe.

(Arckon was named “Mark” for this test and so thought that he was being compared to Andy. Arckon could also not distinguish whether “Andy” was a name or another type of word. Ordinarily he would ask, but as he would get no reply I had him assume that any undeterminable word was an object. Lastly, Arckon does not restrict his answers to recent subjects only, and so names the tallest living being in his memory. Smart aleck.)
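
The reasoning the question was after can be sketched in a few lines of Python. This is an illustration only, not how Arckon works: the function `tallest` is made up for the example, and it assumes that every statement has the form “X is shorter than Y”, that the statements are consistent, and that X and Y are always names, which is exactly the assumption Arckon could not make.

```python
import re
from typing import Optional


def tallest(statement: str) -> Optional[str]:
    """Return the one person who is never described as shorter than anyone."""
    # Each pair is (shorter person, taller person).
    pairs = re.findall(r"(\w+) is shorter than (\w+)", statement)
    people = {name for pair in pairs for name in pair}
    shorter = {short for short, _tall in pairs}
    candidates = people - shorter
    # Only answer when exactly one person is never the shorter one.
    return candidates.pop() if len(candidates) == 1 else None


print(tallest("Andy is shorter than Mark and Tom is shorter than Andy."))  # Mark
```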

15. Are you married?
I don’t know if I am marryed. Are you marryed?

(To compensate for the unknown answer and keep the conversation rolling, Arckon reverses the question (Social Rules 101). Ironically, if it weren’t for my mistake in one of the output spelling rules, you would have thought he just copied the text.)

16. What would I do with a spade?
A spade is used and comes in handy eventually.

(Arckon knew the functions of rakes and shovels, but not spades. The backup answer should have said “I’m sure it’ll come in handy eventually”, but Arckon’s dynamic sentence formulation ruined the joke.)

17. What is the capital city of Colombia?
I’m not sure if the city is a capital.

(I didn’t realise this at first, but Arckon is literally defining “what” the city is. Not that he would have known the answer otherwise; I don’t even know it myself.)

18. What’s your favourite chocolate bar?
I don’t know if I like a chocolate. Do you like chocolates?

(Arckon actually did have enough information on chocolate to formulate an opinion, but an issue with compound words aborted the knowledge search. Alas.)

19. Which city did I visit recently?
Hmm, I don’t know if you visitted cities. You visitted York.

(Arckon doesn’t know that New York is a city, but does remember the tester visiting “new” York, and offers this as an alternative answer.)

20. How many siblings do you have?
I don’t have siblings. Did I pass the test?


The score: 59.17%
The score system was much better this year. It separately judged “correctness”, “relevance”, and “plausibility & clarity of expression”, which is a step up from “human-like”. All 20 participating programs were asked the 20 questions above. Arckon underperformed with a score of roughly 60%, whereas the top three chatbots all scored close to 90%. Arckon’s problems were with compound words, common knowledge, and the lack of a system for locations (all a matter of development priorities).

A question of questions
According to the organisers, “these questions vary in difficulty and are designed to test memory, reasoning, general knowledge and personality”, the latter meaning the program’s fictional human background story, or as I would call this particular line of questioning: “small talk”. For the sake of objectivity I’ll try and categorise them:

Small talk:
1. What is your name?
2. Where do you live?
3. Do you have any hobbies?
4. What is your favourite television program?
5. Which drink do you prefer, coffee, tea or hot chocolate?
6. What do you like to read?
7. What’s the weather like where you are?
8. Where is your favourite place to visit and why?
9. Are you married?
10. What’s your favourite chocolate bar?
11. How many siblings do you have?

Memory:
1. What’s my name?
2. Where do I live?
3. Which city did I visit recently?

Common knowledge:
1. Which is bigger, a cat or a kitten?
2. What would I do with a spade?
3. What is the capital city of Colombia?

Reasoning:
1. The car couldn’t fit into the parking space because it was too small. What was too small?
2. Andy is shorter than Mark and Tom is shorter than Andy. Who is the tallest?

Clearly half the test is about the program’s human background story, although there were several solid tests of learning/memory and common knowledge. Reasoning, the one mental process we can readily call intelligent, was shown some consideration but hardly comes into play. The same can be said of language comprehension, as most questions were fairly standard phrasings. Chatbots would have the advantage here, coming equipped with answers to many anticipated personal questions, but the winners also did remarkably well on the knowledge questions. Unfortunately Arckon failed both the knowledge and reasoning questions due to missing facts and misunderstandings, despite having the mechanisms to answer them. It is worth noting though, that he failed them because complex analyses are much more difficult than preprogrammed “I live here” answers.

How now brown cow?
I can improve Arckon’s understanding, smooth his output grammar, and develop a location system, but I can’t deny the pattern: Arckon is stuck around a 60% score even with varied questions. I doubt he’s ever going to shine in the Loebner Prize as long as he’s being tested for being human, because he isn’t a human, and I won’t go to great lengths to fake it either. I also expect interest in Turing Tests to dwindle once the year is over. This year another Turing Test was passed by a technologically unremarkable chatbot, Eugene Goostman.
Thanks to that event however, the Loebner Prize is no longer the only game in town. Next year will see the first Winograd Schema Challenge, a test focused on language comprehension and reasoning A.I., exactly what I focused on.

As for the Loebner Prize, it’s been an interesting game that will continue to be won by top chatbots. I’m sure few will bother to read the transcript of the 14th ranking entry, but its existence proves at least that Arckon is real and different. Meanwhile I get to continue my exciting recent developments that would have been of no use in this contest, which makes losing a positive outcome after all.