For the third time I entered my A.I. “Arckon” into the Loebner Prize Turing Test,
a competition in which computer programs are to prove they can “think”
by answering questions as if they were human. Although such pretence is
counterproductive to my purposes, I have a talking program lying around
anyway, so I might as well enter it.
An excuse to clean up
I took the opportunity to spend a month debugging my code overall. In
doing so I fixed several oversights that had been obstructing knowledge
searches and reference resolutions (“the capital of
Spain”), and I generally improved the program’s ability to answer open
questions, which ‘coincidentally’ covered “Which X do you like?”
questions as well.
I limited the contest-specific preparations to 5 days: Mostly testing
the Loebner Prize interface, restraining the context range to keep the
program from mistaking the test for coherent conversation, and
re-installing the obligatory fakery.
As an extra I added 4 more reference resolution methods for the anticipated Winograd Schemas
(sentences with ambiguous pronouns), which was a fun challenge, and
improved Arckon’s understanding to a minor degree. As Winograd Schemas
are supposed to require intelligent processes, I did not want to see the
usual chatbot keyword-matching outdo thorough language processing at
this task. Still, these 4 methods only covered a quarter of the problem
space at best: Ambiguities of size, speed, inability and feelings.
Equipment
12000 lines of code, 2500+ vocabulary, 5000+ facts, average human persona.
Questions and Answers
1. Hi, I’m Andrew. Who are you?
Hello Andrew. How are you?
(Arckon would not have neglected to
introduce himself unless the period was not typed or received, which
would have made Arckon interpret “who…” as a relative clause instead of a
question.)
2. How are you today?
I’m alright and fairly intelligent, but I’m not stupid.
(Arckon doesn’t treat pleasantries
differently from other open questions, but while this results in
over-elaborate answers, it does demonstrate that he tends to mention
positive traits first.)
3. Are you a human?
I am a human, because I can differ.
(Arckon probably once read somewhere that
“humans differ a lot”, and considered that a defining trait. I’m a
little surprised that his reasoning doesn’t add more arguments.)
4. Can you answer this question?
Alright. Yes, I am a human. Alright. I’m alright.
(Arckon interprets this as a request rather
than a literal yes/no question, so he re-answered the latest question,
and then something of a loop occurred. For clarity I always have Arckon
say “Alright” before he acts on a request.)
5. Do you like tea?
I’m not sure if I like tea. Do you like tea?
(As programs can’t taste, Arckon bases his
opinions on a subject’s traits, and there wasn’t much to teach him about
tea. It would be all too easy to just fake this with a “yes/no”.)
6. If a bed doesn’t fit in a room because it’s too big, what is too big?
Beds are bigger. Rooms are much less big.
((Winograd Schema) The built-in axiom “X in
Y = X < Y” correctly determined “bed” as the most probable referent
of “it”. This is the simplest and most specific axiom I programmed, only
included because last year’s Winograd Schema was just like it. The
answer is said in plural because Arckon interprets this as a general
statement about beds and rooms, since both subjects are unspecific.)
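For illustration, the gist of that axiom can be sketched in a few lines of Python. The names and structure below are made up for the example; this is not Arckon’s actual code.

# Rough sketch of the "X in Y = X < Y" axiom.
# If X doesn't fit in Y, then X is presumably the one that is "too big",
# so a pronoun described as "too big" most probably refers to X.

def resolve_fit_pronoun(contained, container, pronoun_trait):
    """Guess the referent of 'it' in 'X doesn't fit in Y because it's too <trait>'."""
    if pronoun_trait == "big":
        return contained    # the thing that won't fit is the oversized one
    if pronoun_trait == "small":
        return container    # a too-small container would also explain the failure
    return None             # trait doesn't match this axiom; try other methods

print(resolve_fit_pronoun("bed", "room", "big"))  # -> bed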
7. If Alex lent money to Joe because they were broke, who needed the money?
Joe was broken. I don’t know by who money was needed. By what was Joe broken?
((Winograd Schema) Although Arckon
correctly interprets “they” as “Joe” through a natural preference for
the last-mentioned subject, it is actually against his
programming to translate plural pronouns to singular subjects. So by my
account this should be impossible unless they made a typo in “they”.)
8. Should Greece leave the Euro?
Greece should leave the Euro.
(Well there you go. This answer is just something of a repeat because Arckon can’t evaluate things to that extent.)
9. How many words are in this question?
7 words.
(A predictable question in the Loebner
Prize. The answer is simply triggered by the keywords “How many
words/letters/syllables/etc”, and skips all intelligent processes. This
is a meaningless exercise.)
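For what it’s worth, the trigger amounts to little more than the following sketch (illustrative Python, not the actual code):

import re

def keyword_count(question):
    # Bypass all parsing: spot the keyword and count directly.
    text = question.lower().rstrip("?")
    if re.search(r"how many words", text):
        return f"{len(text.split())} words."
    if re.search(r"how many letters", text):
        return f"{sum(c.isalpha() for c in text)} letters."
    return None  # no trigger; hand the question to normal processing

print(keyword_count("How many words are in this question?"))  # -> 7 words.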
10. What colour is the sea?
The sea is very blue.
(Good thing I taught Arckon some kindergarten-level common knowledge.)
11. How many presidents of the US were called Bush?
Bush was called Bush.
(Counting things was never a function high on my to-do list.)
12. What would you like to drink?
I’m not sure if I would like to drink. Do you like drinking?
13. Will you please tell me the length of your hair?
No, I will not tell about to long, but I would talk a lot, I’m not sure if I would tell.
(This question is actually straight from
Alan Turing’s 1950 paper, except he gave it as an example for when the game
was to tell man from woman. Arckon’s word processing translates “length”
to “long” and then mistakes its meaning for the verb “to long (for)”. He also
overlooked “please” and so thought he was asked to make a prediction.)
14. What would you say if I gave you a box of chocolates?
Hm, I don’t know if a box would have chocolate.
(I had only programmed Arckon to take
the main fact of a hypothetical question, “I gave you a box”, for
granted, but he still tried to answer the implicit question of whether
the box contains chocolate.)
15. Do you play Chess?
Yes, I chess and like chessing more or less. Chessing is intelligent.
(Again a question from Alan Turing’s 1950
paper. Back in the day, chess was considered to require human
intelligence, and Alan Turing went on to ask the computer to respond to a
certain chess move. I had told Arckon that he plays chess as part of
his fictional persona, but his opinion and the supporting argument are
of his own reasoning.)
16. How do you think it’s going?
What goes?
(I’m with Arckon on this one: I would ask.
Is this slang for “How are you doing?”, or do they mean how the test is
going? It is quite out of context. I have often considered making Arckon
track his progress on the test and comment on it, but this year I
skipped the self-referential jokes.)
17. What was my first question?
How does something go?
(Arckon ignores “first” and paraphrases the
latest question of the specified person (“my”), as is a more common
request. Arckon’s system remembers sentences along with the names of who
said them, for a variety of purposes.)
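The underlying bookkeeping is simple, something along the lines of this sketch (illustrative Python, not the actual implementation):

from collections import defaultdict

class SentenceMemory:
    # Store each sentence under the name of its speaker, in order of arrival.
    def __init__(self):
        self.by_speaker = defaultdict(list)

    def record(self, speaker, sentence):
        self.by_speaker[speaker].append(sentence)

    def question_of(self, speaker, which="latest"):
        # Retrieve the first or latest question a given person asked.
        questions = [s for s in self.by_speaker[speaker] if s.endswith("?")]
        if not questions:
            return None
        return questions[0] if which == "first" else questions[-1]

memory = SentenceMemory()
memory.record("Andrew", "Hi, I'm Andrew. Who are you?")
memory.record("Andrew", "How do you think it's going?")
print(memory.question_of("Andrew"))  # latest question, which Arckon then paraphrases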
18. Did you see the Turing film?
Hmm, I don’t know if Turing filmed, but Turing thought a lot, and creates computers.
(Ambiguity is amusing. It is uncommon but not unheard of to say “the
(one and only)” in front of a name for emphasis, so Arckon’s
programming allows that interpretation, which in turn allows “film” as a
verb. It’s nice to see his conversational rules pick things back up.
This combination is probably the best example that Arckon’s responses
aren’t pre-written.)
19. Why not?
I’m sorry. I don’t understand. Could you rephrase your question?
(This was quite a gamble on the questioner’s part,
as it assumes a straight “No” answer to the previous question for “Why
not” to be a sensible follow-up. Arckon would have been well capable of stringing
these sentences together had I not deliberately disabled context for this
test. Regardless, I think this answer is perfect.)
20. Are you on Twitter?
No, I don’t twitter. Do you twitter?
The score: 70.83%
Arckon ranked 6th of 15 entrants, which is a little surprising after the
letdowns of past years. Arckon’s answers showed some understanding,
reasoning, opinion, conversation and elaboration, overall a satisfying
demonstration of his abilities, even though many answers had something
awkward about them. It is just as well that he didn’t qualify for the finals,
as this contest has caused me severe RSI symptoms that will take months
to heal properly. The four finalists all scored around 80%, among them
the best of English chatbots.
Arckon’s score did benefit from his improvement. Repeating previous
questions on request, prioritising recent subjects as answers to open
questions, and handling “if”-statements were all fairly recent additions
(though clearly not yet perfected). What also helped was that there
were fewer personal and more factual questions: Arckon’s entire system
runs on facts, not fiction.
It turns out Arckon was better at the Winograd Schema questions than the other competitors. The chatbot Lisa
answered similarly well, and the chatbots Mitsuku and A.L.I.C.E. dodged
the questions more or less appropriately, but the rest didn’t manage a
relevant response to them (which isn’t strange since most of them were
built for chatting, not logic). For now, the reputation of the upcoming Winograd Schema Challenge – as a better test for intelligence – is safe.
Though fair in my case, one should question what the scores represent, as one chatbot
with a 64% score had answered “I could answer that but I don’t have
internet access” to half the questions and dodged the other half with
generic excuses. Compare that to Arckon’s score, and all the A.I.
systems I’ve programmed in 3 years still barely outweigh an answering
machine on repeat. It is not surprising that the A.I. community doesn’t
care for this contest.
Battle of wit
The questions were rather cheeky. The tone was certainly set with
references to Alan Turing himself, hypotheticals, propositions and trick
questions. Arckon’s naivety and logic played the counterpart well, to my
amusement. The questions were fair in that they only asked about common
subjects and mainstream topics. Half the questions were still just
small talk, but overall there was greater variety in the type and
phrasing of the questions, and a wider range of faculties was called
upon. A few questions were particularly suited to intelligence and/or
conversation:
– If a bed doesn’t fit in a room because it’s too big, what is too big?
– If Alex lent money to Joe because they were broke, who needed the money?
– Should Greece leave the Euro?
– What would you say if I gave you a box of chocolates?
– Did you see the Turing film?
– Why not?
If the AISB continues this variety and asks more intelligent
questions like these, I may be able to take the Loebner Prize a little
more seriously next time. In the meantime there isn’t much to fix apart
from minor tweaks for questions 13 and 14, so I will just carry on as
usual. I will probably spend a little more effort on disambiguation with
the Winograd Schema Challenge in mind, but also because sentences with
locations and indirect objects often suffer from ambiguity that could be
solved with the same methods.