The Winograd Schema Challenge, a $25,000 contest sponsored by the aptly named company Nuance Communications, has been put forth as a better test of intelligence than Turing Tests*. Although the scientific paper tiptoes around its claims, the organisers describe the contest as requiring “common sense reasoning”. This introductory article examines the test’s strengths and weaknesses in that regard.
Example of a Winograd Schema
A Winograd Schema is a sentence with an ambiguous pronoun (“it”) that, depending on one variable word (“trash/drawer”), refers to either the first or the second noun of the sentence (“tissue/key”). The Challenge is to program a computer to figure out which of the two is being referred to when this isn’t apparent from the syntax. So what did I put in the trash? The tissue or the key? To a computer that has never cleaned anything, it could be either. A little common sense would sure come in handy, and the contest organisers suggest that this is exactly what the test demands.
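To make the task concrete, here is a minimal sketch in Python of the data a solver receives and the answer it must produce. The sentence wording is my own illustration of the tissue/key schema, not the official text:

```python
# A Winograd Schema as a solver receives it: one sentence with an
# ambiguous pronoun, two candidate nouns, and one variable word that
# flips the answer. The wording is illustrative, not an official schema.
schema = {
    "sentence": "I wiped the key with the tissue and put it in the {}.",
    "pronoun": "it",
    "candidates": ("tissue", "key"),
    "variants": {"trash": "tissue",    # variable word -> correct referent
                 "drawer": "key"},
}

for word, answer in schema["variants"].items():
    print(schema["sentence"].format(word), "->", answer)
```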
Common sense, not Google sense
Not every conceivable schema is a good one: good Winograd Schemas are supposed to be Google-proof. Consider a schema that hinges on whether the hare or the tortoise “was faster”: Googling “fast hare” returns 20x more search results than “fast tortoise”, so the hare is statistically 20x more likely to be the one who “was faster”. Although statistical probability is certainly useful, this would make the contest winnable simply by the company with the largest set of statistics. It takes no reasoning to count how many times word A happens to coincide with word B in a large volume of text. Such an example would therefore preferably be written with neutral nouns, like “John beat Jack”, subjects of whom we have no pre-existing knowledge, but of whom we can still figure out which one “was faster”.
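To show just how little reasoning the statistical method takes, here is a rough sketch. The `corpus_count()` function is a stand-in for whatever hit-count source one has (a search engine or an n-gram corpus), and its numbers are made up to match the 20x ratio:

```python
def corpus_count(phrase: str) -> int:
    """Stand-in for a search engine hit count or n-gram corpus lookup."""
    made_up_counts = {"fast hare": 200_000, "fast tortoise": 10_000}
    return made_up_counts.get(phrase, 0)

def statistical_guess(candidates, context_word):
    """Pick whichever noun co-occurs more often with the context word.
    No reasoning involved: it merely counts word A next to word B."""
    return max(candidates,
               key=lambda noun: corpus_count(f"{context_word} {noun}"))

print(statistical_guess(("hare", "tortoise"), "fast"))  # -> hare
```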
Having said that, some example schemas
involving “crop dusters” and “bassinets” still suggest that a broad
range of knowledge will be required. Although one could consult online
dictionaries and databases, the contest will have restrictions on
internet access to rule out remote control. So failure can also be due
to insufficient knowledge rather than a lack of intelligence, but I
suppose that is part of the problem to solve.
The 2015 Loebner Prize Turing Test* included two Winograd Schema questions, giving a tiny glimpse of them in practice, and the answers suggest that chatbots (the majority of participants) are not cut out to handle them. Only 2 of the 15 programs even answered what was asked: one was my personal A.I. Arckon, the other was Lisa.
Chatbot systems are of course designed for chat, not logic puzzles, and
typically rely on their creators to anticipate the exact words that a
question will contain. The problem there is that the understanding of
Winograd Schemas isn’t found in which words are used, but in the
implicit relations between them. Or so we presume.
A more noteworthy experiment was done at the University of Texas, tested on Winograd Schemas composed by students. To solve the schemas, the researchers used a mixed bag of methods based on human logic, such as memorising sequences of events (i.e. verb A -> verb B), common knowledge, sentiment analysis, and the aforementioned Googling. All of this data was cleverly extracted from text by A.I. software or retrieved from online databases. However, many of the schemas did not accord with the official guidelines, and though they usefully solved 73% in total, only 65% were solved without the use of Google.
According to the same paper, the industry-standard “Stanford Coreference Resolver” correctly solved only 55% of the same Winograd Schemas. The Stanford Resolver restricts the possible answers by syntax, gender (“he/she”) and number (“it/they”), but does not examine them through knowledge or reasoning, because this level of ambiguity is rare. In my experience with the same methods, however, it is still a considerable problem that causes 1/10th of text-extracted knowledge to be mistaken, with the pronoun “it” being the worst offender. So it appears (see what I mean?) that any addition of common sense would already advance the state of the art.
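The kind of restriction the Stanford Resolver applies can be sketched in a few lines. This is my own simplification of the idea, not its actual implementation:

```python
# Agreement filter in the spirit of the Stanford Resolver's restrictions:
# discard candidates whose gender or number clashes with the pronoun.
PRONOUN_FEATURES = {
    "he":   ("male",   "singular"),
    "she":  ("female", "singular"),
    "it":   ("neuter", "singular"),
    "they": (None,     "plural"),    # None matches any gender
}

def agreement_filter(pronoun, candidates):
    """candidates: list of (noun, gender, number) tuples."""
    gender, number = PRONOUN_FEATURES[pronoun]
    return [noun for noun, g, n in candidates
            if n == number and gender in (None, g)]

# "she" narrows the field to one noun, but "it" leaves both standing:
print(agreement_filter("she", [("John", "male", "singular"),
                               ("Mary", "female", "singular")]))   # ['Mary']
print(agreement_filter("it",  [("tissue", "neuter", "singular"),
                               ("key", "neuter", "singular")]))    # both
```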
How to hack Winograd Schemas
Guesswork: Since the answers are a simple choice of two nouns, a machine could of course randomly guess its way to a score of 50% or more. So I did the math: with 60 schemas to solve, pure guesswork has only about a 5% chance to score over 60%, and well under a 1% chance to score over 65%. With the odds shrinking exponentially from there, this is not a winning tactic. That said, the participating A.I. still have to make a guess or default choice at those schemas that they fail to solve otherwise. If an A.I. can solve 30% of the schemas and guesses half of the rest right, its total score amounts to 65%, equalling the Texas score. Only once it can genuinely solve around 80% of all schemas could it reach the winning 90% score by guessing the final stretch. That’s a steep slope.
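For those who want to verify the math, the odds follow from the binomial distribution:

```python
from math import comb

def p_at_least(k: int, n: int = 60, p: float = 0.5) -> float:
    """Chance of guessing at least k of n schemas correctly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"{p_at_least(37):.1%}")  # over 60% correct (37+ of 60): ~4.6%
print(f"{p_at_least(40):.1%}")  # over 65% correct (40+ of 60): ~0.7%

# Solving 18 of 60 (30%) and coin-flipping the remaining 42 yields on
# average 18 + 21 = 39 correct, i.e. 65%: the same as the Texas result.
```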
Reverse psychology: Since Winograd Schemas are deliberately
made to not match Google search results, it seems that one can apply
reverse psychology and deliberately choose the opposite. While I did
notice such a tendency in Winograd Schemas composed by professors,
others have noticed that Winograd Schemas composed by students simply
did match Google search results. So the success of using reverse
psychology heavily depends on the cleverness of the composers. A
countermeasure would be to use only neutral names for answers, but this
may also cut off some areas of genuine reasoning. Alternatively, one could include an equal number of schemas that match and mismatch Google search results, so that neither method is reliable.
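In code, this hack is a one-line inversion of the statistical sketch from earlier, swapping max for min:

```python
def reverse_psychology_guess(candidates, context_word):
    """Bet that the composer picked the statistically unlikely answer:
    choose the noun that co-occurs LEAST often with the context word.
    Reuses corpus_count() from the earlier sketch."""
    return min(candidates,
               key=lambda noun: corpus_count(f"{context_word} {noun}"))
```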
Pairing: One cheat that could double one’s success lies in the
fact that Winograd Schemas come in pairs, where the answer to the
second version is always the alternate noun. So if the A.I. can solve
the first version but not the second, it suffices to choose the
remaining alternate answer. Vice versa if it can solve the second
version but not the first. This rather undermines the reason for having
pairs: To ascertain that the first answer wasn’t just a lucky guess.
Although this hack only increases the success of guesswork by a few percent, it could certainly turn a weak contestant into an undeservedly strong contender.
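A sketch of this cheat, assuming a solver that returns None for a schema it cannot solve:

```python
def answer_pair(schema_a, schema_b, candidates, solver):
    """Exploit the pairing: if the solver cracks one half of a pair but
    not the other, the unsolved half's answer must be the remaining
    candidate, since the two halves always have alternate answers."""
    a, b = solver(schema_a), solver(schema_b)
    if a is None and b is not None:
        a = next(c for c in candidates if c != b)
    elif b is None and a is not None:
        b = next(c for c in candidates if c != a)
    return a, b
```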
I call these hacks because they not only go against the intent of the contest, they are also entirely useless in real-life applications. No serious researcher should use them, or they will end up with an inept product.
How you can’t hack Winograd Schemas
No nonsense: The judgement of the answers is clear and objective.
There is only one correct answer to each schema. The A.I. are not allowed to dodge the question, make further inquiries, or give answers that are open to interpretation: it’s either answer A or B.
No humans: The erratic performance of human judges and control subjects does not influence the results. The schemas and answers have
been carefully predetermined, and schemas with debatable answers simply
do not make the cut.
No invisible goal: While the Turing Test is strictly a win-or-lose game with the goalposts at fields unknown, the WSC can reward a gradual increase in the number of schemas answered correctly.
Partial progress in one area of common sense like spatial reasoning can
already show improved results, and some areas are already proving
feasible. This encourages and rewards short-term efforts.
I must admit that the organisers could still decide to move the goalposts out of reach every year by omitting particular areas of common sense once they are solved. I think this is even likely to happen, but at the same time I expect the solutions to cover such a broad range that it will become hard to find genuinely new problems after six contests.
Mainly, the WSC trims many subjective variables from the Turing Test, making for a controlled test with clear results.
The Winograd Schema Challenge beats the Turing Test
From personal experience, the Turing Tests that I have participated in at best forced me to polish my A.I.’s output to sound less robotic. That is because in a Turing Test, appearance is the first priority if one doesn’t want to be outed at the very first question, regardless of how intelligent the answer is. Since keeping up appearances is an enormous task in itself, one barely gets around to programming intelligence. I’ve had to develop spell correction algorithms, gibberish detection, letter-counting game mechanics, and a fictional background story before encountering the first intelligent question in a Turing Test. It stalls progress with unintelligent aspects.
Solving Winograd Schemas, on the other hand, forced me to program common sense axioms, which can do more than just figure out what our pronouns refer to. Indirect objects and locations commonly suffer from even worse ambiguity that can be solved by the same means, and common sense can also be used to distinguish figurative speech and improve problem-solving. But I’ll leave that as a story for next time.
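To give a taste in the meantime, here is a deliberately simplified illustration (not Arckon’s actual code) of what a common sense axiom can look like when applied to the trash/drawer example:

```python
# One hand-written common sense axiom; a real system needs a great many.
AXIOMS = {"trash": "disposable", "drawer": "keepable"}  # location -> property
PROPERTIES = {"tissue": {"disposable"},   # a used tissue gets thrown away
              "key": {"keepable"}}        # a key is kept for reuse

def resolve_pronoun(location, candidates):
    """Pick the sole candidate whose known properties fit the location."""
    wanted = AXIOMS[location]
    fits = [c for c in candidates if wanted in PROPERTIES.get(c, set())]
    return fits[0] if len(fits) == 1 else None

print(resolve_pronoun("trash", ["tissue", "key"]))   # -> tissue
print(resolve_pronoun("drawer", ["tissue", "key"]))  # -> key
```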
We should be careful about drawing conclusions from yet another behavioural test, but whatever the Winograd Schema Challenge is supposed to prove, it offers a practical test of language understanding with a focus on common sense. As this has always been a major obstacle for computers, the resulting solutions are bound to be useful, regardless of how “intelligent” they may be deemed.
Read more in my report on the first Winograd Schema Challenge held in 2016*