Is the Winograd Schema Challenge a good test?

The Winograd Schema Challenge, a $25,000 contest sponsored by the aptly named company Nuance Communications, has been put forth as a better test of intelligence than Turing Tests*. Although the scientific paper tiptoes around its claims, the organisers describe the contest as requiring “common sense reasoning”. This introductory article examines the test’s strengths and weaknesses in that regard.

Example of a Winograd Schema
I used a tissue to clean the key, and then I put it in the drawer.
I used a tissue to clean the key, and then I put it in the trash.
A Winograd Schema is a sentence with an ambiguous pronoun (“it”) that, depending on one variable word (“trash/drawer”), refers to either the first or the second noun of the sentence (“tissue/key”). The Challenge is to program a computer to figure out which of the two is being referred to when this isn’t apparent from the syntax. So what did I put in the trash: the tissue or the key? To a computer that has never cleaned anything, it could be either. A little common sense would sure come in handy, and the contest organisers suggest that this takes intelligent reasoning.
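To make the task concrete, here is a minimal sketch of how such a schema pair might be represented in a program. The field names are my own invention; the contest itself only supplies the sentences and the answer choices.

```python
# A minimal, hypothetical representation of a Winograd schema pair.
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence: str       # the sentence containing the ambiguous pronoun
    pronoun: str        # the pronoun to resolve, e.g. "it"
    candidates: tuple   # the two candidate nouns
    correct: str        # the intended referent

schema_pair = (
    WinogradSchema("I used a tissue to clean the key, and then I put it in the drawer.",
                   "it", ("tissue", "key"), correct="key"),
    WinogradSchema("I used a tissue to clean the key, and then I put it in the trash.",
                   "it", ("tissue", "key"), correct="tissue"),
)
```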

Common sense, not Google sense
The hare beat the tortoise because it was faster.
The hare beat the tortoise because it was too slow.
Contrary to this example, good Winograd Schemas are supposed to be Google-proof: Here, Googling “fast hare” would return about 20 times more search results than “fast tortoise”, so the hare is statistically 20 times more likely to be the one that “was faster”. Although statistical probability is certainly useful, the contest would then simply be won by the company with the largest collection of statistics. It takes no reasoning to count how often word A happens to coincide with word B in a large volume of text. This example would therefore preferably be written with neutral nouns, like “John beat Jack”: subjects of whom we have no pre-existing knowledge, but for whom we can still figure out which was faster.
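As a rough sketch of that statistical shortcut, the whole “method” reduces to comparing co-occurrence counts. The hit_count() function below is a stand-in for a search engine or corpus query, and the figures are invented for illustration.

```python
# The "Google sense" shortcut: pick whichever candidate noun co-occurs more
# often with the decisive word. hit_count() stands in for a search engine or
# corpus query; the counts are made up.

def hit_count(phrase):
    made_up_counts = {"fast hare": 200000, "fast tortoise": 10000}
    return made_up_counts.get(phrase, 0)

def resolve_by_statistics(candidates, decisive_word):
    # Choose the candidate that co-occurs most often with the decisive word.
    return max(candidates, key=lambda noun: hit_count(decisive_word + " " + noun))

print(resolve_by_statistics(("hare", "tortoise"), "fast"))  # -> "hare"
```

Note that nothing in this sketch reasons about anything; it merely counts.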

Having said that, some example schemas involving “crop dusters” and “bassinets” still suggest that a broad range of knowledge will be required. Although one could consult online dictionaries and databases, the contest will have restrictions on internet access to rule out remote control. So failure can also be due to insufficient knowledge rather than a lack of intelligence, but I suppose that is part of the problem to solve.

Early indications
If a bed doesn’t fit in a room because it’s too big, what is too big?
If Alex lent money to Joe because they were broke, who needed the money?
With the above two questions the 2015 Loebner Prize Turing Test* gave a tiny glimpse of Winograd Schemas in practice, and the answers suggest that chatbots – the majority of participants – are not cut out to handle them. Only 2 of the 15 programs even answered what was asked: one was my personal A.I. Arckon, the other was Lisa. Chatbot systems are of course designed for chat, not logic puzzles, and typically rely on their creators to anticipate the exact words that a question will contain. The problem there is that the understanding of Winograd Schemas isn’t found in which words are used, but in the implicit relations between them. Or so we presume.

The mermaid swam toward Sue and waved her tail. (Googleable)
The mermaid swam toward Sue and made her gasp. (More than a single change)
A more noteworthy experiment was done at the University of Texas, tested on Winograd Schemas composed by students. To solve the schemas they used a mixed bag of methods based on human logic, such as memorising sequences of events (i.e. verb A -> verb B), common knowledge, sentiment analysis, and the aforementioned Googling. All of this data was cleverly extracted from text by A.I. software, or retrieved from online databases. However, many of the schemas did not accord with the official guidelines, and though they usefully solved 73% in total, only 65% were solved without the use of Google.
According to the same paper, the industry standard “Stanford Coreference Resolver” correctly solved only 55% of the same Winograd Schemas. The Stanford Resolver restricts the possible answers by syntax, gender (“he/she”) and grammatical number (“it/they”), but does not examine them through knowledge or reasoning. The reason is that this level of ambiguity is rare. In my experience with the same methods, however, it is still a considerable problem that causes a tenth of text-extracted knowledge to be mistaken, with the pronoun “it” being the worst offender. So it appears (see what I mean?) that any addition of common sense would already advance the state of the art.
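For illustration (this is not the Stanford Resolver’s actual code, just the general idea of filtering by agreement), restricting candidates by grammatical number might look like the sketch below, with a toy lexicon I made up.

```python
# A toy illustration of restricting candidate antecedents by number agreement
# before any knowledge or reasoning is applied. The lexicon is invented.

PRONOUN_NUMBER = {"he": "singular", "she": "singular", "it": "singular", "they": "plural"}
NOUN_NUMBER = {"bed": "singular", "room": "singular", "keys": "plural"}

def agreeing_candidates(pronoun, nouns):
    """Keep only the nouns whose grammatical number matches the pronoun's."""
    wanted = PRONOUN_NUMBER[pronoun]
    return [noun for noun in nouns if NOUN_NUMBER.get(noun) == wanted]

print(agreeing_candidates("it", ["bed", "room", "keys"]))  # -> ['bed', 'room']
```

In the bed/room example both nouns survive the filter, which is exactly where knowledge would have to take over.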

How to hack Winograd Schemas
Guesswork: Since the answers are a simple choice of two nouns, a machine could of course randomly guess its way to a score of 50% or more. So I did the math: With 60 schemas to solve, pure guesswork has roughly a 5% chance of scoring over 60%, and well under a 1% chance of scoring over 65%. With the odds shrinking exponentially, this is not a winning tactic.
That said, the participating A.I. still have to make a guess or default choice on those schemas that they fail to solve otherwise. If an A.I. can solve 30% of the schemas and guesses half of the rest right, its total score amounts to 65%, equalling the Texas score. It wouldn’t be until it can genuinely solve around 80% of all schemas that it could reach the winning 90% score by guessing the final stretch. That’s a steep slope.
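For anyone who wants to check these figures, a small script along the following lines (using the exact binomial distribution, with 60 schemas assumed) reproduces them.

```python
# Verify the guesswork figures: the chance that pure guessing (a coin flip per
# schema) scores above a given number out of 60, and the expected score when a
# fraction is solved genuinely and the remainder is guessed.

from math import comb

def chance_to_exceed(n, threshold):
    """Probability of getting strictly more than `threshold` answers right by guessing."""
    return sum(comb(n, k) for k in range(threshold + 1, n + 1)) / 2 ** n

print(round(chance_to_exceed(60, 36), 3))  # over 60% correct: about 0.05
print(round(chance_to_exceed(60, 39), 3))  # over 65% correct: well under 0.01

def expected_score(solved_fraction):
    """Genuinely solved schemas plus half of the remainder guessed correctly."""
    return solved_fraction + 0.5 * (1 - solved_fraction)

print(expected_score(0.3))  # -> 0.65
print(expected_score(0.8))  # -> 0.9
```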

Reverse psychology: Since Winograd Schemas are deliberately made not to match Google search results, it seems that one could apply reverse psychology and deliberately choose the opposite. While I did notice such a tendency in Winograd Schemas composed by professors, others have noticed that Winograd Schemas composed by students simply did match Google search results. So the success of reverse psychology depends heavily on the cleverness of the composers. A countermeasure would be to use only neutral names for answers, but this may also cut off some areas of genuine reasoning. Alternatively one could include an equal number of schemas that match and mismatch Google search results, so that neither method is reliable.

Pairing: One cheat that could double one’s success lies in the fact that Winograd Schemas come in pairs, where the answer to the second version is always the alternate noun. So if the A.I. can solve the first version but not the second, it suffices to choose the remaining alternate answer, and vice versa if it can solve the second version but not the first (see the sketch below). This rather undermines the reason for having pairs: to ascertain that the first answer wasn’t just a lucky guess. Although this hack only increases the success of guesswork by a few percent, it could certainly turn a weak contestant into an undeservedly strong contender.
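Reusing the toy schema representation sketched earlier, the exploit amounts to a few lines. Here solve() stands in for whatever genuine resolution method a contestant has, returning None on failure.

```python
# A sketch of the pairing exploit: if only one half of a schema pair can be
# solved, answer the other half with the alternate noun. Each schema object is
# assumed to have a .candidates attribute holding the two nouns.

def answer_pair(pair, solve):
    first, second = pair
    a1, a2 = solve(first), solve(second)
    if a1 is not None and a2 is None:
        # Pick the noun that was NOT the answer to the first version.
        a2 = next(n for n in second.candidates if n != a1)
    elif a2 is not None and a1 is None:
        a1 = next(n for n in first.candidates if n != a2)
    return a1, a2
```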
I call these hacks because not only are they against the intent of the test, they are also entirely useless in real-life applications. No serious researcher should use them, or they will end up with an inept product.

How you can’t hack Winograd Schemas
No nonsense: The judgement of the answers is clear and objective. There is only one correct answer to each schema. The A.I. are not allowed to dodge the question, make further inquiries or give answers open to interpretation: It’s either answer A or B.

No humans: Erratic human performance of the judges and control subjects does not influence the results. The schemas and answers have been carefully predetermined, and schemas with debatable answers simply do not make the cut.

No invisible goal: While the Turing Test is strictly a win-or-lose game with the goalposts at fields unknown, the WSC can reward a gradual increase in the number of schemas answered correctly. Partial progress in one area of common sense, like spatial reasoning, can already show improved results, and some areas are already proving feasible. This encourages and rewards short-term efforts.
I must admit that the organisers could still decide to move the goalposts out of reach every year by omitting particular areas of common sense once solved. I think this is even likely to happen, but at the same time I expect the solutions to cover such a broad range that it will become hard to still find new problems after 6 contests.

Mainly, the WSC trims many subjective variables from the Turing Test, making for a controlled test with clear results.

The Winograd Schema Challenge beats the Turing Test
From personal experience, the Turing Tests that I participated in have at best forced me to polish my A.I.’s output to sound less robotic. That is because in Turing Tests, appearance is the first priority if one doesn’t want to be outed immediately at the first question, regardless of how intelligent the answer is. Since keeping up appearances is an enormous task in itself, one barely gets around to programming intelligence. I’ve had to develop spell correction algorithms, gibberish detection, letter-counting game mechanics, and a fictional background story before encountering the first intelligent question in a Turing Test. It stalls progress with unintelligent aspects and is discouragingly unrewarding.

Solving Winograd Schemas on the other hand forced me to program common sense axioms, which can do more than just figure out what our pronouns refer to. Indirect objects and locations commonly suffer from even worse ambiguity that can be solved by the same means, and common sense can be used to distinguish figurative speech and improve problem-solving. But I’ll leave that as a story for next time.
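To give a flavour of what such an axiom can do (this is a toy illustration, not my actual program), the tissue/key example can be resolved with a single rule about cleaning and two assumptions about where clean and dirty things belong.

```python
# A toy common sense axiom: cleaning something with a tool makes the tool dirty
# and the cleaned object clean; dirty things go in the trash, clean things in the drawer.

def resolve_pronoun(tool, cleaned_object, destination):
    """Return which noun 'it' most plausibly refers to, given where it was put."""
    state = {tool: "dirty", cleaned_object: "clean"}   # the effect of cleaning
    wanted = {"trash": "dirty", "drawer": "clean"}[destination]
    return next(noun for noun, s in state.items() if s == wanted)

print(resolve_pronoun("tissue", "key", "trash"))   # -> "tissue"
print(resolve_pronoun("tissue", "key", "drawer"))  # -> "key"
```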

We should be careful about drawing conclusions from yet another behavioural test, but whatever the Winograd Schema Challenge is supposed to prove, it offers a practical test of understanding language with a focus on common sense. As this has always been a major obstacle for computers, the resulting solutions are bound to be useful regardless of how “intelligent” they may be found.

Read more in my report on the first Winograd Schema Challenge held in 2016*
