Winograd Schema Challenge 2016: Results


Well.

This wasn’t quite the Winograd Schema Challenge that I had set out on. Originally this language comprehension contest for A.I. was announced in July 2014, to be run in October 2015, but was postponed to February 2016, and then again to July 2016. I was just about to ship my program overseas, three weeks before the last-accepted arrival date of postal entries, when the contest announced changes to the rules and technical format.

Some universities had been training with ambiguous pronouns like this:
The birds ate the seeds because they were hungry.

I had been practising on the official Winograd schemas like this:
The foxes are getting in at night and attacking the chickens. I shall have to guard them.

Whereas the final test featured this:
Mark became absorbed in Blaze, the white horse. He was afraid the stable boys at the Burlington Stables struck at him and bullied him because he was timid, so he took upon himself the feeding and care of the animal.

The programs were now faced with any number of consecutively ambiguous pronouns in passages from 1940s children’s novels, which made quite a difference. It turns out the organisers had already decided on this last year, as appears from their sensible enough explanation in a members-only AI magazine (Winograd schemas are too hard to compose). Unfortunately they somehow did not see fit to share these changes on the contest website until too late. While the benchmark of 65% had previously been feasible, it now seemed unlikely that anyone would win anything this year. A number of would-be participants backed out.

The contest finally took place at the IJCAI conference in New York with four contestants: the Open University of Cyprus, the University of Science and Technology of China, the independent Denis Robert from France, and myself from the Netherlands. Curiously absent were a number of American universities that had previously reported successes of over 70% for solving Winograd schemas. The absence of Google, IBM, and other commercial powerhouses was less strange, if you consider that the winner was obligated to publish their methods so that others could reproduce them, and that anything below human level would be portrayed as a failure in the media.

The glass is half full
The programs were asked to figure out 60 multiple-choice pronouns, with such ambiguity that they could only be solved through an understanding of the context. With two to five potential answers per pronoun, the baseline score for guesswork was 45%. $1,000 would be awarded for a 65% score, and $25,000 for a 90% score, which is human level.

(Note: these are the scores after a recount. There was some confusion as my program had omitted two answers.)

| Contestant | Correct answers out of 60 | Method |
|---|---|---|
| Quan Liu | 35 / 35 / 29 (58% – 48%) | deep neural network & ConceptNet |
| Nikos Isaak | 29 (48%) | probabilistic engine & knowledge extraction |
| Patrick Dhondt | 29 (48%) | logical axioms |
| Denis Robert | 19 (32%) | logical inferences |
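
As a quick check of the arithmetic behind these figures, here is a minimal Python sketch. The scores and thresholds are taken from the text above; the function names are made up for illustration, and the contest's actual mix of two-to-five-option questions is not given here, so the guesswork baseline is only shown for the two extreme cases.

```python
TOTAL_QUESTIONS = 60

# Correct answers as listed in the results table.
scores = {
    "Quan Liu (best two versions)": 35,
    "Quan Liu (third version)": 29,
    "Nikos Isaak": 29,
    "Patrick Dhondt": 29,
    "Denis Robert": 19,
}

def percentage(correct, total=TOTAL_QUESTIONS):
    """Share of correct answers as a rounded whole percentage."""
    return round(100 * correct / total)

def answers_needed(threshold_percent, total=TOTAL_QUESTIONS):
    """Smallest number of correct answers that reaches the threshold."""
    return (threshold_percent * total + 99) // 100  # integer ceiling

def chance_baseline(option_counts):
    """Expected accuracy of uniform random guessing: the mean of 1/k."""
    return sum(1 / k for k in option_counts) / len(option_counts)

for name, correct in scores.items():
    print(f"{name}: {correct}/{TOTAL_QUESTIONS} = {percentage(correct)}%")

print(answers_needed(65))  # 39 correct answers for the $1,000 prize
print(answers_needed(90))  # 54 correct answers for the $25,000 prize

# Bounds on guesswork: every question with two options, or every one with five.
print(chance_baseline([2] * TOTAL_QUESTIONS))
print(chance_baseline([5] * TOTAL_QUESTIONS))
```

Guessing yields 50% if every pronoun has two candidates and 20% if every pronoun has five, so a 45% baseline would be consistent with most pronouns having only two candidates. The best score of 35 was four answers short of the 39 needed for the 65% threshold.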

Quan Liu’s group entered three programs, which is a little unorthodox for contests. But if you see this as a scientific test then it makes sense to test which configuration of a neural network works best. Their machine learning approach gathered pairs of events (mainly verbs) that are commonly associated, e.g. “rob -> be arrested”, and then applied their probability of co-occurring. Two of their versions scored the highest, 58%, which is consistent with the track record of similar approaches.
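
To make the idea of associated event pairs concrete, here is a toy Python sketch. It is not Quan Liu's system, which used a deep neural network and ConceptNet; the counts, the role labels, and the function names are all invented for illustration. It only shows how a co-occurrence probability such as "rob -> be arrested" could tip the choice between two candidates.

```python
from collections import Counter

# Hypothetical counts of how often a participant in one event (in a given
# role) later turns up in another event. A real system would learn these
# from a large corpus; these numbers are invented.
pair_counts = Counter({
    (("rob", "subject"), ("be arrested", "subject")): 40,
    (("rob", "object"), ("be arrested", "subject")): 2,
    (("rob", "subject"), ("call police", "subject")): 1,
    (("rob", "object"), ("call police", "subject")): 25,
})

def association(first, second):
    """Estimate P(second | first) from the pair counts."""
    total = sum(n for (a, _), n in pair_counts.items() if a == first)
    return pair_counts[(first, second)] / total if total else 0.0

def resolve(candidates, pronoun_event):
    """Pick the candidate whose own event best predicts the pronoun's event."""
    return max(candidates, key=lambda name: association(candidates[name], pronoun_event))

# "Tom robbed Bill, so he was arrested." / "... so he called the police."
candidates = {"Tom": ("rob", "subject"), "Bill": ("rob", "object")}
print(resolve(candidates, ("be arrested", "subject")))  # Tom
print(resolve(candidates, ("call police", "subject")))  # Bill
```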

The unusual score of Denis Robert’s system, below the 45% guesswork baseline, can largely be explained by the fact that his system was not designed for cases with more than two possible answers, as this was only changed on short notice. However, he also indicated that his algorithm didn’t apply to most of the cases.

There were nevertheless no winners that reached the 65% threshold. On the one hand one could say that technology is literally halfway to human ability; on the other hand, the programs did only a little better than one might by chance. Any conclusion drawn from just the scores would be premature. If this test is to be a meaningful measure of progress, we should look at which areas the programs were better or worse in. I can at least answer this for my own approach.

Winograd schemas vs prose
The ambiguity in the new prose form was actually not so bad compared to previously published Winograd schemas. But the phrasing was often excessively long-threaded, with all sorts of interjected tangents. Although I built my program for reading articles and dialogue alike, I had not covered the grammar of interrupting phrases that break up the main thread of a sentence. Such sentence structures are abundant in novels but do not occur in Winograd schemas, and I wasn’t planning on having my A.I. read novels any time soon. The inclusion of some 1940s vocabulary also complicated matters: “cook-shanty”, “red-letter days”, “a pallid young dandy”? Maybe it’s because I’m Dutch, but I can only guess what these are.

Compared to the wide variety of common sense axioms that I had programmed (see How to teach a computer common sense*), many solutions to the pronouns were ordinary cases of continuity. E.g. a pronoun with an active role typically refers to the last noun with an active role (You won’t find this rule in a grammar book, because ambiguous pronouns are grammatically “incorrect” to begin with).

Always before, Larry had helped Dad with his work. But he could not help him now […]
The donkey wished a wart on its hind leg would disappear, and it did.
Mark was close to Mr. Singer’s heels. He heard him calling for the captain […]

This makes sense when you’re testing on novels: No storyteller wants to write in such a counter-intuitive way that the reader has to stop and think about it, unlike Winograd schemas, which are designed for exactly that purpose.

Where no particular common sense axiom applied, rules of continuity and grammar chose 21 of my 29 correct answers. Thus over two thirds of my success seemed due not to the application of common sense, but to conventional writing. Curious, I ran the test again with all axioms disabled except continuity. The result was an equal number of correct answers, but much more randomly distributed and obviously chosen for the wrong reasons. The common sense axioms clearly contributed by fencing off the exceptions to continuity, so the cause of the mistakes lay elsewhere.
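
The continuity rule described above (a pronoun with an active role refers to the last noun with an active role) can be sketched in a few lines of Python. This is a simplified illustration rather than the program's actual code: the `Mention` class, the "active"/"passive" labels, and the word positions are assumptions, and a real system would need a parser to supply them.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    role: str      # "active" (doer) or "passive" (undergoer); hypothetical labels
    position: int  # word index within the passage

def resolve_by_continuity(pronoun_role, pronoun_position, mentions):
    """Return the most recent earlier noun whose role matches the pronoun's role."""
    same_role = [
        m for m in mentions
        if m.position < pronoun_position and m.role == pronoun_role
    ]
    if not same_role:
        return None  # the continuity rule does not apply
    return max(same_role, key=lambda m: m.position).text

# "Always before, Larry had helped Dad with his work. But he could not help him now"
mentions = [Mention("Larry", "active", 2), Mention("Dad", "passive", 5)]
print(resolve_by_continuity("active", 10, mentions))   # "he"  -> Larry
print(resolve_by_continuity("passive", 14, mentions))  # "him" -> Dad
```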

A closer look at the results
The results below show which of the 60 pronouns my program got correct, which axioms were applicable, and/or which problems hindered their conclusion. Where no axiom applied or a problem occurred, the program defaulted to the grammatically correct choice: The candidate closest to the pronoun. Only a third of all pronouns actually conformed to this grammar rule, which explains why the answer was typically wrong whenever a problem occurred.


I will highlight the most prominent mistakes:
2 & 3. Always before, Larry had helped Dad with his work. But he could not help him now […]
Logic could expect Dad to return the favour, were it not that “always” and “now” suggest a continuity, which the program did not pick up on. Consequently, the answers to both “he” and “him” were switched around. This also illustrates why this test was more difficult than chance: The more ambiguous pronouns a passage contained, the more likely a mistake in one would carry over to the others.

9. What about the time you cut up tulip bulbs in the hamburgers because you thought they were onions?
For this the program compared the similarities of bulbs, hamburgers and onions, but of course knowledge of onions was lacking in the database, so the inference fell flat. Retrieving such knowledge from the internet would slow things down, and though speed is no issue in a contest, in daily practice I want my program to read one page per second, not one sentence per second.

13. […] Antonio, takes Larry under his wing.
People aren’t known to have wings, otherwise the body part location paradox would have excluded Larry from being taken under his own wing. Alternatively one would have to know the figurative meanings of English idioms, an added layer of difficulty.

18. [Maude…] had left poor little Dora to do the best she could, alone.
The program took “to…” to mean “in order to”, i.e. Maude’s reason for leaving. The pronoun wasn’t the only ambiguous word in this case.

30. […] Mr. Moncrieff has decided to cancel Edward’s allowance on the ground that he no longer requires his financial support.
“Backward” = “back”, “Southward” = “south”, therefore “Edward” = “Ed”. Although the pronoun was interpreted correctly, “Ed” was of course not found among the multiple choice answers.

40. Every day after dinner Mr. Schmidt took a long nap. Mark would let him sleep for an hour, then wake him up, scold him, and get him to work. He needed to get him to finish his work, because his work was beautiful.
As I mentioned in my previous post, the “what goes around comes around” axiom was the least reliable, causing five misinterpretations in this test. Sometimes it triggered on trivial events, other times the events did not make sense this way (scolding to get someone to do something positive). It had better be limited to events that are direct cause and result, as they had been in most Winograd schemas.

49. Of one thing Mark was sure. Harry knew much less than he did.
Consecutive mental activities are typically by the same person, but of course not when it’s a comparison. Though the context system does distinguish comparisons, the axioms did not.

56. Tatyana managed two guitars and a bag, and still could point out the Freemans: “Isn’t it nice that they have come, Mama!”
While the pronoun was interpreted correctly, there was a technical hitch with selecting “freemans” from the multiple choice answers, due to the name having a plural -s.

59. Grant worked hard to harvest his beans so he and his family would have enough to eat that winter. His friend Henry let him stack them […]
“enough” was internally translated to “enough beans” but lost its plural status in the translation, after which the beans were no longer considered a candidate for plural “them”.

Most of these problems are easily fixed and are not inherent to the common sense axioms, apart from #40 and its like. The majority of problems were instead linguistic: Small flaws in the grammar rules, difficulty with long-threaded phrasing, limited coverage of the context system, and problems with the contest’s XML-format interface. It just goes to show how perfect every part of the system has to be before it pays off, and how little you can tell about a program’s abilities from the surface.
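
As an illustration of how small such a fix can be, here is a Python sketch of a more forgiving way to match an interpreted referent against the multiple-choice answers, as in #56. The actual fix is not described here, so the function names, the normalisation rules, and the list of options are all assumptions.

```python
def normalise(name):
    """Lower-case a name, drop a leading article and a trailing plural -s."""
    words = name.lower().split()
    if words and words[0] in {"the", "a", "an"}:
        words = words[1:]
    text = " ".join(words)
    return text[:-1] if text.endswith("s") and len(text) > 3 else text

def match_option(referent, options):
    """Return the multiple-choice option that matches the referent, if any."""
    for option in options:  # prefer an exact, case-insensitive match
        if option.lower() == referent.lower():
            return option
    target = normalise(referent)
    for option in options:  # otherwise compare the normalised forms
        if normalise(option) == target:
            return option
    return None

# Hypothetical answer options for "they" in #56.
options = ["Tatyana", "Mama", "the Freemans"]
print(match_option("freemans", options))  # the Freemans
```

Trying the exact match first keeps names that genuinely end in -s from being confused with a shorter name, while the normalised pass catches differences in case, articles, and plural endings.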

The language barrier
As a test of common sense I found this setup less suitable than the original plan with Winograd schemas, which were more concise, and more profound in the areas of common sense they tested (e.g. spatial relations, physics, social interactions). Had I known from the start that the qualifying round would mainly feature novel prose, I would probably not have embarked on this challenge, knowing that my grammar parser wasn’t up to it. As it was, the prose passages contained too many variables to tell whether results were due to language or common sense, and it never got to the Winograd schema round. This puts us back at the Turing Test where it’s either everything or nothing, and that is not a useful measure of progress. Swapping the rounds would be a good idea for next time.

It was nice to see serious competitors with a wide variety of technology tackling the problem, and although the overall results are unimpressive, I am pleased that my partial solution did as well as some academic efforts, with a minimum of resources at that. I am not disappointed in my common sense axioms as many of them were well applicable in this test, including for pronouns that weren’t graded. I will broaden their application to ambiguous locations and indirect object relations, where I have greater need for them.

However, my main interest is the development of intelligent processes and I do not intend to linger on this aspect of language processing more than necessary. It is worth remembering that much can be said without ambiguity. Though common sense has widespread application, it ultimately serves to filter and limit possibilities, while the possibilities in areas like problem solving and planning have yet to expand. For that reason I do not expect human levels of common sense to be reached within ten years either, but we can certainly make strides towards it.
