An ironically long article about a summariser browser add-on.
Introductory anecdote:
Due to my interest in artificial intelligence I can’t help but get
exposed to online articles about the subject. But as illustrated in the previous article*,
this particular field is flooded with speculative futurism, uninformed
opinions and sheer clickbait, wasting my time more often than not.
But I also happen to be an amateur language programmer, so I can do something about it. I spent years developing an A.I. program
that can comprehend text through grammar and semantics, and I figured I
might as well put it to use. So I added a function that would read
whatever document was on my screen, filter out all unimportant
sentences, and show me the remainder. It worked pretty well, and
required surprisingly few of the A.I.’s resources. Now, I’ve ported this
summarisation function to a browser add-on, so that everyone can
summarise online articles at the click of a button:
Problem statement: Statistics are average
Document summarisers do of course already exist, and their methods are inventively inhuman:
• The simplest method, used in e.g. SMMRY,
counts how often each word occurs in the text, and then picks out
sentences that contain the most-occurring words, which are presumably
the main topics. Common words like “the” should of course be ignored,
either with a simple blacklist, or with another word-counting technique
by the confusing name “Term Frequency – Inverse Document Frequency”: How
frequently a word occurs in the text versus how common it is in the English language.
• Another common method
looks at each paragraph and picks out one sentence that has the most
words in common with its neighbouring sentences, therefore covering the
most of the paragraph’s subject matter. Sentence length is factored in
so that it won’t just always pick the longest sentence.
• The most advanced method, “Latent Semantic Analysis”, picks out
sentences that contain frequently occurring, strongly associated words.
i.e. words that are often used together in a sentence are presumably
associated with one and the same topic. This way synonyms of the main
topics are also covered.
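To make the first of these methods concrete, here is a minimal sketch of a word-counting summariser. It is not SMMRY's actual code: the tiny stop-word blacklist, the naive sentence splitter and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative blacklist (an assumption); a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "it", "that", "on", "are"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    # Lowercased words with the blacklisted ones removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def frequency_summary(text, keep=1):
    sentences = split_sentences(text)
    counts = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        ws = content_words(sentence)
        # Average count per word, so long sentences do not win on length alone.
        return sum(counts[w] for w in ws) / len(ws) if ws else 0.0

    top = sorted(sentences, key=score, reverse=True)[:keep]
    # Return the chosen sentences in their original order.
    return [s for s in sentences if s in top]

text = "Cats chase mice. Cats sleep a lot. The weather is nice today."
print(frequency_summary(text, keep=2))
```

Note that a sentence only scores well here if it repeats the words of the other sentences, whatever its role in the text.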
In my experience, however, I observed one problem with these
statistical methods: Although they succeeded in retrieving an average of
the subject matter, they tended to omit the point that the writer was
trying to make, and that is the one thing I want to know. This oversight
stands to reason: A writer’s conclusion is often just one or two
sentences near the end, so its statistical footprint is small, and like
an answer to a question, it doesn’t necessarily share many words with
the rest of the article. I decided to take a more psychological
approach. Naturally, I ended up re-inventing a method that dates all the
way back to 1968.
A writer’s approach to summarisation
My target for the summariser add-on was a combination of two things: It
should extract what the writer found important, minus what I find
unimportant. Unimportant being things like introductions, asides,
examples, vague statements, speculation and other weak arguments.
Word choice
While writing styles vary, all writers choose their words to emphasise
or downplay what they consider important. Consider the difference
between “This is very important.” and “Some may consider this important.” In
a way the writer has already filtered the information for you. With
this understanding, I set the summariser to look for several types of
cues in the writer’s choice of words:
• Examples: “e.g.”, “for instance”, “among other”, “just one of”
• Uncertainty: “may”, “suppose”, “conjecture”, “question”, “not clear”
• Commonly known: “standard”, “as usual”, “of course”, “obvious”
• Advice: “recommendation”, “require”, “need”, “must”, “insist”
• Main arguments: “problem”, “goal”, “priority”, “conclude”, “decision”
• Literal importance: “negligible”, “insignificant”, “vital”, “valuable”
• Strong opinions: “horrible”, “fascinate”, “astonishing”, “extraordinary”
• Amounts: “some”, “a few”, “many”, “very”, “huge”, “millions”
At this point one may be tempted to take a statistical approach again
and score each sentence for how many positive and negative cues it
contains, but that’s not quite right: There is a hierarchy to the cues
because they differ in meaning. For example, uncertainty like “maybe
very important” makes for a weak argument no matter how many positive
cues it contains. So each type of cue is given a certain level of
priority over others. The exact hierarchy is a delicate matter of
tuning, but it is roughly the order listed, with negative cues typically
overruling positive cues.
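The difference between counting cues and ranking them can be sketched as follows. This is only an illustration under stated assumptions: the tiers use a handful of the cue words listed above, the +1/-1 polarities and the name `classify` are mine, "maybe" and "important" were added to cover the example sentences, and negative importance words such as "negligible" are left out for brevity.

```python
import re

# Cue types in priority order, highest first. Polarity: -1 = unimportant, +1 = important.
CUE_HIERARCHY = [
    ("example", -1, ["e.g.", "for instance", "among other", "just one of"]),
    ("uncertainty", -1, ["may", "maybe", "suppose", "conjecture", "not clear"]),
    ("commonly known", -1, ["standard", "as usual", "of course", "obvious"]),
    ("advice", +1, ["recommendation", "require", "need", "must", "insist"]),
    ("main argument", +1, ["problem", "goal", "priority", "conclude", "decision"]),
    ("literal importance", +1, ["vital", "valuable", "important"]),
    ("strong opinion", +1, ["horrible", "astonishing", "extraordinary"]),
]

def classify(sentence):
    """Return (cue type, polarity) of the highest-priority cue found."""
    text = sentence.lower()
    for cue_type, polarity, phrases in CUE_HIERARCHY:
        for phrase in phrases:
            # Match whole words or phrases only, so "may" does not match "maybe".
            if re.search(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", text):
                return cue_type, polarity
    return None, 0

print(classify("This is very important."))        # decided by the importance cue
print(classify("This is maybe very important."))  # uncertainty outranks it
```

Because the first matching tier decides, one uncertainty cue overrules any number of positive cues lower in the list. A real version would also need to handle inflections like "requires", which this exact-match lookup misses.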
Another aspect that must be taken into account is that amounts affect the cues in linear order:
“It is not important to read” is not equal to “It is important not to read”, even if they contain the same words. Only the latter should be included in the summary.
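A minimal sketch of this left-to-right effect, assuming that a word like "not" only modifies a cue that follows it within a couple of words. The word lists, the `window` parameter and the function name are hypothetical.

```python
import re

NEGATIONS = {"not", "no", "never"}                 # assumed list
POSITIVE_CUES = {"important", "vital", "valuable"}  # assumed list

def cue_polarity(sentence, window=2):
    """Return +1, -1 or 0 for the first importance cue, scanning left to right."""
    reach = 0  # how many more words the latest negation still covers
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in NEGATIONS:
            reach = window
        elif word in POSITIVE_CUES:
            # Only a negation shortly BEFORE the cue flips it.
            return -1 if reach > 0 else 1
        else:
            reach = max(0, reach - 1)
    return 0

print(cue_polarity("It is not important to read"))  # negation precedes the cue
print(cue_polarity("It is important not to read"))  # negation follows the cue
```

The two sentences contain the same words, but only the second keeps its positive cue, because its "not" comes after "important".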
Sentence weaving
Besides word choice, further cues can be found at sentence level:
• Headers are rarely followed by an important point, as they just stated it themselves.
• A major point, such as a recommendation, tends to be followed by a sentence with valuable elaboration.
• A sentence ending in a colon is not important itself: It announces that the point follows.
• A question is just a prelude to the point that the writer wants to drive home in the next sentence.
• Cues in sentences that contain references like “the following” reflect
the importance of other sentences, rather than their own.
• Sentences of fewer than 10 words are usually transitions or afterthoughts, unless word choice tells otherwise.
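Some of these sentence-level cues can be sketched as adjustments on top of word-choice scores. The numeric scale (positive = important, 0 = neutral, negative = unimportant), the hand-off rule and the function name are assumptions for illustration, and only the short-sentence, question and announcement cues are covered.

```python
def apply_sentence_cues(sentences, word_scores, min_words=10):
    """Adjust per-sentence word-choice scores using sentence-level cues."""
    scores = list(word_scores)

    # Short sentences without a positive word-choice cue count as transitions.
    for i, sentence in enumerate(sentences):
        if len(sentence.split()) < min_words and scores[i] <= 0:
            scores[i] = -1

    # A sentence ending in ':' or '?' announces the point: it hands its
    # importance to the next sentence and becomes neutral itself.
    for i, sentence in enumerate(sentences):
        if sentence.rstrip().endswith((":", "?")):
            if i + 1 < len(sentences):
                scores[i + 1] = max(scores[i + 1], scores[i], 1)
            scores[i] = 0
    return scores

sentences = [
    "So what is the problem with statistics?",
    "They miss the conclusion.",
    "Anyway.",
]
print(apply_sentence_cues(sentences, [1, 0, 0]))
```

Here the question scored positively on word choice ("problem"), but it is the short answer after it that ends up marked as important.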
Along with these cues one should always observe context: If an
important sentence begins with a reference like “This”, then the
preceding sentence also needs to be included in order to make sense,
even if it was otherwise ignorable. Conversely, if the preceding
sentence can be omitted without loss of context, link words like “But”,
“nevertheless”, and “also” should be removed to avoid confusion in the
summary.
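Both context rules can be sketched in a few lines. The reference and link word lists are assumptions that extend the examples above, and `compose` is a hypothetical name.

```python
import re

REFERENCES = {"this", "these", "that", "those"}  # assumed list
LINK_RE = re.compile(r"^(?:but|nevertheless|also)\b,?\s+", re.IGNORECASE)

def first_word(sentence):
    match = re.match(r"[A-Za-z]+", sentence)
    return match.group(0).lower() if match else ""

def strip_link_word(sentence):
    stripped = LINK_RE.sub("", sentence, count=1)
    return stripped[:1].upper() + stripped[1:]

def compose(sentences, selected):
    """Build the summary from the indices of the sentences judged important."""
    keep = set(selected)
    # Walk backwards so that chains of references pull in all they need.
    for i in range(len(sentences) - 1, 0, -1):
        if i in keep and first_word(sentences[i]) in REFERENCES:
            keep.add(i - 1)
    summary = []
    for i in sorted(keep):
        sentence = sentences[i]
        # The sentence this one linked to was omitted: drop the link word.
        if i > 0 and (i - 1) not in keep:
            sentence = strip_link_word(sentence)
        summary.append(sentence)
    return summary

sentences = [
    "The add-on reads the page.",
    "It was easy to build.",
    "But the filter must be tuned.",
    "Tuning takes weeks.",
    "This is the main problem.",
]
print(compose(sentences, {2, 4}))
```

In this example the sentence before "This is the main problem." is pulled in for context, while the dangling "But" is removed because the sentence it contrasted with was omitted.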
Story flow and the lack thereof
Summarisation methods that are based on well formatted academic
text sensibly assume that the first and last sentences of paragraphs
are of particular importance, as they tend to follow a basic story arc:
Introduction -> problem -> obstacles -> climax -> resolution.
Online articles however feature considerably shorter paragraphs, so that
in practice the first sentence has an equal chance of being a trivial
introduction or an important problem statement. Some paragraphs are just
blockquotes or filler contents, and sometimes the “resolution” of the
arc is postponed to entice further reading, as the entire article is a
story arc itself.
But worst of all, many online articles have the dreadful habit of
making every two sentences into a paragraph of their own. Perhaps
because it creates more room for sidebar advertisements.
While I initially awarded some default importance to first and last
sentences, I found that word choice is such an abundantly present cue
that it is a more dependable indicator. Not every blogger is a good
writer, after all. The frequent abuse of paragraph breaks also forced me
to take a different approach in composing the summary: Breaks are only
inserted if the next paragraph contains a highly important point of its
own, otherwise it is considered a continuation. This greatly improved
readability.
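A minimal sketch of that paragraph rule, assuming each retained sentence carries a numeric importance and that a score of 2 or more counts as "highly important". Both the scale and the function name are hypothetical.

```python
def join_paragraphs(paragraphs, high=2):
    """paragraphs: list of paragraphs, each a list of (sentence, importance)
    pairs that were already selected for the summary."""
    blocks = []
    for paragraph in paragraphs:
        if not paragraph:
            continue  # nothing from this paragraph made it into the summary
        text = " ".join(sentence for sentence, _ in paragraph)
        has_major_point = any(score >= high for _, score in paragraph)
        if blocks and not has_major_point:
            # No major point of its own: treat it as a continuation.
            blocks[-1] += " " + text
        else:
            blocks.append(text)
    return "\n\n".join(blocks)

paragraphs = [
    [("A is the goal.", 2)],
    [("It needs B.", 1)],
    [("C is the real problem.", 2), ("D follows.", 1)],
]
print(join_paragraphs(paragraphs))
```

The second paragraph is merged into the first, and a break is only kept before the paragraph that introduces the next major point.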
Conclusion
The resulting summariser add-on typically reduces well-written articles
to 40 – 50% of their original length, and flimsy content to 20 – 30%. With my approach the
summary cannot be constrained to a preset length, but a future
improvement could be to add an adjustable setting to only include
sentences of the highest levels of importance, such as conclusions only.
Another inherent effect of my approach is that if the writer makes
the same point twice, the summary will also include it twice. While
technically correct, this could be amended by comparing sentences for
repeated strings of words, and ideally synonyms as well.
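One way such a comparison could work is to measure how many three-word strings a sentence shares with a sentence that was already kept. This is a sketch of the proposed improvement, not something the add-on does; the 50% threshold and the function names are assumptions, and synonyms are not handled.

```python
import re

def ngrams(sentence, n=3):
    # All runs of n consecutive words, lowercased.
    words = re.findall(r"[a-z']+", sentence.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def drop_repeats(sentences, n=3, threshold=0.5):
    """Drop a sentence if at least `threshold` of its word strings
    already occurred in a sentence that was kept earlier."""
    kept, seen = [], []
    for sentence in sentences:
        grams = ngrams(sentence, n)
        duplicate = any(
            grams and len(grams & previous) / len(grams) >= threshold
            for previous in seen
        )
        if not duplicate:
            kept.append(sentence)
            seen.append(grams)
    return kept

summary = [
    "Statistical summarisers miss the point of the article.",
    "Word choice is a dependable cue.",
    "In short, statistical summarisers miss the point of the article entirely.",
]
print(drop_repeats(summary))
```

The third sentence shares six of its nine three-word strings with the first, so only the first two are kept.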
In conclusion, I should say that my summariser is not necessarily
“better” than statistical summarisers, but different, in that it
specifically searches for the main points that the writer wanted to get
across, rather than retrieving the general subject matter. This may suit
other users as well as it does me, and I hope that many will find it
contributes to a better internet experience.
You can install free Chrome and Firefox versions from their web stores:
Below is an example summary, skipping trivia and retrieving the key announcement: