Authors
John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, Thomas Schaaf
At 3M Health Information Systems (HIS), we have been exploring the use of modern text generation models to help with clinical documentation tasks. These models are capable of producing fluent and coherent text. However, they are still prone to various forms of “hallucination”, generating statements that are not supported by their input. This is one of the key challenges that must be overcome before these models can be deployed at scale, and so we have seen growing interest in being able to accurately and automatically measure the degree to which machine-generated output is non-factual. We tackle this problem in our recent paper "Revisiting text decomposition methods for NLI-based factuality scoring of summaries", published at the Generation, Evaluation & Metrics (GEM) Workshop at EMNLP 2022 (and listed as one of the Outstanding Papers). Specifically, we deal with the question of how to detect factual inconsistency in machine-generated summaries, using only the input text as a reference (we don't rely on having additional human-written reference summaries).
Despite many well-known drawbacks, ROUGE is still the most common automatic metric for summarization. It measures the overlap of n-grams between reference and hypothesis/candidate summaries. ROUGE was originally defined as a recall measure, but ROUGE F1 is often used in practice. To highlight some of the issues with relying on ROUGE alone to evaluate our summarization systems, we'll use the following example:
Here we have a reference sentence, and three possible machine-generated hypotheses to evaluate against this reference. On the right we see the corresponding ROUGE 1/2 F1 scores. The colors indicate the factual accuracy of each span, with green being correct, red being incorrect, and yellow being ambiguous or subjective. We see that hypothesis one is largely correct and has the highest ROUGE scores. However, hypothesis two, the second-highest-scoring hypothesis, is largely incorrect. Hypothesis three is mostly correct, with the phrase "several times a week" perhaps being subjective, but as its wording is slightly different from the reference, it scores worst in terms of ROUGE. So we can clearly see here that the factuality of the hypotheses is not captured at all by ROUGE. This is not a new finding; a quote from this survey paper highlights the issue:
"similarity-based evaluations reward surface similarity at the expense of meaning and may be “fooled” by similar-looking, yet semantically different, outputs"
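For readers who want to reproduce this kind of comparison, ROUGE-1/2 F1 can be computed with the open-source rouge-score package. A minimal sketch follows; the reference and hypothesis strings here are placeholders rather than the exact sentences from our example.

```python
# Minimal sketch: computing ROUGE-1/ROUGE-2 F1 with the rouge-score package.
# The reference/hypothesis strings are placeholders, not the ones from our example.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
reference = "The patient reports his diet has not been good recently."
hypothesis = "The patient says his diet is poor."

scores = scorer.score(reference, hypothesis)
for name, result in scores.items():
    # Each result holds precision, recall and fmeasure fields.
    print(f"{name}: F1 = {result.fmeasure:.3f}")
```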
One promising approach to measuring factuality is based around Natural Language Inference or NLI. In the typical NLI setup, a model is presented with a pair of sentences, and outputs a distribution over the classes of {entailment, neutral, contradiction}. As NLI seems conceptually similar to factuality scoring (and is now quite a well-studied problem), several prior studies have asked “can we reuse NLI models for factuality scoring? And if so, how?”
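Before looking at how these models can be repurposed, here is a minimal sketch of what a single NLI call looks like in practice. The checkpoint roberta-large-mnli and the premise/hypothesis strings are illustrative choices, not part of our method, and the label ordering is model-specific, so it is read from the model config.

```python
# Minimal sketch of a single NLI call with an off-the-shelf Hugging Face model.
# "roberta-large-mnli" is an illustrative checkpoint; label order differs between
# checkpoints, so we read it from the model config rather than hard-coding it.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def nli_probs(premise: str, hypothesis: str) -> dict:
    """Return a {label: probability} distribution for one premise/hypothesis pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    return {label: probs[i].item() for i, label in model.config.id2label.items()}

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) for one premise/hypothesis pair."""
    probs = nli_probs(premise, hypothesis)
    # Label casing varies across checkpoints (e.g. "ENTAILMENT" vs "entailment").
    return next(v for k, v in probs.items() if k.lower() == "entailment")

print(nli_probs("His weight went up six pounds.", "The patient gained weight."))
```

The entailment_prob helper defined here is reused in the later sketches.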
One way to use NLI models for factuality scoring is to set the NLI context or "premise" to be the full input text, with the summary forming the NLI hypothesis, and then take the factuality score to be some function f of the model output distribution. There are some potential problems with this approach:
Another way to use NLI for factuality scoring is what we refer to here as “decomposition-based” scoring, which was introduced in 2019 by Falke et al. It can be explained as follows:
First, we decompose the document and summary into sentences, here labelled D1...DM and S1...SN respectively.
Then, all sentence pairs are passed through an NLI model, and we extract the probability of the entailment class, producing an MxN matrix.
From here, we just need some way to collapse this matrix into a single value to create the factuality score. Falke et al. suggested taking the max over the columns, effectively selecting the strongest evidence in favor of each summary sentence. Finally, we take the average of this 1xN vector of scores to produce the final score.
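Putting these steps together, here is a minimal sketch of decomposition-based scoring in the style of Falke et al., reusing the entailment_prob helper from the NLI sketch above. Sentence splitting with NLTK is an illustrative choice, not necessarily what the original work used.

```python
# Minimal sketch of decomposition-based factuality scoring (Falke et al., 2019 style),
# reusing entailment_prob from the earlier NLI sketch.
import numpy as np
from nltk.tokenize import sent_tokenize  # may require nltk.download("punkt")

def decomposition_score(document: str, summary: str) -> float:
    doc_sents = sent_tokenize(document)   # D_1 ... D_M
    sum_sents = sent_tokenize(summary)    # S_1 ... S_N
    # M x N matrix of entailment probabilities, one row per document sentence.
    E = np.array([[entailment_prob(d, s) for s in sum_sents] for d in doc_sents])
    # Max over each column keeps the strongest evidence for each summary sentence;
    # averaging those N values gives the final factuality score.
    return float(E.max(axis=0).mean())
```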
We note that these techniques are theoretically agnostic to the choice of the specific NLI model. However, at the time of publication in 2019, Falke et al. concluded that "current NLI models are not yet robust enough for our downstream task".
In our recent paper, we revisited this decomposition-based scoring idea, now using a newer set of NLI models, and inspired by other recent studies by Laban et al. and Schuster et al. We found that recent NLI models can indeed perform significantly better at this task than models created even just a couple of years ago. Given these improvements, we then asked whether sentence-level decomposition is still the best way to make use of NLI models for factuality scoring. Concretely, we proposed methods for decomposing the input into units between a single sentence and the full document, and into units shorter than a sentence. We describe both ideas below.
Longer contexts: Top-K
As we are working with summaries, which compress and aggregate the input in some way, we can expect there to be instances where more than one document sentence is needed to correctly measure the factuality of a single summary statement. We propose a middle ground between using the full document as the premise and using individual sentences, which we call Top-K. It is computed as follows:
First, we decompose the document and summary into sentences as before and score all sentence pairs using the NLI model.
Then, for each summary sentence, we rank the document sentences by P(entail), and concatenate the top K of these to form a new premise string that we denote σn here. We then rescore each summary sentence using these concatenated contexts as the premise, and finally take the average of the entailment scores as the factuality score.
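A minimal sketch of this Top-K premise construction, again reusing the entailment_prob helper from earlier; the value of K and joining the selected sentences with spaces (in document order) are illustrative choices rather than details taken from the paper.

```python
# Minimal sketch of Top-K premise selection, reusing entailment_prob from above.
import numpy as np
from nltk.tokenize import sent_tokenize

def topk_score(document: str, summary: str, k: int = 3) -> float:
    doc_sents = sent_tokenize(document)
    sum_sents = sent_tokenize(summary)
    scores = []
    for s in sum_sents:
        # Rank document sentences by P(entail) for this summary sentence.
        pair_scores = np.array([entailment_prob(d, s) for d in doc_sents])
        top_idx = sorted(np.argsort(pair_scores)[::-1][:k])
        # Concatenate the top-K sentences into a new premise (sigma_n in the text)
        # and rescore the summary sentence against it.
        premise = " ".join(doc_sents[i] for i in top_idx)
        scores.append(entailment_prob(premise, s))
    # Average the entailment scores to get the factuality score.
    return float(np.mean(scores))
```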
Shorter contexts: SCUs
To assess levels of granularity that are smaller than sentences, we first considered how people do content-based evaluation of summaries. The Pyramid Method (2004) is still thought of as one of the gold standards for human evaluation of summary content, and so this was our starting point. It's also based on the idea of granular decomposition, although this time into single statements of fact referred to as Summary Content Units or SCUs, which can then be extracted and compared between summaries and references.
Here is an example of SCU transformation:
“His weight went up six pounds and he report his diet is not good.”
could be mapped to SCUs:
As we have no gold SCU data for our evaluation datasets, we approximate SCU decomposition using an automatic method proposed by Zhang and Bansal (2021). Their approach is based on semantic role labelling with some additional rules/heuristics. Their code is open source, and we used it without modification.
Here is how our example looks when we run it through the SCU approximation:
“His weight went up six pounds and he report his diet is not good.”
using the automatic method, is mapped to SCUs:
The output in this case is more verbose, and probably less correct, although there is still some subjectivity in judging this. For example, in some cases it may be desirable to distinguish between capturing the fact that weight went up, and that it went up by a specific amount.
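However the SCUs are produced (manually, or with the automatic approximation above), they slot into the same NLI scoring pipeline: the SCU strings simply replace summary sentences as the hypotheses. A minimal sketch, again reusing the entailment_prob helper from the earlier NLI sketch; SCU extraction itself is assumed to have been done by an external tool and is not shown.

```python
# Minimal sketch of SCU-level scoring: each SCU becomes an NLI hypothesis, scored
# against every document sentence, keeping the strongest evidence per SCU.
import numpy as np
from nltk.tokenize import sent_tokenize

def scu_score(document: str, scus: list[str]) -> float:
    doc_sents = sent_tokenize(document)
    # Strongest supporting document sentence for each SCU, then average.
    per_scu = [max(entailment_prob(d, scu) for d in doc_sents) for scu in scus]
    return float(np.mean(per_scu))
```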
We evaluated our methods on the SummaC benchmark, which comprises the following six datasets: CoGenSumm, XSumFaith, Polytope, FactCC, SummEval, and FRANK. All are broadly from the same domain, namely English news articles. SummaC standardizes evaluation by casting each task as binary classification and measuring performance using balanced accuracy.
We note that one of the stated aims of the FRANK dataset was to go beyond binary factuality metrics, as the authors mention that binary scores can be difficult for people to agree on. They therefore included more granular factuality scores, where the factuality of a summary is a value between 0 and 1. We also evaluated our methods on the FRANK dataset using its original metrics (two correlation coefficients) to see if this leads to different conclusions.
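For concreteness, both evaluation setups reduce to standard metrics. The sketch below assumes a list of metric scores and gold labels; the fixed threshold is a simplification (SummaC selects thresholds using validation data), and Pearson/Spearman are used as illustrative correlation coefficients.

```python
# Minimal sketch of the two evaluation protocols: balanced accuracy for the binary
# SummaC setup, and correlation with graded human judgements for FRANK.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import balanced_accuracy_score

def summac_style_eval(scores, labels, threshold=0.5):
    """Binary setup: summaries scoring above the threshold are predicted consistent."""
    preds = (np.asarray(scores) > threshold).astype(int)
    return balanced_accuracy_score(labels, preds)

def frank_style_eval(scores, human_factuality):
    """Graded setup: correlate metric scores with human factuality ratings in [0, 1]."""
    return pearsonr(scores, human_factuality)[0], spearmanr(scores, human_factuality)[0]
```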
For a granular breakdown of our results, we refer people to the paper, but at a high-level our findings are:
In terms of our earlier example, we show below that our NLI-based scoring methods produce results that are more closely aligned with our intuitions about the quality of the three hypotheses. On the right of the table we see the results of using our Top-K NLI premise with sentence-level NLI hypotheses. We see that hypothesis two is correctly ranked lowest (by far) of the three hypotheses in terms of factuality, and that hypotheses one and three are scored as approximately the same.
We revisited recent work on using NLI models to do factuality scoring of summaries, and found that techniques based on decomposing documents and summaries into finer levels of granularity work well (although there is still room for improvement). We proposed a new way to select context for scoring when dealing with longer input documents, which holds up well across evaluation on six different datasets. In general, however, we find that there is no "correct" level of granularity for all tasks, and we still see considerable variation in performance across different datasets. So, we note that care must be taken when assessing what to use for your downstream task of interest. So far we see no additional performance benefit from going below the sentence level and using SCUs, but SCU decomposition does perform competitively across both benchmarks (and as SCUs are more granular, they may be more interpretable in some cases).