• Revisiting text decomposition methods for NLI-based factuality scoring of summaries

    Authors

    John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, Thomas Schaaf

    Full paper (PDF, 502 KB)


    • Introduction

      At 3M Health Information Systems (HIS), we have been exploring the use of modern text generation models to help with clinical documentation tasks. These models are capable of producing fluent and coherent text. However, they are still prone to various forms of “hallucination”, generating statements that are not supported by their input. This is one of the key challenges that must be overcome before these models can be deployed at scale, and so we have seen a growing interest in being able to measure, accurately and automatically, the degree to which machine-generated output is non-factual. We tackle this problem in our recent paper "Revisiting text decomposition methods for NLI-based factuality scoring of summaries", published at the Generation, Evaluation & Metrics (GEM) Workshop at EMNLP 2022 (and listed as one of the Outstanding Papers). Specifically, we deal with the question of how to detect factual inconsistency in machine-generated summaries, using only the input text as a reference (we don't rely on having additional human-written reference summaries).

    • Automatic evaluation of machine generated summaries

      Despite its many well-known drawbacks, ROUGE is still the most common automatic metric for summarization. It measures n-gram overlap between reference and hypothesis (candidate) summaries. ROUGE was originally defined as a recall measure, but we often use ROUGE F1. To highlight some of the issues with using ROUGE alone to evaluate our summarization systems, we'll use the following example:

    Reference sentence: His weight went up 6 lbs and he reports his diet is not good. He orders take out, binge eat 4-5 times a week, overeats, eats for comfort.

    Here we have a reference sentence and three possible machine-generated hypotheses to evaluate against it. On the right we see the corresponding ROUGE-1/2 F1 scores. The colors indicate the factual accuracy of each span: green is correct, red is incorrect, and yellow is ambiguous or subjective. We see that hypothesis one is largely correct and has the highest ROUGE scores. However, hypothesis two, the second highest-scoring hypothesis, is largely incorrect. Hypothesis three is mostly correct, with the phrase "several times a week" perhaps being subjective, but as its wording differs slightly from the reference it scores worst in terms of ROUGE. So we can clearly see that the factuality of the hypotheses is not captured at all by ROUGE. This is not a new finding; a quote from this survey paper highlights the issue:

    "similarity-based evaluations reward surface similarity at the expense of meaning and may be “fooled” by similar-looking, yet semantically different, outputs"

    • Factuality scoring based on Natural Language Inference

      One promising approach to measuring factuality is based around Natural Language Inference or NLI. In the typical NLI setup, a model is presented with a pair of sentences, and outputs a distribution over the classes of {entailment, neutral, contradiction}. As NLI seems conceptually similar to factuality scoring (and is now quite a well-studied problem), several prior studies have asked “can we reuse NLI models for factuality scoring? And if so, how?”

      One way to use NLI models for factuality scoring is to set the NLI context or "premise" to be the full input text, with the summary forming the NLI hypothesis, and then take the factuality score to be some function ƒ of the model output distribution (a minimal sketch of this setup is shown after the list below). There are some potential problems with this approach:
       

      • NLI models are usually trained with sentence pairs as input and can suffer performance degradation with the longer contexts that arise in summarization.
      • The majority of modern NLI models are based on architectures such as the Transformer that use fixed-length input sizes, and it may not be possible for a full document and summary pair to fit into this context.
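      For illustration, here is a minimal sketch of this full-document-premise setup using a generic Hugging Face NLI checkpoint. The model name, its label ordering, and the reliance on truncation are assumptions for the sketch, not the exact configuration from the paper; the truncation step is exactly where the fixed-length limitation above bites.

```python
# Minimal sketch: full document as NLI premise, summary as hypothesis.
# The checkpoint and its label order are assumptions; check model.config.id2label
# for whichever NLI model you actually use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed checkpoint; any NLI model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) for a single (premise, hypothesis) pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                       truncation=True)  # long documents are simply truncated here
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    entail_idx = model.config.label2id.get("ENTAILMENT", 2)  # label order varies by checkpoint
    return probs[entail_idx].item()

document = "...full input document..."
summary = "...machine-generated summary..."
factuality_score = entailment_prob(document, summary)
```

      We reuse this entailment_prob() helper in the later sketches.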

      Another way to use NLI for factuality scoring is what we refer to here as “decomposition-based” scoring, which was introduced in 2019 by Falke et al. It can be explained as follows:

      First, we decompose the document and summary into sentences, here labelled D1,...,DM and S1,...,SN respectively.

    Step 1: Decompose documents and summaries into sentences

    Then, all sentence pairs are passed through an NLI model, and we extract the probability of the entailment class, producing an MxN matrix.

    Step 2: Score all sentence pairs using NLI

    From here we just need some way to collapse this matrix into a single value to create the factuality score. Falke et al. suggested taking, for each summary sentence, the max over the document sentences (the column-wise max), effectively selecting the strongest evidence in favor of each summary sentence and producing a 1xN matrix. Finally, we take the average of this 1xN matrix to produce the final score.

    Step 3: Aggregate scores to produce a final factuality score
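    As a rough illustration of Steps 1-3, the sketch below reuses the entailment_prob() helper from the earlier snippet. The NLTK sentence splitter is an assumption for the sketch; any sentence tokenizer would do.

```python
# Minimal sketch of decomposition-based scoring (Steps 1-3 above),
# reusing entailment_prob() from the previous sketch.
import numpy as np
from nltk.tokenize import sent_tokenize  # assumes nltk.download("punkt") has been run

def decomposition_score(document: str, summary: str) -> float:
    doc_sents = sent_tokenize(document)   # D_1, ..., D_M
    sum_sents = sent_tokenize(summary)    # S_1, ..., S_N

    # Step 2: M x N matrix of P(entailment) for every (document, summary) sentence pair
    E = np.array([[entailment_prob(d, s) for s in sum_sents] for d in doc_sents])

    # Step 3: for each summary sentence take the max over document sentences
    # (the strongest supporting evidence), then average over summary sentences
    return float(E.max(axis=0).mean())
```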

    We note that these techniques are theoretically agnostic to the choice of the specific NLI model. However, at the time of publication in 2019, Falke et al. concluded that "current NLI models are not yet robust enough for our downstream task".

    • Revisiting decomposition-based factuality scoring

      In our recent paper, we revisited this decomposition-based scoring idea, now using a newer set of NLI models, and inspired by other recent studies by Laban et al. and Schuster et al. We found that recent NLI models can indeed perform significantly better at this task than models created even just a couple of years ago. Given these improvements, we then asked whether this sentence-level decomposition is still the best way to make use of NLI models for factuality scoring. Concretely, we proposed methods for decomposing the input into units between a sentence and the full document, and into units shorter than a sentence. We describe both ideas below.

      Longer contexts: Top-K
      As we are working with summaries, which compress and aggregate the input in some way, we can expect there to be instances where more than one document sentence is needed to correctly measure the factuality of a single summary statement. We propose a middle ground between using the full document as the premise and sentence-level decomposition, which we call Top-K. It is computed as follows:

      First, we decompose the document and summary into sentences as before and score all sentence pairs using the NLI model.

    Step 1: Decompose documents and summaries into sentences

    Step 2: Score all sentence pairs using NLI

    Then, for each summary sentence, we rank the document sentences by P(entail), and concatenate the top K of these to form a new premise string that we denote σn here. We then rescore each summary sentence using these concatenated contexts as the premise, and finally take the average of the entailment scores as the factuality score.

    Step 3: For each Sn, rank D1,...,DM by P(Entail), concatenate the Top-K sentences to form a new premise string σn; Step 4: Rescore using NLI
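    Here is a minimal sketch of Top-K, again reusing entailment_prob() and NLTK sentence splitting. The value of K and the choice to keep the concatenated sentences in document order are illustrative choices for the sketch; K should be treated as a parameter to tune.

```python
# Minimal sketch of Top-K scoring (Steps 1-4 above),
# reusing entailment_prob() from the earlier sketch.
import numpy as np
from nltk.tokenize import sent_tokenize

def top_k_score(document: str, summary: str, k: int = 3) -> float:
    doc_sents = sent_tokenize(document)   # D_1, ..., D_M
    sum_sents = sent_tokenize(summary)    # S_1, ..., S_N

    scores = []
    for s in sum_sents:
        # Steps 1-2: score S_n against every document sentence
        pairwise = [entailment_prob(d, s) for d in doc_sents]
        # Step 3: concatenate the Top-K document sentences into a new premise sigma_n
        top_idx = np.argsort(pairwise)[::-1][:k]
        sigma_n = " ".join(doc_sents[i] for i in sorted(top_idx))  # keeping document order is a choice here
        # Step 4: rescore S_n against sigma_n
        scores.append(entailment_prob(sigma_n, s))

    return float(np.mean(scores))
```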

      Shorter contexts: SCUs
      To assess levels of granularity smaller than a sentence, we first considered how people do content-based evaluation of summaries. The Pyramid Method (2004) is still thought of as one of the gold standards for human evaluation of summary content, and so this was our starting point. It is also based on the idea of granular decomposition, although this time into singular statements of fact referred to as Summary Content Units (SCUs), which can then be extracted and compared between summaries and references.

      Here is an example of SCU transformation:

      “His weight went up six pounds and he reports his diet is not good.”

      could be mapped to SCUs:
       

      1. “His weight went up 6 lbs.”
      2. “He reports his diet is not good.”

      As we have no gold SCU data for our evaluation datasets, we approximate SCU decomposition using an automatic method proposed by Zhang and Bansal in 2021. Their approach is based on semantic role labelling with some additional rules/heuristics. Their code is open source, and we used it without modification.

      Here is how our example looks when we run it through the SCU approximation:

      “His weight went up six pounds and he reports his diet is not good.”

      using the automatic method, is mapped to SCUs:
       

      1. “His weight went up.”
      2. “His weight went 6 lbs.”
      3. “He reports his diet is not good”
      4. “His diet is not good”

      The output in this case is more verbose, and probably less correct, although there is still some subjectivity in judging this. For example, in some cases it may be desirable to distinguish between capturing the fact that weight went up, and that it went up by a specific amount.
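      To show where SCUs slot into the scoring pipeline, here is a heavily simplified sketch. The extract_scus() function below is only a toy stand-in (naive splitting on commas and "and"); in practice we use Zhang and Bansal's SRL-based code unmodified, which is not reproduced here.

```python
# Minimal sketch of SCU-level scoring. extract_scus() is a toy stand-in for
# Zhang and Bansal's SRL-based decomposition; the aggregation mirrors the
# earlier sentence-level sketch, with SCUs as the NLI hypotheses.
import re
import numpy as np
from nltk.tokenize import sent_tokenize

def extract_scus(sentence: str) -> list:
    """Toy stand-in: naively split on commas and 'and' instead of using SRL."""
    parts = re.split(r",|\band\b", sentence)
    return [p.strip() for p in parts if p.strip()]

def scu_level_score(document: str, summary: str) -> float:
    doc_sents = sent_tokenize(document)
    scus = [scu for s in sent_tokenize(summary) for scu in extract_scus(s)]

    # Same max-then-average aggregation, but with SCUs as the hypotheses
    E = np.array([[entailment_prob(d, scu) for scu in scus] for d in doc_sents])
    return float(E.max(axis=0).mean())
```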

    • Experiments and evaluation

      We evaluated our methods on the SummaC benchmark, which comprises the following six datasets: CoGenSumm, XSumFaith, Polytope, FactCC, SummEval, and FRANK. All are broadly from the same domain, namely English news articles, and SummaC standardizes evaluation by casting each task as binary classification and then measuring performance using balanced accuracy.

      We note that one of the stated aims of the FRANK dataset was to go beyond binary factuality metrics, as the authors mention that binary scores can be difficult for people to agree on. And so, they included more granular factuality scores, where the factuality of a summary is a value between 0 and 1. We also evaluated our methods on the FRANK dataset using their original metrics (Pearson and Spearman correlation coefficients) to see if this leads to different conclusions.
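      For concreteness, the snippet below sketches both evaluation setups with placeholder numbers: thresholded binary classification scored with balanced accuracy (as in SummaC), and Pearson/Spearman correlation against human judgments (as in the original FRANK evaluation). The threshold and the arrays are illustrative only; in practice a threshold would be tuned on held-out data.

```python
# Minimal sketch of the two evaluation setups, with placeholder data.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import balanced_accuracy_score

model_scores = np.array([0.92, 0.15, 0.78, 0.40])  # illustrative factuality scores
binary_labels = np.array([1, 0, 1, 0])             # illustrative consistent/inconsistent labels
human_scores = np.array([0.9, 0.2, 0.7, 0.5])      # illustrative FRANK-style scores in [0, 1]

# SummaC-style: threshold the score and report balanced accuracy
threshold = 0.5  # placeholder; a real threshold would be tuned on a validation split
predictions = (model_scores >= threshold).astype(int)
print("Balanced accuracy:", balanced_accuracy_score(binary_labels, predictions))

# FRANK-style: correlation with granular human factuality scores
print("Pearson:", pearsonr(model_scores, human_scores)[0])
print("Spearman:", spearmanr(model_scores, human_scores)[0])
```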

      For a granular breakdown of our results, we refer readers to the paper, but at a high level our findings are:
       

      • Using newer NLI models (such as the VitaminC-MNLI model) leads to strong and robust performance gains in all evaluation settings.
      • It's generally better to use sentence-level decomposition for the summary side of the NLI pair.
      • On SummaC, we do better with longer contexts on the premise side. Using as much of the document as we can does best, followed by our Top-K method.
      • On FRANK using the original metrics, we see a noticeable drop in performance when using the full document as the premise/context. Our Top-K method does best in terms of Pearson correlation, with sentence-level being the strongest in terms of Spearman.
      • On the benchmark tasks, we see no performance improvement when going below the sentence level and using approximate SCUs.

      In terms of our earlier example, we show below that when using our NLI-based scoring methods, we can now produce results that are more closely aligned with our intuitions about the quality of the three hypotheses. On the right of the table we see the results of using our Top-K NLI premise and sentence-level NLI hypothesis. We see that hypothesis two is correctly ranked the lowest (by far) of the three hypotheses in terms of factuality, and that hypothesis one and three are scored as being approximately the same.


    • Conclusions

      We revisited recent work on using NLI models to do factuality scoring of summaries and found that techniques based on decomposing documents and summaries into finer levels of granularity work well (although there is still room for improvement). We proposed a new way to select context for scoring when dealing with longer input documents, and it holds up well across evaluation on six different datasets. In general, however, we find that there is no "correct" level of granularity for all tasks, and we still see considerable variation in performance across different datasets. And so, we note that care must be taken when assessing what to use for your downstream task of interest. So far we see no additional performance benefit in going below the sentence level and using SCUs on these benchmarks, but SCU decomposition does perform competitively across both evaluation settings (and as SCUs are more granular, they may be more interpretable in some cases).