In this paper, published in the INLG Proceedings, we investigate two different denoising objectives to pre-train BART, LED, and DialogLED transformer models. We subsequently fine-tune them on a medical conversation dataset with summaries written by medical professionals to obtain automatic summarization models.
Summarizing medical encounters and documents automatically has been a topic of significant interest in recent years (Finley et al., 2018; Joshi et al., 2020; Enarvi et al., 2020; Yim and Yetisgen, 2021; Krishna et al., 2021; Zhang et al., 2021), with transformer models leading the state of the art. However, the domain of the pre-training data for widely available models often differs significantly from the target medical domain, and additional pre-training in a domain related to the fine-tuning task can provide significant benefit (Gururangan et al., 2020).
We demonstrate that additional pre-training with unlabeled doctor-patient conversations improves the downstream performance on the task of generating medical summaries from conversation transcripts. We measure this performance increase using several different automatic metrics.
The pre-training dataset is composed of 83,605 human-transcribed doctor-patient conversations involving doctors from many different specialties. HPI summaries are available for a subset of 1,342 conversations across internal medicine and primary care specialties. On average, there are 17 reference summaries per doctor-patient conversation. The median number of tokens in a conversation is 1,334, and the 95th percentile is approximately 5,120 tokens.
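For concreteness, the short sketch below shows how such length statistics can be computed with a Hugging Face tokenizer; this is not the authors' code, and the checkpoint name and the `transcripts` list are illustrative assumptions.

```python
# Minimal sketch: token-length statistics of conversation transcripts.
import numpy as np
from transformers import AutoTokenizer

# Illustrative checkpoint; any tokenizer matching the model under study works.
tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")

def length_stats(transcripts):
    """Return the median and 95th-percentile token counts of the transcripts."""
    lengths = [len(tokenizer(text)["input_ids"]) for text in transcripts]
    return np.median(lengths), np.percentile(lengths, 95)

# median_len, p95_len = length_stats(transcripts)  # hypothetical list of conversation strings
```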
We pre-train BART (1,024-token limit), LED (5,120-token limit), and DialogLED (5,120-token limit) using two different denoising tasks: text infilling, which masks out random spans of tokens in the conversation and trains the model to reconstruct the original conversation, and window-based denoising, which applies several dialog-specific sources of noise to a contiguous window of the conversation and trains the model to reconstruct that window. Window-based denoising was devised by the authors of the DialogLED paper (Zhong et al., 2021), who continued pre-training LED on long dialog data.
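To make the text-infilling objective concrete, here is an illustrative sketch in the style of BART's span masking. The mask ratio and Poisson-distributed span lengths follow the original BART recipe and are assumptions, not necessarily the exact settings used in our experiments.

```python
# Sketch of text infilling: random token spans are replaced by a single <mask>
# token, and the model is trained to reconstruct the original sequence.
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.0, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)  # roughly how many tokens to corrupt
    masked = 0
    while masked < budget and len(tokens) > 1:
        span_len = min(int(rng.poisson(poisson_lambda)), len(tokens) - 1)
        start = int(rng.integers(0, len(tokens) - span_len))
        # The whole span becomes one mask token; a zero-length span inserts a mask.
        tokens[start:start + span_len] = [mask_token]
        masked += max(span_len, 1)
    return tokens

# noisy = text_infilling("patient reports mild chest pain since last week".split())
# Model input: the noisy tokens; training target: the original, uncorrupted tokens.
```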
Other than the differences in sequence length, we fine-tune all models identically on the dataset containing HPI summaries and evaluate the generated summaries using several automatic metrics: ROUGE; UMLS concept-based evaluation, which extracts relevant strings from a summary and matches them to clinical concepts in the UMLS database; and NER concept-based evaluation, which uses a clinical named entity recognition (NER) model to predict the clinical concepts in a summary.
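As a rough illustration of how the concept-based metrics score a summary, the sketch below computes F1 over sets of extracted concepts. Here `extract_concepts` is a placeholder for either the UMLS concept linker or the clinical NER model, and the exact matching rules in the paper may differ.

```python
# Sketch of concept-based scoring: F1 over the clinical concepts found in the
# generated summary versus those found in the reference summary.
def concept_f1(generated_concepts: set, reference_concepts: set) -> float:
    if not generated_concepts or not reference_concepts:
        return 0.0
    true_positives = len(generated_concepts & reference_concepts)
    precision = true_positives / len(generated_concepts)
    recall = true_positives / len(reference_concepts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# score = concept_f1(extract_concepts(generated_summary), extract_concepts(reference_summary))
```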
In all cases, we find that ROUGE F1 scores improve with additional in-domain pre-training, which indicates that pre-training leads to improved overlap between the generated and reference summaries. We also find that in-domain pre-training almost always improves both concept-based metrics. LED-large pre-trained with the window-based denoising task tends to perform best as measured by ROUGE, while DialogLED-large pre-trained with the text-infilling task leads to the strongest models in terms of the concept-based scores.
We also see a clear benefit of using long-sequence transformers (LED and DialogLED) over frequently used models like BART. The following figure illustrates that the drop in performance on long conversations is much less significant with LED-large than it is with BART-large.
The differences also become clear when considering some examples (below). While it is not always easy to see that the output of an in-domain pre-trained model is better than that of the corresponding baseline model, the differences between the long-sequence LED model and BART are clear: BART misses two important concepts because they occur late in the conversation, beyond its 1,024-token input limit.
While all models produce fluent text, they make some additional errors. For example, they confuse how long a medication has been taken, and they can mix up who said what in the conversation (e.g., “Tylenol or Advil” and “vitamin D” were mentioned by the doctor, not the patient). Interestingly, while high cholesterol is part of the patient’s history, it is not the main reason for the visit; BART incorrectly assumes that it is, whereas pre-trained LED-large identifies the reason for the visit correctly.
While further improvements are necessary before automatic summarization models can be reliably employed in clinical settings, in-domain pre-training is a useful strategy for improving summarization quality. Furthermore, long-sequence transformers are clearly beneficial for extracting more of the relevant concepts from long conversations, which is not possible with conventional transformer models.