Authors
Jing Su, Longxiang Zhang, Hamid Reza Hassanzadeh, Thomas Schaaf
3M HIS scribe research team
Doctor-Patient Conversations (DoPaCos) exhibit a natural topic-segmented structure: doctors tend to guide the topics in a loosely stable order (e.g., diagnosis followed by assessment and plan) to obtain the information they need. We propose an extract-and-abstract approach that leverages this topical structure to automatically summarize the History of Present Illness (HPI) section of a clinical note from DoPaCos. We show that removing distracting information from DoPaCos improves summarization, and we present an easy-to-implement pipeline design for our approach. Figure 1 shows the extract-and-abstract architecture that we have implemented.
Figure 1: On the left, a doctor and patient converse. The next column shows a transcript in which the conversation relevant to clinical note creation is highlighted. The highlighted elements are extracted by an ML model, shown in the next column. Another model abstracts the content to create SOAP notes, which consist of Subjective, Objective, Assessment and Plan sections.
We use two disjoint datasets (Dataset A and Dataset E) in this paper. Both contain DoPaCo transcripts. The difference is that DoPaCos in Dataset A are paired with HPI summaries, while Dataset E comes with a label assigned to each utterance in every conversation. Each utterance is associated with one of four span labels: ihpi (inclusive HPI), pe (Physical Examination), a&p (Assessment & Plan), or none (empty label). “Span” refers to the fact that these labels almost always span multiple consecutive utterances.
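To make the annotation scheme concrete, the sketch below shows one way a labeled conversation from Dataset E could be represented in Python. The record layout and the example utterances are purely illustrative, not the actual data schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    """One transcribed turn with its annotated span label (illustrative schema)."""
    speaker: str               # e.g., "DR" or "PT"
    text: str                  # the transcribed utterance
    span_label: Optional[str]  # "ihpi", "pe", "a&p", or None (empty label)

# A toy labeled conversation in the style of Dataset E.
conversation: List[Utterance] = [
    Utterance("DR", "What brings you in today?", "ihpi"),
    Utterance("PT", "I've had chest pain since Monday.", "ihpi"),
    Utterance("DR", "Let me listen to your heart.", "pe"),
    Utterance("DR", "We'll start with a low-dose aspirin.", "a&p"),
    Utterance("PT", "Thanks, doctor.", None),
]
```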
We leverage Dataset E to train an utterance selection model that predicts ihpi utterances in Dataset A. Only the predicted ihpi utterances in Dataset A are then used to fine-tune a summarization model to generate HPI summaries.
Table 1: An example doctor-patient conversation with annotated span labels per utterance. A span label can be ihpi, pe, a&p, or none (empty).
Our proposed extract-and-abstract approach involves two key models: the utterance selection model and the summarization model. Figure 2 shows the pipeline design.
Figure 2: Pipeline of the supervised learning approach to utterance selection and summarization. The annotated span labels in Dataset E are used to train BERT-based classifiers, which are then used to select HPI utterances in Dataset A. After an additional filtering step, we use the set of predicted HPI utterances as source texts to fine-tune the BART summarizer module; in this step the training target is the human-annotated HPI summaries in Dataset A.
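The end-to-end flow can be captured in a few lines. The sketch below treats utterances as plain strings; `selector`, `summarizer`, and the token-budget filter are placeholders for the components described in the following paragraphs, not actual 3M APIs.

```python
def truncate_to_budget(utterances, max_tokens):
    """Direct truncation: keep utterances in order until the token budget is spent."""
    kept, used = [], 0
    for u in utterances:
        n = len(u.split())  # crude whitespace token count, for illustration only
        if used + n > max_tokens:
            break
        kept.append(u)
        used += n
    return kept

def extract_and_abstract(utterances, selector, summarizer, max_tokens=1024):
    # Stage 1 (extract): keep only utterances the selector labels as ihpi.
    ihpi = [u for u in utterances if selector.predict(u) == "ihpi"]
    # Additional filtering step: fit the selection into the summarizer's input budget.
    source = " ".join(truncate_to_budget(ihpi, max_tokens))
    # Stage 2 (abstract): the fine-tuned BART summarizer generates the HPI text.
    return summarizer.generate(source)
```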
The goal of the utterance selection model is to filter utterances unrelated to HPI out of DoPaCos by predicting the span label of each utterance. We explore two approaches to training such a model: a Context model and a Sequence model. In the first approach, N consecutive utterances are concatenated into one sample to capture context information. We use these contextualized samples to train a BERT+MLP classifier to predict the span label of the last utterance in each sample. In the second approach, we employ a BERT+LSTM+CRF (PDF, 728 KB) architecture to learn the correlation between a sequence of utterances and their annotated sequence of span labels.
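As an illustration of the Context model, the sketch below concatenates a sliding window of N=4 utterances and asks a BERT sequence classifier for the label of the window's last utterance. It uses the public bert-base-uncased checkpoint from Hugging Face as a stand-in; in practice the classifier would first be fine-tuned on the span labels in Dataset E, and the windowing details here are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["ihpi", "pe", "a&p", "none"]
N = 4  # context window size, as in the Context model (N=4)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

def predict_span_labels(utterances):
    """Slide a window of N utterances and predict the label of the last one."""
    labels = []
    for i in range(len(utterances)):
        window = utterances[max(0, i - N + 1) : i + 1]
        inputs = tokenizer(" [SEP] ".join(window), truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        labels.append(LABELS[int(logits.argmax(dim=-1))])
    return labels
```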
The summarization model we adopt is BART, a transformer-based encoder-decoder model with an established record in the deep learning community of outstanding performance on automatic summarization of news articles. Our own researchers at 3M have also demonstrated its capability to summarize medical conversations by adapting the model to our domain.
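Generating a summary with BART follows the standard Hugging Face pattern. The sketch below loads the public facebook/bart-large checkpoint as a stand-in for our domain-adapted model, which is fine-tuned on pairs of predicted ihpi utterances and human-written HPI summaries from Dataset A; the generation settings shown are common defaults, not our exact configuration.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_hpi_summary(source_text, max_input=1024, max_summary=256):
    """Summarize the concatenated ihpi utterances into an HPI paragraph."""
    inputs = tokenizer(source_text, truncation=True,
                       max_length=max_input, return_tensors="pt")
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,            # beam search, a common choice for BART
        max_length=max_summary,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```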
When we evaluate the generated HPI summaries with the ROUGE-1, ROUGE-2 and ROUGE-L metrics, the slicing and adaptive thresholding methods on the Context model (N=4) both show an advantage over the direct truncation method. For detailed ROUGE results, refer to our paper (PDF, 433.39 KB).
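For readers who want to reproduce the metric itself, here is a minimal example using Google's rouge-score package (pip install rouge-score); the reference and generated strings are made up for illustration.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
reference = "Patient reports chest pain for three days, worse on exertion."
generated = "The patient has had chest pain for three days."
for name, score in scorer.score(reference, generated).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} "
          f"F1={score.fmeasure:.3f}")
```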
Experiments indicate that the BART model can benefit from a longer context in the input but is also exposed to a position bias that tends to favor early utterances in the conversation.
The QuickUMLS metrics (shown below in Table 2) confirm the advantage of the slicing method with the Context model (N=4). Our models consistently improve over the single-stage summarization and multi-stage chunking approaches proposed by our team in an earlier paper.
Table 2: Slicing on the Context model reports the best F1, with more than a 2% advantage over the Sequence model with direct truncation. The single-stage “first 640” baseline model has a lower F1 score than the slicing method. Comparing groups two and three, we believe BART can benefit from a longer context in the input but is also exposed to a position bias that favors early utterances in the conversation.
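The concept-based scores can be computed along these lines: extract UMLS concept identifiers (CUIs) from the reference and generated summaries with QuickUMLS and measure their overlap. This is a sketch assuming a local QuickUMLS installation (which requires a UMLS license); the data path and matcher threshold are illustrative and may differ from our exact evaluation setup.

```python
from quickumls import QuickUMLS

# Path to a local QuickUMLS data directory (hypothetical placeholder).
matcher = QuickUMLS("/path/to/quickumls/data", threshold=0.7)

def cuis(text):
    """Set of UMLS concept IDs (CUIs) that QuickUMLS finds in the text."""
    return {cand["cui"]
            for match in matcher.match(text, best_match=True)
            for cand in match}

def concept_f1(reference, generated):
    """Precision/recall/F1 over the overlap of extracted concepts."""
    ref, gen = cuis(reference), cuis(generated)
    tp = len(ref & gen)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(ref) if ref else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```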
We proposed an extract-and-abstract approach to the automatic summarization of the History of Present Illness from doctor-patient conversations. The approach shows improvement on concept-based evaluation with comparable ROUGE scores, and summaries from the proposed solution achieve better coverage of critical medical information.