• Extract and abstract with BART for clinical notes from doctor-patient conversations

    Jing Su, Longxiang Zhang, Hamid Reza Hassanzadeh, Thomas Schaaf 
    3M HIS scribe research team

    Full paper (PDF, 433 KB)

    • Introduction

      Doctor-Patient Conversations (DoPaCos) exhibit a natural topic-segmented structure: doctors tend to guide the topics in a loosely stable order to obtain the required information (e.g., diagnosis followed by assessment and plan). We propose an extract-and-abstract approach that leverages this topical structure to automatically summarize the HPI section of a clinical note from DoPaCos. We show that removing distracting information from DoPaCos improves summarization, and that our approach admits an easy-to-implement pipeline design. Figure 1 shows the architecture of the extract-and-abstract approach that we have implemented.

    Figure 1: A diagram of our extract-and-abstract approach to generate the Subjective (or ihpi) section of SOAP notes.

    Figure 1: On the left, a doctor and patient are conversing. The next column shows a transcript with the conversation relevant to clinical note creation highlighted. The highlighted elements are extracted using an ML model, shown in the next column. Another model abstracts the content to create SOAP notes, which consist of Subjective, Objective, Assessment, and Plan sections.

    • Dataset

      We use two disjoint datasets (Dataset A and Dataset E) in this paper. Both contain DoPaCo transcripts. The difference is that DoPaCos in Dataset A are paired with HPI summaries, while Dataset E comes with a label assigned to each utterance in every conversation. Each utterance is associated with one of four Span Labels: ihpi (inclusive HPI), pe (Physical Examination), a&p (Assessment & Plan), or none (empty label). "Span" refers to the characteristic that these labels almost always span multiple consecutive utterances.

      Dataset E is used to train an utterance selection model, which then predicts ihpi utterances in Dataset A. Only the predicted ihpi utterances in Dataset A are used to fine-tune a summarization model that generates HPI summaries.
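The handoff between the two datasets reduces to a simple extract step: keep only the utterances that the selection model predicts as ihpi. A minimal sketch, where `predict_span_label` is a hypothetical stand-in for the trained BERT-based classifier described in the Methods section:

```python
def predict_span_label(utterance: str) -> str:
    """Toy heuristic for illustration only; the real model is BERT-based."""
    if "blood pressure" in utterance.lower():
        return "pe"
    return "ihpi"

def extract_ihpi(conversation: list[str]) -> list[str]:
    """Return only the utterances predicted to belong to the HPI section."""
    return [u for u in conversation if predict_span_label(u) == "ihpi"]

conversation = [
    "I've had a headache for three days.",
    "Your blood pressure is 120 over 80.",
    "It gets worse in the morning.",
]
```

Only the surviving utterances are passed downstream as the source text for summarizer fine-tuning.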

    A conversation transcript with span labels associated with portions of dialogue

    Table 1. An example Doctor-Patient Conversation with annotated span labels per utterance. A span label can be ihpi, pe, a&p, or none.

    • Methods

      Our proposed extract-and-abstract approach involves two key models: the utterance selection model and the summarization model. Figure 2 shows the pipeline design.

    Figure 2: A pipeline of the supervised-learning approach to utterance selection and summarization.

    • Figure 2: A pipeline of the supervised-learning approach to utterance selection and summarization. In the figure, the annotated span labels in Dataset E are used to train BERT-based classifiers. These classifiers are then used to select HPI utterances in Dataset A. After an additional filtering step, we use the set of predicted HPI utterances as source texts to fine-tune the BART summarizer; the training target is the human-annotated HPI summaries in Dataset A.

    • The goal of the utterance selection model is to filter out utterances unrelated to HPI by predicting the span label of each utterance in a DoPaCo. We explore two approaches to training such a model: a Context model and a Sequence model. In the first approach, N consecutive utterances are concatenated into one sample to capture context information; we use these contextualized samples to train a BERT+MLP classifier to predict the span label of the last utterance in each sample. In the second approach, we employ a BERT+LSTM+CRF (PDF, 728 KB) architecture to learn the correlation between a sequence of utterances and their annotated sequence of span labels.
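The Context model's input construction can be sketched as follows; the function name, the separator token, and the toy data are ours, not from the paper. Each sample concatenates up to N consecutive utterances, and the span label of the last utterance in the window is the prediction target:

```python
def build_context_samples(utterances, labels, n=4, sep=" [SEP] "):
    """Build (text, label) pairs for the Context model: each text is the
    concatenation of up to n consecutive utterances, and the target is
    the span label of the last utterance in the window."""
    samples = []
    for i in range(len(utterances)):
        window = utterances[max(0, i - n + 1): i + 1]
        samples.append((sep.join(window), labels[i]))
    return samples

utts = ["u1", "u2", "u3", "u4", "u5"]
labs = ["none", "ihpi", "ihpi", "pe", "a&p"]
samples = build_context_samples(utts, labs, n=3)
```

The resulting (text, label) pairs would then be tokenized and fed to the BERT+MLP classifier.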

      The summarization model we adopt is BART, a transformer-based encoder-decoder model with an established record in the deep learning community of outstanding performance on automatic summarization of news articles; our own researchers at 3M have also demonstrated its capability to summarize medical conversations by adapting the model to our domain.

      • The two models are connected by a filtering (extract) step: we run the trained utterance selection model on every utterance in Dataset A and keep only the utterances predicted as ihpi in each conversation as input to the BART summarizer. Although a significant number of utterances are filtered out in this step, 31.4% of the filtered DoPaCos still have more than 1024 tokens, exceeding the input token limit of the BART model. We therefore introduce direct truncation, adaptive thresholding, and slicing as additional filtering methods to retain the most relevant information for summarization while keeping the input length within the acceptable range (<= 1024 tokens, or approximately 640 words) of the BART model.
      • Direct truncation: HPI utterance collections longer than 1024 tokens are cut off at this token limit.
      • Adaptive thresholding: we define a set of probability threshold values. Thresholds from low to high are applied to filter the HPI utterances predicted by the Context model until the token limit is satisfied.
      • Slicing: for HPI utterance collections longer than the token limit, we split them into consecutive chunks such that each chunk satisfies the limit. Each chunk is used as a standalone training sample for the summarizer.
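The three filtering methods can be sketched as pure-Python functions. This is a simplified sketch: we count whitespace tokens rather than BART subword tokens, and the threshold values shown are illustrative, not the ones used in the paper:

```python
def n_tokens(utterances):
    """Token count over a list of utterances (whitespace tokens, for brevity)."""
    return sum(len(u.split()) for u in utterances)

def direct_truncation(utterances, limit=1024):
    """Keep whole utterances from the start until the token limit is reached."""
    kept, total = [], 0
    for u in utterances:
        if total + len(u.split()) > limit:
            break
        kept.append(u)
        total += len(u.split())
    return kept

def adaptive_thresholding(utterances, probs, limit=1024,
                          thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Raise the ihpi-probability threshold step by step until the
    surviving utterances fit within the token limit."""
    kept = list(utterances)
    for t in thresholds:
        if n_tokens(kept) <= limit:
            break
        kept = [u for u, p in zip(utterances, probs) if p >= t]
    return kept

def slicing(utterances, limit=1024):
    """Split into consecutive chunks, each within the token limit; every
    chunk becomes a standalone training sample for the summarizer."""
    chunks, current, total = [], [], 0
    for u in utterances:
        n = len(u.split())
        if current and total + n > limit:
            chunks.append(current)
            current, total = [], 0
        current.append(u)
        total += n
    if current:
        chunks.append(current)
    return chunks

# Toy input: three utterances of 3, 2, and 4 tokens.
utts = ["a b c", "d e", "f g h i"]
```

Note that slicing discards nothing, which is consistent with its stronger downstream results: later utterances survive as separate training samples instead of being cut.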
    • Results

      When we evaluate the generated HPI summaries with the ROUGE-1, ROUGE-2, and ROUGE-L metrics, the slicing and adaptive thresholding methods on the Context model (N=4) both show an advantage over the direct truncation method. For detailed ROUGE results, refer to our paper (PDF, 433.39 KB).
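For readers unfamiliar with the metric, ROUGE-1 F1 reduces to unigram-overlap precision and recall between the generated and reference summaries. A minimal sketch, without the stemming and stopword options the standard toolkits provide:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 follows the same scheme over bigrams, and ROUGE-L over the longest common subsequence.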

      Experiments indicate that the BART model can benefit from a longer context in the input but is also exposed to a position bias that tends to favour early utterances in the conversation.

      The quickUMLS metrics (shown below in Table 2) confirm the advantage of the slicing method with the Context model (N=4). Our models consistently improve over the single-stage summarization and multi-stage chunking approaches proposed by our team in an earlier paper.
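Concept-based evaluation compares the sets of medical concepts extracted from the generated and reference summaries. The sketch below assumes concept extraction (e.g., via a quickUMLS matcher) has already produced two sets of concept identifiers; the CUI strings shown are illustrative:

```python
def concept_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 over medical concepts extracted from the generated vs. the
    reference summary. In practice both sets would come from running a
    quickUMLS matcher over the two texts."""
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Unlike ROUGE, this rewards coverage of clinically salient concepts regardless of surface wording, which is why the two metric families can disagree.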

    Table 2: quickUMLS evaluation of BART-Large models over the test set.

    • Table 2: Slicing on Context model reports the best F1 with more than 2% advantage over the Sequence model with Direct truncation. The Single stage first 640 baseline model has a lower F1 score when compared with the Slicing method. Comparing groups two and three, we believe BART can benefit from a longer context in the input but is also exposed to a position bias that favours starting utterances in the conversation.

    • Conclusion

      We proposed an extract-and-abstract approach to automatic summarization of History of Present Illness from doctor-patient conversations. This approach shows improvement in concept-based evaluation with comparable ROUGE scores. Summaries from the proposed solution achieve better coverage of critical medical information.