• Leveraging pretrained models for automatic summarization of doctor-patient conversations

    Longxiang Zhang¹, Renato Negrinho², Arindam Ghosh¹, Vasudevan Jagannathan¹, Hamid Reza Hassanzadeh¹, Thomas Schaaf¹, Matthew R Gormley²
    ¹3M HIS ²Carnegie Mellon University
    3M HIS scribe research team, 3M HIS

    • Introduction

      In this paper, published in EMNLP Findings, we explore the feasibility of using pretrained transformer models for automatically summarizing doctor-patient conversations directly from transcripts.

      In recent years, pretrained transformer models (Lewis et al., 2019; Devlin et al., 2018; Zaheer et al., 2020; Brown et al., 2020) have been responsible for many breakthroughs in natural language processing (NLP), such as improved state-of-the-art performance on a broad range of tasks and the ability to train effective models for low-resource tasks.

      The demonstrated capability of transfer learning using large pretrained transformer models has led to widespread interest in leveraging these models in less standard NLP domains. Automatic generation of medical summaries from doctor-patient conversation transcripts presents several challenges such as the limited availability of supervised data, the substantial domain shift from the text typically used in pretraining, and potentially the long dialogues that exceed the length limitation of conventional transformers.

      We show that pretrained transformer model BART (Lewis et al., 2019) can be fine-tuned to generate highly fluent summaries of surprisingly good quality even with a small dataset.

    • Dataset

      The dataset used in this paper is based on a collection of 1342 de-identified doctor-patient conversations from two major specialties, internal medicine and primary care, annotated by medical scribes using our annotation environment specifically designed for the task. The scribes listen to the conversation audio and fill in the necessary information in a simulated electronic health record (EHR) system. The EHR simulator consists of 14 distinct sections, such as history of present illness (HPI) and review of systems (ROS). We collect multiple references for each conversation, for a total of 21588 annotations. We focus here on generating the HPI section.

    • Methods

      We explore multiple ways to generate summaries. All models rely on fine-tuning pretrained BART models; in this post, however, we discuss a specific multi-stage model that provided an effective way to overcome length limitations and gave the best results. The paper provides detailed discussions of all the different single- and multi-stage models that were tried.

    • Multistage summarization using chunking

      In the multi-stage model, dialogue summarization is performed in two steps: summarizing portions of the input conversation, then rewriting the aggregated portion summaries into a final summary. The figure below shows the multi-stage chunking approach.

    Visual showing stages of chunking for a conversation between a doctor and a patient

    Four stages: (1) dialogue between a doctor and patient color-coded into chunks; (2) the same dialogue broken out with the header (first lines of dialogue from the patient and doctor) shown above each chunk; (3) stage 1 summarizer pulls a short summary from each of the three chunks; (4) stage 2 summarizer shows the final overall summarization
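    The two-stage flow above can be sketched in a few lines. This is an illustrative skeleton, not the paper's implementation: `summ1` and `summ2` stand in for the fine-tuned BART summarizers, which are passed in as ordinary functions here.

    ```python
    from typing import Callable, List


    def two_stage_summarize(
        chunks: List[str],
        summ1: Callable[[str], str],
        summ2: Callable[[str], str],
    ) -> str:
        """Stage 1 summarizes each chunk independently; stage 2 rewrites
        the concatenated partial summaries into one final summary."""
        partial_summaries = [summ1(chunk) for chunk in chunks]
        aggregated = " ".join(partial_summaries)
        return summ2(aggregated)
    ```

    In practice, `summ1` and `summ2` would each wrap a fine-tuned BART model's generate call; decoupling them as plain callables keeps the pipeline easy to test with dummy summarizers.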

    • We create chunks of transcript from each conversation, where each chunk consists of two components: a fixed-length "header" that is selected from the beginning of the conversation and is present in all chunks, and a variable "body" that is created by a sliding scan over the rest of the conversation.

      In the figure, Summ1 and Summ2 refer to the stage 1 and stage 2 summarizers, respectively. The same header (denoted by the yellow box) is added to the beginning of every chunk to serve as context, and the complete summaries are used as targets for fine-tuning both the Summ1 and Summ2 models.
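    The chunk construction can be made concrete with a small utterance-level sketch. The lengths and stride below are illustrative assumptions (the paper sizes chunks to fit the model's input limit, typically in tokens rather than utterances):

    ```python
    from typing import List


    def make_chunks(
        utterances: List[str],
        header_len: int = 2,
        body_len: int = 4,
        stride: int = 4,
    ) -> List[List[str]]:
        """Build chunks from a transcript: a fixed header taken from the
        start of the conversation is prepended to every chunk, and the
        body is a sliding window over the remaining utterances."""
        header = utterances[:header_len]
        rest = utterances[header_len:]
        chunks = []
        for start in range(0, len(rest), stride):
            body = rest[start:start + body_len]
            if body:
                chunks.append(header + body)
        return chunks
    ```

    With a non-overlapping stride (stride equal to body length, as above) every utterance after the header appears in exactly one chunk; a smaller stride would give overlapping bodies instead.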

    • Results

      The following table shows that the chunking method provides the best results when measured with ROUGE-1, ROUGE-2, and ROUGE-L.

      • Performance breakdown by input length
        • Multi-stage (Chunking) training outperforms single-stage across all buckets

    Bar charts comparing single-stage, multi-stage chunking, and multi-stage SentBERT models across different conversation lengths. The chunking method provides the best results for ROUGE-1, ROUGE-2, and ROUGE-L.
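    For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference. The sketch below is a deliberately simplified single-reference version (whitespace tokenization, no stemming); the reported scores come from a standard ROUGE implementation, not this code.

    ```python
    from collections import Counter


    def rouge1_f1(candidate: str, reference: str) -> float:
        """Simplified ROUGE-1 F1: harmonic mean of unigram precision
        and recall between candidate and reference summaries."""
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((cand & ref).values())  # clipped unigram matches
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)
    ```

    ROUGE-2 is the same computation over bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.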

    The following sample compares a reference summary with what the above chunking model produces for a specific conversation.

    Compared to the reference summary, multi-stage chunking not only finds concepts that are in the reference but also surfaces concepts from the conversation that the reference omits.

    Reference summary:

    Male patient presents today for followup of his hyperglycemia. His hemoglobin has improved to 7.3 to 7.4. He reports being sick after he recently traveled via plane. He notes that he has been sick for the past 3 days. He is feeling a bit better today, but still feels tired and sleepy. He says that he has been taking Cepacol for the symptoms.

    Multistage (Chunking)

    The patient presents to the clinic today for a follow-up visit. He has a history [green highlighted text] of diabetes mellitus [end highlight]. His A1C today is 7.3. He reports that he has been [green highlighted text] experiencing a cold for the past few days [end highlight]. He states that [yellow highlighted text] he is still feeling fatigued [end highlight]. He is not [yellow highlighted text] eating as well as he would like [end highlight]. He would like [green highlighted text] a refill of his Cialis and Viagra [end highlight]. He is [green highlighted text] currently working for a consultancy and has a contract that lasts until May 2019 [end highlight]. He [yellow highlighted text] will be out of the country for 6 months [end highlight]. He [yellow highlighted text] does not want to go for a long time away [end highlight].

    • Text in green highlights medical findings present in at least one reference summary; text in yellow highlights findings not in the reference but supported by the conversation. The following factors were considered during human evaluation:

      • Fluency: How fluent is the text generated?
      • Relevancy: Are contents relevant for HPI?
      • Missing: Are any key findings missing?
      • Hallucination: Are any findings hallucinated or inaccurate?
      • Repetition: Are there repetitive sentences?
      • Contradiction: Are any sentences contradicting each other?

      The generated summaries are surprisingly fluent and in many of the examples we evaluated manually, even better than human summaries. We did not see any instances where the summary had contradicting sentences or repetition. However, there were several cases where information was missing or inaccurate.

    • Conclusion

      We show the feasibility of summarizing doctor-patient conversations directly from transcripts without an extractive component. We fine-tune various pretrained transformer models to generate the HPI section of a typical medical report from the transcript and achieve surprisingly good performance with pretrained BART models.