Thomas Schaaf, Longxiang Zhang, Mark Fuhs, Shahid Durrani, Susanne Burger, Monika Woszczyna, Thomas Polzin
3M HIS scribe research team
In this paper, published at the 2021 IEEE Automatic Speech Recognition and Understanding Workshop, we explore the feasibility of detecting embedded dictations in recordings of doctor-patient conversations.
The widespread adoption of electronic health record (EHR) systems has changed the workflow of doctors dramatically. Although EHR systems allow fast access to patient information, documenting in these systems is often quite complex and time-intensive from a physician's perspective.
A growing body of research also links this documentation process within EHR systems to physician burnout. One approach to reduce the burden of documentation and to limit the physicians’ interactions with EHR systems is to shift some of these responsibilities to medical scribes. Medical scribes can be physically present during the encounter or be remote. We focus our solutions on technology assistance for remote scribes.
The figure below illustrates remote scribing processes. On the left, it shows the physician documenting while largely ignoring the patient. On the right, a scribe listens to the visit and documents it. We look at two different modes: synchronous and asynchronous. In synchronous mode the scribes are virtually present and can interact with the physician. In asynchronous mode, a recording of the conversation is sent to the scribe to generate documentation after the visit. The context for the work discussed here is supporting asynchronous scribing.
We observed that a significant number of recordings from asynchronous scribing contain scribe-directed dictation segments. Dictated segments are expected to be entered, more or less verbatim, into the EHR system. To improve efficiency, a priori information about which segments are conversational and which are scribe-directed dictations is valuable, particularly when the latter segments are automatically transcribed.
In this work we analyze the behavior of 21 physicians and describe our experiments to partition doctor-patient conversations into conversation and dictation regions. Our contributions include a behavioral analysis of how physicians dictate in the context of doctor-patient conversations, a linguistic analysis of speaking-style changes between conversation and dictation across different physicians, and the development of the first machine learning model with strong performance on the task of segmenting doctor-patient conversations into dictation and conversation regions.
Our data set comprises 105 audio recordings of 21 orthopedic physicians (five randomly selected recordings per physician) who used an asynchronous scribing service. An audio file contains either a conversation with a patient (with or without dictation), or the physician exclusively dictating to the scribe. The physicians recorded themselves using smartphones or tablets. A simple bit encoding of where the dictation occurred was employed to manually characterize each audio file. The scheme captures whether the dictation segment occurs at the front, middle, end, or a combination thereof. The annotation is used to combine speech regions with in-between pauses (silence regions) into contiguous dictation or conversation regions. The example below shows the case where dictation occurs at the beginning, middle, and end.
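The bit-encoding scheme can be sketched as follows; the specific bit assignments (front=4, middle=2, end=1) are assumptions for illustration, not the paper's actual values:

```python
# Hypothetical bit assignments for where dictation occurs in a recording.
FRONT, MIDDLE, END = 4, 2, 1

def encode_locations(front=False, middle=False, end=False):
    """Pack dictation-location flags into a single integer code."""
    return (FRONT if front else 0) | (MIDDLE if middle else 0) | (END if end else 0)

def decode_locations(code):
    """Unpack a code back into a human-readable set of locations."""
    return {name for name, bit in (("front", FRONT), ("middle", MIDDLE), ("end", END))
            if code & bit}

# Dictation at the beginning, middle, and end maps to a single code (here, 7).
code = encode_locations(front=True, middle=True, end=True)
```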
The data set characteristics, in terms of total audio as well as dictation regions and locations, are shown below.
To understand qualitatively how physicians sound when they are dictating compared to when they are talking to another person, we selected four conversations that contained dictation somewhere during the encounter from four different physicians. We manually transcribed the dictation part as well as conversation parts of similar duration around the dictation. We included disfluencies and filler words in the transcription. The transcribed data confirmed that all four speakers switched to a different speaking style during dictation. All physicians flattened intonation, put stress only on the main syllable of the key words of each utterance, and either lengthened the last syllable of an utterance or produced a filled pause after it.
A Time Delay Neural Network (TDNN) automatic speech recognition (ASR) model was trained on speech recorded with handheld microphones and mobile devices. The recognition results were based on interpolating two language models: one trained on a mix of clinical note dictations and the other on conversational speech and text data.
The segmentation created a new audio segment when no speech was detected for 1.0 seconds or more.
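This segmentation rule can be sketched as a small merge over speech intervals; the interval representation is an assumption for illustration:

```python
def segment_speech(speech_intervals, min_gap=1.0):
    """Merge speech intervals (start, end in seconds) into audio segments,
    starting a new segment whenever the silence between consecutive
    intervals is at least `min_gap` seconds."""
    segments = []
    for start, end in sorted(speech_intervals):
        if segments and start - segments[-1][1] < min_gap:
            segments[-1][1] = max(segments[-1][1], end)  # gap < 1.0 s: extend
        else:
            segments.append([start, end])                # gap >= 1.0 s: new segment
    return [tuple(s) for s in segments]
```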
We used two metrics to evaluate all our models: classification error rate (CER) and F1 score. CER measures how much of the total recording time is misclassified, while F1 is the harmonic mean of the model's recall and precision on the dictated regions. Ground-truth labels are provided by manual annotation.
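A minimal frame-level sketch of the two metrics, assuming the recording has been discretized into fixed-length frames each labeled "dict" or "conv" (the paper measures time directly over regions):

```python
def time_metrics(truth, pred):
    """Compute (CER, F1) from per-frame labels.
    CER: fraction of total recording time that is misclassified.
    F1: harmonic mean of precision and recall on the dictation class."""
    assert len(truth) == len(pred)
    wrong = sum(t != p for t, p in zip(truth, pred))
    cer = wrong / len(truth)
    tp = sum(t == p == "dict" for t, p in zip(truth, pred))
    pred_pos = sum(p == "dict" for p in pred)
    true_pos = sum(t == "dict" for t in truth)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / true_pos if true_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return cer, f1
```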
The test protocol employed was leave-one-physician-out cross-validation. In essence, we train on the audio files associated with 20 physicians (~80 audio files) and test on the held-out physician. Results averaged across all physicians are then reported.
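The leave-one-physician-out splits can be generated with a few lines of plain Python (scikit-learn's `LeaveOneGroupOut` does the same at the index level); the dictionary layout is an assumption for illustration:

```python
def leave_one_physician_out(files_by_physician):
    """Yield (held_out_id, train_files, test_files) splits, holding out
    one physician at a time. `files_by_physician` maps a physician id
    to the list of that physician's audio files."""
    for held_out in files_by_physician:
        train = [f for pid, files in files_by_physician.items()
                 if pid != held_out for f in files]
        yield held_out, train, files_by_physician[held_out]
```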
To ground progress, we used two chance classifiers as baselines: one labels all audio as conversation, the other labels all audio as dictation. Five different ways to segment the audio into conversation and dictation regions were implemented. These are briefly discussed here (see the paper for a fuller treatment).
Random forest, acoustic features only: In this experiment a random forest model was trained on acoustic features extracted with the Librosa library. The feature set included root-mean-square (RMS) energy, fundamental frequency, zero-crossing rate, chroma short-term Fourier transform (STFT), spectral centroid, spectral bandwidth, spectral rolloff, and spectral flatness. We used speech detection to identify speech segments that were separated by at least one second of no speech.
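For a sense of what these features measure, two of them (RMS energy and zero-crossing rate) can be computed per frame with NumPy alone; Librosa provides these plus f0, chroma, and the spectral features named above, and the frame/hop sizes here are assumptions:

```python
import numpy as np

def rms(frame):
    """Root-mean-square energy of one audio frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def frame_features(signal, frame_len=2048, hop=512):
    """Per-frame feature rows [rms, zcr] over a mono signal."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append([rms(frame), zero_crossing_rate(frame)])
    return np.array(feats)
```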
Rule-based approach using ASR hypotheses: The approach here is to look for high-confidence text that corresponds to phrases physicians typically use when dictating. Examples include "start dictation," "comma," "period," "chief complaint," "assessment and plan," and "new paragraph." Any audio segment that contained one of these keywords was labeled as dictation.
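A minimal sketch of the keyword rule, using the cue phrases listed above; exact-token matching on lowercased text is an assumption (the actual system also requires high ASR confidence):

```python
# Cue phrases physicians typically use when dictating (from the examples above).
DICTATION_KEYWORDS = {"start dictation", "comma", "period",
                      "chief complaint", "assessment and plan", "new paragraph"}

def label_segment(hypothesis):
    """Label an ASR hypothesis 'dictation' if any cue phrase occurs,
    else 'conversation'. Padding with spaces enforces whole-word matches."""
    text = " " + hypothesis.lower() + " "
    if any(" " + kw + " " in text for kw in DICTATION_KEYWORDS):
        return "dictation"
    return "conversation"
```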
Language model (LM) likelihood ratio: Here the label is determined by which LM, one trained on clinical text or one trained on conversational text, better explains the recognized text.
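The decision rule can be sketched with toy unigram LMs; the vocabularies and probabilities below are made up for illustration (real systems would use much larger n-gram models):

```python
import math

# Toy unigram LMs; probabilities are illustrative only.
dictation_lm = {"patient": 0.3, "period": 0.4, "exam": 0.2, "okay": 0.1}
conversation_lm = {"patient": 0.1, "period": 0.05, "exam": 0.15, "okay": 0.7}

def log_likelihood(words, lm, floor=1e-4):
    """Sum of per-word log probabilities, with a floor for unseen words."""
    return sum(math.log(lm.get(w, floor)) for w in words)

def classify(words):
    """Label a segment by which LM assigns the higher log-likelihood."""
    ratio = log_likelihood(words, dictation_lm) - log_likelihood(words, conversation_lm)
    return "dictation" if ratio > 0 else "conversation"
```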
Hidden-Markov-Model (HMM)-conditioned LM: This method also uses ASR hypotheses. It models the generated text as an HMM process with two states: conversation and dictation. The transition probabilities were biased to remain in the same state (0.9). The Viterbi algorithm is used to segment the word sequence into dictation and conversation regions, allowing transitions within a speech segment.
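A minimal sketch of two-state Viterbi decoding, assuming per-word emission log-likelihoods from the two LMs are already available (the tie to speech segments is simplified away here):

```python
import math

def viterbi_two_state(emission_ll, stay=0.9):
    """Decode the most likely state sequence over two states
    (0 = conversation, 1 = dictation) given per-word emission
    log-likelihoods [(conv_ll, dict_ll), ...]. Transitions are
    biased to remain in the same state with probability `stay`."""
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    scores = [emission_ll[0][0], emission_ll[0][1]]
    backptrs = []
    for conv_ll, dict_ll in emission_ll[1:]:
        step, new_scores = [], []
        for state, emit in ((0, conv_ll), (1, dict_ll)):
            stay_score = scores[state] + log_stay
            switch_score = scores[1 - state] + log_switch
            if stay_score >= switch_score:
                new_scores.append(stay_score + emit)
                step.append(state)
            else:
                new_scores.append(switch_score + emit)
                step.append(1 - state)
        scores = new_scores
        backptrs.append(step)
    # Trace back from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for step in reversed(backptrs):
        state = step[state]
        path.append(state)
    path.reverse()
    return ["conversation" if s == 0 else "dictation" for s in path]
```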
Combining all features using a random forest: Here the features from all the approaches above are combined in a single random forest classifier.
The following table shows the summary of results across the different methods implemented.
The acoustic-only model identified fundamental frequency and energy features as the most influential for identifying dictation, which aligns with the results of the linguistic analysis. Combining all the features into one random forest model yielded the best overall results. An ablation study showed that all the features contributed to the classification, with the language model features being the most important.
In conclusion, separating physician conversations and dictations is a real-world challenge. In this paper we show a viable way in which this segmentation can be achieved in practice.