• Effective convolutional attention network for multi-label clinical document classification

    Yang Liu¹, Hua Cheng¹, Russell Klopfer¹, Matthew R Gormley², Thomas Schaaf¹
    ¹3M HIS ²Carnegie Mellon University
    3M HIS coding research team

    • Introduction

  In this paper, published at EMNLP 2021, we present an effective convolutional attention network for the multi-label document classification (MLDC) problem, with a focus on medical code prediction from clinical documents.

  MLDC has a great number of practical applications, one of which is automatic medical coding, where a patient encounter containing multiple records is assigned appropriate medical codes. Many medical encounters need to be coded for billing purposes every day. Professional clinical coders often use rule-based or simple ML-based systems to assign billing codes, but the large code space (e.g., the ICD-10 code system contains more than 90,000 codes) and long documents are challenging for ML models.

  In addition, coding requires extracting useful information from specific locations across the entire encounter to support the assigned codes. Consequently, effective models capable of handling these challenges will have an immense impact in the medical domain by helping to reduce coding cost, improve coding accuracy and increase customer satisfaction. Deep learning methods have been shown to produce state-of-the-art results on benchmark MLDC and medical coding tasks, but demand remains for more effective and accurate solutions.

  In this paper, we propose EffectiveCAN, an effective convolutional attention network for MLDC. Our models strike a careful balance between simplicity and over-parameterization, so that we can effectively model long documents and capture nuanced aspects of the full document text. Such a model is particularly useful for addressing the challenges of automatic medical coding. We evaluate our models on the widely used MIMIC-III dataset and attain state-of-the-art results across multiple commonly used metrics. We also demonstrate the language-independent nature of our approach by coding two non-English datasets. Our model outperforms the prior best model and a multilingual transformer model by a substantial margin.

    • Methods

      Our EffectiveCAN model (figure below) is composed of four major components: an input layer that transforms the raw document texts into pretrained word embeddings, a deep convolution-based encoder that combines the information of adjacent words and learns meaningful representations of the document texts, an attention component that selects the most important text features and generates label-specific representations for each label, and an output layer that produces the final predictions.

      The model structure is primarily designed for generating better predictions on multi-label classification tasks from three aspects:

      1. Generating meaningful representations for input texts
      2. Selecting informative features from text representations for label prediction
      3. Preventing overconfidence on frequent labels

      Firstly, in order to obtain high quality representations of the document texts, we incorporate the squeeze-and-excitation (SE) network and the residual network into the convolution-based encoder. The encoder consists of multiple encoding blocks to enlarge the receptive field and capture text patterns with different lengths.
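      To see why stacking encoding blocks enlarges the receptive field, consider the following back-of-the-envelope helper. This is an illustrative sketch, not code from the paper: it assumes stride-1 1-D convolutions, and the kernel size of 3 is a placeholder.

      ```python
      def receptive_field(num_layers, kernel_size=3):
          """Receptive field (in words) of num_layers stacked
          stride-1 1-D convolutions with the given kernel size."""
          rf = 1
          for _ in range(num_layers):
              rf += kernel_size - 1  # each layer sees (k-1)/2 more words per side
          return rf

      # Deeper stacks cover longer text spans, so different depths
      # capture text patterns of different lengths.
      widths = [receptive_field(n) for n in (1, 2, 4, 8)]
      print(widths)  # [3, 5, 9, 17]
      ```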

      Secondly, instead of only using the last encoder layer output for attention, we extract all encoding layer outputs and apply the attention to select the most informative features for each label. Finally, to cope with the long-tail distribution of the labels, we use a combination of the binary cross entropy loss and focal loss to make the model perform well on both frequent and rare labels.
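      The effect of adding the focal loss can be sketched in NumPy as follows. This is a minimal illustration, not the paper's exact training objective: the focusing parameter gamma=2 and the toy probabilities are assumptions for demonstration.

      ```python
      import numpy as np

      def binary_cross_entropy(p, y, eps=1e-7):
          """Element-wise BCE for multi-label targets y in {0, 1}."""
          p = np.clip(p, eps, 1 - eps)
          return -(y * np.log(p) + (1 - y) * np.log(1 - p))

      def focal_loss(p, y, gamma=2.0, eps=1e-7):
          """BCE scaled by (1 - p_t)^gamma, where p_t is the predicted
          probability of the true class; easy examples are down-weighted."""
          p = np.clip(p, eps, 1 - eps)
          p_t = np.where(y == 1, p, 1 - p)
          return (1 - p_t) ** gamma * binary_cross_entropy(p, y, eps)

      # Toy case: one confidently correct label, one uncertain label.
      probs  = np.array([0.9, 0.6])
      target = np.array([1.0, 1.0])

      bce = binary_cross_entropy(probs, target)
      foc = focal_loss(probs, target)
      # The easy example's loss shrinks far more than the hard one's,
      # shifting the gradient toward hard (often rare) labels.
      ```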

      The figures below show the architecture of EffectiveCAN (left), the structure of a Res-SE block containing an SE module and a residual module (middle), and the structure of the SE network (right). Notation: Nw – number of words in the document; de – word embedding dimension; Xe – embedding vector; Nl – number of labels; dl – embedding size of each label; U – label embedding matrix; V1–V4 – label-specific representations.

    Three figures

    A figure on the left provides an architectural view of EffectiveCAN. The input document Xe, consisting of Nw words of dimensionality de, is fed through a stack of Res-SE blocks. Each block outputs an embedding representation H. A set of attention modules attends over all the embedding representations H, together with the label embeddings (of all the ICD-10 codes), to produce label-specific representations. These are fed to a linear layer with a sigmoid to output the codes relevant to the input document.
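    The label-wise attention and sigmoid output described above can be sketched in NumPy. This is an illustrative simplification: the paper attends over every encoder layer's output, collapsed here to a single H, and the dimensions, random weights, and dot-product scoring are assumptions for demonstration.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    N_w, d = 100, 32   # words in the document, hidden size
    N_l = 5            # number of labels (tiny stand-in for the ICD code set)

    H = rng.standard_normal((N_w, d))   # encoder output, one row per word
    U = rng.standard_normal((N_l, d))   # label embedding matrix

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    # Attention: how relevant each word is to each label.
    A = softmax(U @ H.T, axis=-1)       # (N_l, N_w), each row sums to 1

    # Label-specific document representations: one weighted average per label.
    V = A @ H                           # (N_l, d)

    # Output layer: per-label logit + sigmoid, giving an independent
    # probability for each label (multi-label, not softmax over labels).
    w = rng.standard_normal((N_l, d))
    logits = (V * w).sum(axis=-1)
    probs = 1.0 / (1.0 + np.exp(-logits))
    ```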

    A figure in the middle illustrates a Res-SE block. It applies a 1-D convolution to the input in parallel with an SE block; the two results are added to produce the embedding representation H.

    A figure on the right shows the squeeze-and-excitation (SE) block. It takes the input, passes it through a 1-D convolution, then applies the squeeze and excitation steps, whose output is used to scale the original representation X and produce a recalibrated X.
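    The squeeze-excite-scale steps can be sketched as a 1-D analogue of the SE mechanism. This is a hedged illustration: the channel count, reduction ratio r, and random weights are placeholders, not values from the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    N_w, C, r = 50, 16, 4   # sequence length, channels, reduction ratio

    X = rng.standard_normal((C, N_w))   # conv output: one row per channel

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Squeeze: global average pooling collapses each channel to one scalar.
    z = X.mean(axis=1)                  # (C,)

    # Excitation: a small bottleneck MLP yields per-channel gates in (0, 1).
    W1 = rng.standard_normal((C // r, C))
    W2 = rng.standard_normal((C, C // r))
    s = sigmoid(W2 @ np.maximum(0.0, W1 @ z))   # (C,)

    # Scale: reweight every channel of X by its gate, emphasizing the
    # channels (feature detectors) most useful for the document.
    X_scaled = X * s[:, None]           # same shape as X
    ```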

    • Datasets and results

      We evaluated our model on the widely used medical benchmark dataset MIMIC-III, as well as two medical datasets in Dutch and French respectively.

      Table 3 below (from the paper) shows the results on the MIMIC-III dataset using the full ICD-9 code set. Our model achieves the strongest results across multiple metrics compared with the other systems, improving the state-of-the-art Micro-F1 score as well as the ranking-based precision scores P@8 and P@15 (precision at the top 8 and 15 predictions).

    Table displaying the results on the MIMIC-III dataset using the full ICD-9 codes with the authors' model showing strong results

    On the Dutch and French datasets, we establish two baselines. The first is MultiResCNN, the best-performing publicly available model on MIMIC-III. The second is XLM-RoBERTa, a multilingual transformer model.

    However, only EffectiveCAN can be trained on the full label set (144 codes for Dutch, 940 codes for French): XLM-RoBERTa and MultiResCNN run out of memory on a 16 GB GPU. We therefore compare against the baselines on only the top-50 codes. XLM-RoBERTa yields poor results for both Dutch and French; recall is particularly low, likely because the model sees only the first 512 subwords of a long encounter containing thousands of tokens. Our model with multi-layer attention substantially outperforms the other two systems.

    Table showing the number of labels, precision, recall and F1 for Dutch and French across different models. The results show that the authors' model with multi-layer attention outperforms the other two systems

    • Conclusion

      We proposed an effective convolutional attention network for MLDC and showed its effectiveness for medical coding on long documents. Our model features a deep and more refined convolutional encoder, consisting of multiple Res-SE blocks, to capture the multi-scale patterns of the document texts.

      Furthermore, we use multi-layer attention to adaptively select the most relevant features for each label, and we employ the focal loss to improve rare-label prediction without sacrificing overall performance. Our model obtains state-of-the-art results across several metrics on MIMIC-III and compares favorably with other systems on two non-English datasets.