SDCNL

This paper was accepted to the 30th International Conference on Artificial Neural Networks (ICANN 2021), which will be held in September 2021.

Abstract

Early detection of suicidal ideation in depressed individuals can allow for adequate medical attention and support, which in many cases is life-saving. Recent NLP research focuses on classifying, from a given piece of text, whether an individual is suicidal or clinically healthy. However, there have been no major attempts to differentiate between depression and suicidal ideation, which is an important clinical challenge. Due to the scarce availability of EHR data, suicide notes, or other similar verified sources, web query data has emerged as a promising alternative. Online sources, such as Reddit, allow for anonymity that prompts honest disclosure of symptoms, making them a plausible source even in a clinical setting. However, these online datasets also result in lower performance, attributable to the inherent noise in web-scraped labels, which necessitates a noise-removal process. Thus, we propose SDCNL, a suicide versus depression classification method through a deep learning approach. We utilize online content from Reddit to train our algorithm, and to verify and correct noisy labels, we propose a novel unsupervised label correction method which, unlike previous work, does not require prior noise distribution information. Our extensive experimentation with multiple deep word embedding models and classifiers displays the strong performance of the method in a new, challenging classification application.

Noisy Labels

Using Reddit data poses the issue of noisy, or corrupted, labels. To make use of large web-scraped datasets, we use unsupervised clustering to perform confidence-based correction of the labels in the dataset.
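As a rough illustration of confidence-based correction, the sketch below replaces a label only when a clustering model assigns a sample to a class with high probability. The function name `correct_labels`, the `threshold` value of 0.9, and the assumption that cluster posteriors have already been mapped to class indices are all hypothetical choices for this example, not details taken from the paper.

```python
from typing import List, Sequence


def correct_labels(
    noisy_labels: Sequence[int],
    cluster_probs: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical confidence cutoff
) -> List[int]:
    """Replace a label only when the clustering model is confident enough.

    cluster_probs[i][k] is assumed to be the probability that sample i
    belongs to class k (for example, a mixture model's posterior after
    its clusters have been mapped to class indices).
    """
    if len(noisy_labels) != len(cluster_probs):
        raise ValueError("labels and probabilities must have the same length")

    corrected = []
    for label, probs in zip(noisy_labels, cluster_probs):
        # Class the clustering model considers most likely for this sample.
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        if probs[predicted] >= threshold:
            corrected.append(predicted)  # confident: trust the cluster
        else:
            corrected.append(label)      # uncertain: keep the original label
    return corrected


noisy = [0, 1, 1, 0]
probs = [[0.97, 0.03], [0.95, 0.05], [0.40, 0.60], [0.08, 0.92]]
corrected = correct_labels(noisy, probs, threshold=0.9)
```

In this toy input, the second and fourth labels are overwritten because the cluster assignment is confident and disagrees with them, while the third is left untouched because its highest probability falls below the cutoff.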

GMM clustering of BERT embeddings after PCA reduction to 2 dimensions shows the difficulty of the clustering task: the clusters show little variety and overlap heavily. We use covariance type "full".

Correction rates of the label correction algorithms at different noise rates on the IMDB dataset. The left figure displays correction rates with uniform noise injection, and the right figure displays correction rates with class-weighted noise injection, where the noise is split unevenly between the two classes (for example, 30%-10% or 25%-15%).
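The difference between uniform and class-weighted noise injection can be sketched as follows. This is a minimal example under assumptions: labels are binary, noise means flipping a label to the other class, and a split such as 30%-10% is read as the fraction of each class that gets flipped. The helper name `inject_label_noise` is hypothetical.

```python
import random
from typing import Dict, List, Sequence


def inject_label_noise(
    labels: Sequence[int],
    flip_rates: Dict[int, float],
    seed: int = 0,
) -> List[int]:
    """Flip a fixed fraction of each class's labels (binary labels assumed).

    flip_rates maps a class (0 or 1) to the fraction of that class's
    samples whose label is flipped to the other class.
    """
    rng = random.Random(seed)  # seeded for reproducible noise
    noisy = list(labels)
    for cls, rate in flip_rates.items():
        indices = [i for i, y in enumerate(labels) if y == cls]
        n_flip = round(rate * len(indices))
        for i in rng.sample(indices, n_flip):
            noisy[i] = 1 - cls
    return noisy


clean = [0] * 100 + [1] * 100

# Uniform noise: both classes are corrupted at the same rate.
uniform = inject_label_noise(clean, {0: 0.20, 1: 0.20})

# Class-weighted noise: one class is corrupted more heavily than the other.
weighted = inject_label_noise(clean, {0: 0.30, 1: 0.10})
```

Both settings flip 40 of the 200 labels here, but the class-weighted version concentrates 30 of the flips in class 0.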

Classification

We use deep transformers and classifiers to perform suicide vs depression classification. After extensive experimentation, SDCNL achieves upwards of 95% accuracy, outperforming numerous other studies on a more difficult task.

ROC curves of the top four models with label correction (red) against the same models without label correction (blue).

ROC curves of the four best models on our task (proposed) against the conventional suicide vs healthy task (standard).
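For readers who want to reproduce curves like these from their own model outputs, the following is a small standard-library sketch of how ROC points and the area under the curve can be computed from binary labels and predicted scores. The function names and the toy data are illustrative only and are not taken from the SDCNL code.

```python
from typing import List, Sequence, Tuple


def roc_points(
    y_true: Sequence[int], scores: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct score threshold, starting at (0, 0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    # Sweep the threshold from the highest score downwards.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, scores)
area = auc(curve)
```

Plotting the `curve` points for two models on the same axes gives a comparison of the kind shown in the figures above.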

Paper

Poster

Citation

@article{haque2021deep,
  title = {Deep Learning for Suicide and Depression Identification with Unsupervised Label Correction},
  author = {Ayaan Haque and Viraaj Reddi and Tyler Giallanza},
  year = {2021},
  eprint = {2102.09427},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}
