Mental Health Counseling using LLMs

Aman Arham
May 1, 2024

By: Aman Arham, Nikunj Mehta, Nathan Sandoval, Anvit Raju, Arsh Gamare, Hao-Zhe Loo

In this article, we present our project on using large language models (LLMs) to provide mental health counseling. We walk through the entire process, starting with the dataset, then training and testing the models, and finally making predictions and comparing the results against other, more sophisticated models.

Introduction

Motivation

In society, there has been a rising prevalence of mental health disorders affecting individuals of all ages, from children to adults. As a result, there is an urgent need for innovative and accessible interventions. According to the World Health Organization, nearly 970 million people are affected by mental health conditions, with anxiety and depression ranking as the most common. This staggering number presents a severe health challenge and impacts economic productivity and overall quality of life. Mental health disorders are not only associated with significant healthcare costs but also lead to substantial productivity losses, further exacerbating the societal burden. Furthermore, people with mental health problems have a reduced life expectancy of roughly 10 to 20 years. Because of these severe consequences, there is an immediate need for solutions that help alleviate the pressures individuals are facing, sooner rather than later.

In parallel with the rapid increase in mental health problems, artificial intelligence (AI) has gained significant momentum, with generative AI (GenAI) becoming more commonly used. Given this simultaneous growth of mental health issues and GenAI, leveraging large language models (LLMs) for mental health counseling could provide a transformative way to address these challenges. By harnessing the power of LLMs, such projects aim to develop tools that offer consistent, empathetic, and immediate support to individuals experiencing mental health issues. As a group, we set out to take several LLMs and train them on a dataset of mental health counseling conversations. Based on those results, our main objective is to compare the outputs of the different LLMs, using ChatGPT as a baseline, and determine which LLM provides the best response to the user. This project will help show whether an LLM can serve as a mental health resource, making quality care available to a broader audience irrespective of geographical and economic barriers. In addition, integrating AI into mental health services could enhance the efficiency and effectiveness of treatments, support mental health professionals, and contribute to an overall decrease in mental health disorders.

Goal

The primary goal of this project is to investigate the efficacy of LLMs fine-tuned explicitly for mental health counseling. We do this by making detailed comparisons of the models, including Llama, Google Gemma, and Mistral AI, to identify which one performs best and most accurately delivers empathetic, contextually relevant responses to people facing ongoing mental health challenges. Using ChatGPT as a ground-truth benchmark, we train each model on a diverse dataset of mental health interactions and evaluate its ability to provide accurate, empathetic, and contextually appropriate responses against ChatGPT's. Ultimately, we want to determine whether LLMs can serve as reliable and accessible mental health resources that reduce healthcare disparities and improve the efficiency of mental health services. Through this comparative analysis, the project contributes to understanding how artificial intelligence can enhance mental health care delivery, thereby addressing the increasing prevalence of mental health issues and improving overall public health outcomes.

Models

A series of comparisons are going to be made. In our project, we use Llama 3, Mistral AI, and Google Gemma as the three main LLMs that we fine-tune on the dataset. These LLMs were released earlier in the year and are open source, and we accessed each model's repository on Hugging Face. Llama 3 is the newest model, with eight billion parameters, comparable to Mistral's seven billion. Gemma also has a seven-billion-parameter version; however, to better see the difference in responses, we decided to use the two-billion-parameter version of Google Gemma, hypothesizing that its output would be less sophisticated than that of the other LLMs.

  • Llama 3: meta-llama/Meta-Llama-3-8B
  • Mistral 7B: mistralai/Mistral-7B-Instruct-v0.2
  • Gemma 2B: google/gemma-2b

We used the GPT-3.5 Turbo API as a baseline and passed in the same inputs we gave the other LLMs. We chose GPT as a baseline for three reasons: ChatGPT is the most prevalent of these models in society, so many people would already turn to it for advice; it is a large, highly capable model, which makes it a strong reference point; and it is very reliable because of the vast dataset used to train it. Since each invocation of the GPT API costs roughly $0.002, we were able to make a series of requests, collect the outputs, and compare them with the fine-tuned LLMs. Using ChatGPT, we aim to gauge how the other LLMs perform on mental health counseling tasks relative to a model that was not trained on our specific datasets.
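For context, here is a minimal sketch of how such a baseline call can be made with the OpenAI Python SDK; the system prompt and sampling settings shown are illustrative assumptions, not necessarily the exact values we used:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # Illustrative system prompt; ours may have differed.
        {"role": "system", "content": "You are an empathetic mental health counselor."},
        {"role": "user", "content": "I have trouble sleeping and keep having panic episodes. What should I do?"},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)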

Dataset

We used two main datasets: Amod/mental_health_counseling_conversations and jkhedri/psychology-dataset.

The first dataset is a collection of questions and answers from online counseling and therapy platforms. It consisted of over 3,500 rows with two columns: context and response. Context is essentially the input of the model, where the user gives information regarding their current situation. In addition, the response is the counseling advice given to the individual based on that context. The dataset repeats a series of contexts, providing multiple different types of responses that can be taken for a single given context.

Dataset 1 Example

The second dataset is a series of contexts given by an individual, each paired with possible responses. It has around 10,000 rows with three columns: one column with the context and two columns with different responses, one being more empathetic than the other.

Dataset 2 Example

Since the second dataset has three columns, we removed the column with the less empathetic responses and merged the two datasets together. This lets us train the models on a larger, more targeted dataset, as sketched below.
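For illustration, here is a rough sketch of that preparation step with the Hugging Face datasets library; the column names used below (Context, Response, question, response_j, response_k) are assumptions based on the dataset cards and may need adjusting:

from datasets import load_dataset, concatenate_datasets

counseling = load_dataset("Amod/mental_health_counseling_conversations", split="train")
psychology = load_dataset("jkhedri/psychology-dataset", split="train")

# Drop the assumed less empathetic response column and align the schemas.
psychology = psychology.remove_columns(["response_k"])
psychology = psychology.rename_columns({"question": "context", "response_j": "response"})
counseling = counseling.rename_columns({"Context": "context", "Response": "response"})

# Merge the two datasets into one larger, targeted training set.
combined = concatenate_datasets([counseling, psychology]).shuffle(seed=42)
print(combined)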

Training and Configuration

Hardware

Initially, we began training the models on our local PCs. However, given the sheer size of the models, there was not enough local RAM to load and train them. As a result, we shifted to Google Cloud Platform, using Google Colab Pro and Colab Enterprise, where we had access to state-of-the-art A100 GPUs for training and inference. Once proper resources were available, using the model repositories from Hugging Face was much easier and more effective, taking around 30 to 45 minutes to train a model on a dataset of 13,000 rows.

Software

We first load the tokenizer and the model using the following commands:

from transformers import AutoTokenizer, AutoModelForCausalLM

# bnb_config (our 4-bit quantization settings) and attn_implementation are
# defined earlier in the notebook; a plausible bnb_config is sketched below.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation,
)

The tokenizer transforms our dataset's text input into a format the LLM can read. base_model is a string that specifies which LLM to load from Hugging Face; in our training, we set it to "meta-llama/Meta-Llama-3-8B", "google/gemma-7b", and "mistralai/Mistral-7B-v0.1". quantization_config specifies which quantization method we used. In our case, we used 4-bit quantization, which reduces both memory usage and computational cost; this matters because the models contain billions of parameters.
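Since bnb_config is referenced but not shown above, here is a plausible 4-bit configuration using bitsandbytes; the quantization type and compute dtype are assumptions rather than our exact settings:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization (assumed)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype (assumed)
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)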

We then load and shuffle the dataset from local storage using the following commands. This dataset combines the responses from Amod/mental_health_counseling_conversations and the invalid (less empathetic) responses from jkhedri/psychology-dataset, so the data has three columns: prompt, chosen, and rejected. The rejected column is required by ORPOTrainer.

dataset = load_dataset('json', data_files='/content/dataset4pm.json', split='all')
dataset = dataset.shuffle(seed=42)

Additionally, we have to convert the chosen (expected response) rows and the rejected (invalid response) rows in our dataset into the format expected by the LLM, so we apply the tokenizer.apply_chat_template() method to the entries. Finally, we create a train/test split of our data. By the end, dataset is a dictionary of training and testing data that has been prepared according to the specific LLM's formatting. The code is as follows:

import os

# Convert the chosen/rejected conversations into the model's chat format.
def format_chat_template(row):
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row


dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),  # parallelize preprocessing across CPU cores
)
dataset = dataset.train_test_split(test_size=0.1)

The training method we are using is ORPO (Odds Ratio Preference Optimization), and the code is as follows:

from trl import ORPOConfig, ORPOTrainer

# Hyperparameters for ORPO training.
orpo_args = ORPOConfig(
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    beta=0.1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=20,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="wandb",
    output_dir="./results/",
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,  # LoRA configuration (see the sketch below)
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model(new_model)  # new_model is the name/path for the fine-tuned model

ORPOConfig specifies the hyperparameters for ORPOTrainer. In ORPOTrainer, we define the LLM to train, the hyperparameters, the training and evaluation data, the PEFT configuration, and the tokenizer. In this case, we use LoRA for the PEFT, which allows for faster training because it updates far fewer parameters. The tokenizer simply converts the raw text into tokens our models can understand. After training, we save the model.
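The peft_config passed to ORPOTrainer is the LoRA configuration mentioned above. A plausible sketch is shown below; the rank, alpha, and target modules are illustrative assumptions:

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                     # LoRA rank (assumed)
    lora_alpha=32,            # scaling factor (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
)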

To test the specific model, we use a pipeline function. The code creates a prompt from the defined messages. The prompt includes the existing conversation and some additional generation context. The terminator variable specifies end-of-sequence tokens that signal where the generation should stop.

The code then calls the generation pipeline, specifying the temperature (which controls the diversity of the output), the maximum number of new tokens, and other sampling parameters. The generated text is extracted from the output, and the prompt is stripped off to keep only the new response, which the code then prints.

Using Pipeline to test output
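The image above shows our pipeline call; a minimal sketch of the same idea is given below. The example user message and sampling values are illustrative assumptions, and the <|eot_id|> terminator applies specifically to Llama 3:

from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Hypothetical conversation to test the fine-tuned model.
messages = [
    {"role": "user", "content": "I have trouble sleeping and keep getting anxious. What should I do?"},
]

# Build the prompt from the messages using the model's chat template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# End-of-sequence tokens that signal where generation should stop.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),  # Llama 3 end-of-turn token
]

outputs = generator(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.7,  # controls the diversity of the output (assumed value)
    top_p=0.9,
)

# Strip the prompt from the generated text to keep only the new response.
print(outputs[0]["generated_text"][len(prompt):])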

Results

The results for each of the models are very interesting and promising. The LLMs performed better than we initially expected and gave responses that could genuinely help individuals suffering from mental health problems.

To evaluate the quality of a response, we looked at two components: the empathy of the answer and the action items it provides. Typically, when seeing a therapist, having someone empathize with the problems you are facing helps alleviate some of the burden and stress. Research backs this up: empathy is a fundamental component of effective psychotherapy. A meta-analysis by Elliott, Bohart, Watson, and Greenberg (2011) found that therapist empathy is a moderately strong predictor of therapy outcome, indicating that a therapist's ability to empathetically understand and resonate with a client's feelings and experiences correlates positively with improved outcomes.

Furthermore, actionable guidance from a therapist is also effective for a number of mental health disorders. For example, Cognitive Behavioral Therapy (CBT), known for its directive and problem-focused approach, is one of the most researched and validated forms of therapy, showing high efficacy in treating a range of psychological problems. For these reasons, we used empathy and actionability as the main criteria for rating and analyzing the responses given by each model.

To score the responses, we used a two-phase approach. First, each of us rated the response on a scale of one to ten, highlighting the components we liked and how effective the response would be if we were in the same situation as the patient. Then, for a third-party view, we asked Claude, a GenAI service from Anthropic, to grade each response on the same scales as if it were a therapist. Finally, we combined and averaged the scores for each response.

Here is the first response we inputted into the models:

I’m dealing with a lot of problems at the moment. First of all, there are evenings when I have a lot of difficulty falling asleep and nights when I sleep a lot yet am still really weary. In addition, I’m getting more easily agitated and having hour-long panic episodes. Do I have a problem, and if so, what should I do about it?

Each of the models provided a decent output, helping resolve the problems the patient was facing. Here are responses from each LLM and an explanation of the effectiveness of each model.

Llama:

It’s good you’re exploring the reasons behind your feelings. Are you stressed about finances, relationships, or employment? This can disrupt sleep and increase worries. Additionally, check if there’s a family history of depression, as symptoms like insomnia, fatigue, irritability, and anxiety could be signs. Also, be aware of severe symptoms like feelings of worthlessness or suicidal thoughts, which need immediate attention. I recommend seeing a doctor for a thorough assessment and monitoring your sleep patterns and triggers. Take care and best wishes.

Based on this response, Llama received a score of 8/10 for empathy and 9/10 for actionability. The response is fairly empathetic: Llama understands and acknowledges the person's stress, asking about the stressors the patient may be facing across several areas of life. In addition, the response maintains a supportive and caring tone, closing with a kind remark. The response also provides a series of concrete steps to diagnose and identify the underlying problem: Llama advises the patient to monitor sleeping patterns and triggers and to look into family history for potential causes of mental health disorders. Finally, it advises seeking help from a professional if severe symptoms appear. All in all, Llama's response shows the empathy needed from a therapist and provides a set of tasks for the patient that may help solve the problem.

Mistral:

It sounds like you’re navigating a difficult period, grappling with sleep troubles and heightened agitation, which may hint at underlying stress or anxiety. Seeking support from a healthcare provider or therapist is crucial; they can offer insights into the root causes and recommend tailored treatments, be it therapy, medication, or lifestyle changes. Taking stock of potential stress triggers in your life and implementing coping mechanisms is also beneficial. Additionally, prioritizing good sleep hygiene, including establishing a soothing bedtime routine and maintaining a consistent sleep schedule, can contribute to better sleep quality and overall well-being.

Looking at the next LLM, Mistral also gives a fairly decent response, scoring 8/10 for empathy and 8/10 for actionability. This response is very similar to Llama's, demonstrating a similar level of empathy by acknowledging the difficulty of the client's situation and the challenges they face with sleep and agitation. In addition, the response provides a number of ways to reduce stress, such as improving sleep hygiene and managing stress triggers. Mistral also advises seeing a professional for a more thorough analysis. Mistral did not, however, consider family history as a potential cause of the mental health problems and therefore receives a slightly lower score for actionability.

Gemma:

It seems like you are experiencing more irritability, difficulty falling asleep at night, and a growing sense of anxiousness. There are a lot of potential causes for the problems you are observing, and a qualified therapist might be able to offer some assistance. You might be able to understand these experiences better in therapy.

Gemma’s response is very basic and does not compare with the other responses given by Mistral and Llama. The response shows little empathy for the patient, only acknowledging the symptoms the client is facing. In addition, the only action given by Gemma is to see a therapist. As a result, this response received a 6/10 empathy score and 6/10 actionability score.

ChatGPT:

It sounds like you’re going through a tough time. It’s important to acknowledge that what you’re experiencing is valid and deserves attention. While I’m not a healthcare professional, the symptoms you’ve described could indicate various issues, including stress, anxiety, or even a sleep disorder. Seeking support from a healthcare provider, such as a therapist or a doctor, would be a good step forward. They can help you understand what might be causing these symptoms and provide you with appropriate guidance and treatment options. In the meantime, practicing relaxation techniques like deep breathing, meditation, or gentle exercise might help manage some of your symptoms. Remember, it’s okay to ask for help, and taking care of your mental and physical health is important. You’re not alone in this, and there are people who can support you through it.

Looking at ChatGPT, it gave a very good response to the prompt. The response is highly empathetic, validating the patient’s feelings and acknowledging the difficulty of their situation. GPT is able to communicate a sense of understanding and concern. Therefore, it was given an empathy score of 8/10, similar to Llama. In terms of action items, the response is quite actionable. GPT encourages the patient to seek support from a healthcare provider and offers practical advice on relaxation techniques like deep breathing, meditation, and gentle exercise that the patient can begin immediately. Because of these, GPT scored a 9/10 for actionability.

After looking at the responses, the general consensus is that Llama, Mistral, and ChatGPT are very good at giving counseling advice, with Llama and ChatGPT being the best. In the presentation, we showcase another prompt with great results, further demonstrating the LLMs' capabilities when trained on a sufficiently large dataset. However, these results were not always the case: some prompts produced weaker results from the LLMs.

One prompt we would like to showcase is the following:

A friend studying psychology suggested I see my doctor to check for major depressive disorder, but I’m worried about the implications. If diagnosed, it could affect my record and possibly prevent me from pursuing a career as a counselor among other things. I’m also hesitant about taking antidepressants. Earlier this year, I had a severe reaction and passed out after taking medication prescribed for a dislocated hip. Many people I know discourage using antidepressants, believing it’s better to seek treatment without them. I’m concerned they might send me to a clinic and insist on medication if they think I can’t make decisions for myself. While I’m not suicidal, I’m also not taking good care of myself. I’ve heard that electroshock therapy is still used, which I’m not comfortable with. Given all this, I’m considering self-help and trying to manage on my own. I’ve been struggling emotionally since elementary school, and despite trying to stay positive after graduation, I still feel unwell.

This prompt was generated using Claude; it is much more thorough and provides deeper context about the situation the patient is going through. Because it is more sophisticated, the LLMs have a harder time comprehending the situation. When confused, the LLMs' only action item is to see a medical professional or therapist. Here are the responses we received.

Llama:

Getting a professional evaluation for depression might help identify the most effective treatments for you. It’s important to openly express any concerns about medications or how a diagnosis might affect your career to your therapist. Receiving expert advice can greatly assist your self-management strategies and enhance your mental well-being.

Mistral:

Seeking a diagnosis for depression can lead to tailored treatment options; discuss any concerns about medication and career impacts with a therapist. Professional guidance can complement your self-help efforts and improve your overall mental health management.

Gemma:

Consult a therapist for personalized depression treatment options and discuss any worries about medication or how it could affect your career. Professional help can boost your self-management strategies.

ChatGPT:

It’s completely natural to have concerns about seeking help for your emotional struggles, but prioritizing your mental health is crucial. Remember, seeking help doesn’t necessarily lead to a diagnosis or affect your future career prospects negatively, as mental health professionals prioritize confidentiality. If you’re uncomfortable with antidepressants or have had negative experiences with medication, there are alternative treatments like therapy and self-help strategies to explore. Electroconvulsive therapy (ECT) is reserved for severe cases and always conducted with careful consideration and informed consent. While self-help strategies can be beneficial, they may not be sufficient alone, so consider reaching out to a mental health professional for tailored support and guidance. Taking care of your mental health is essential, and seeking help is a courageous step towards living a fulfilling life.

In these responses, we can see that ChatGPT produced a very thorough response, much like its answer to the first prompt. The other LLMs, however, were not able to give effective counseling and defaulted to recommending a therapist. We noticed this trend with more complex prompts. It shows that although these LLMs can respond to a wide variety of contexts, they are not ready to handle full mental health counseling and are not as generalizable as ChatGPT.

Conclusion

In the field of mental health counseling, using large language models (LLMs) has proven to be a promising avenue of support for people seeking guidance and understanding. In exploring the models' results, we found that each LLM we tested presented unique approaches and nuances in its outputs. Although there are occasional inaccuracies, there is real potential for building mental health counseling tools using LLMs, particularly larger models with more parameters.

When its responses are accurate, Llama stands out for its empathy, establishing a genuine emotional connection with users. Its personalized touch and empathetic tone promote a sense of understanding and companionship, which are very important elements of mental health support. Mistral shares Llama's capacity for emotional connection while leaning slightly more toward referring users to medical professionals, making it a bit less actionable.

Gemma, on the other hand, comes across as more robotic, indicative of its smaller parameter count. While it is still useful, its brief, straightforward approach may lack the warmth and depth desired in mental health counseling. This was expected given the smaller size of the LLM. Still, it is important to note that each LLM brings its own strengths to the table and caters to a diverse range of user preferences and needs.

In our evaluation, we found ChatGPT to be a very strong competitor to Llama and Mistral in its ability to provide meaningful counsel, despite not being trained on our dataset. Its advice aligns closely with their empathetic approach, creating a supportive environment for mental well-being.

Nevertheless, our evaluation process came with challenges. Accessing sufficient computational resources was a hurdle, highlighting the importance of robust infrastructure in making use of the full potential of LLMs. Navigating through the various options offered by platforms like Hugging Face underscored the need for clarity and guidance in selecting the most suitable models for specific applications.

Looking ahead, there is a lot of potential for expansion and refinement. Collaborating with therapists to evaluate the efficacy of LLM responses from a clinical perspective promises valuable insights, sharpening our understanding of what constitutes effective counseling. Additionally, widening the scope of testing to a more diverse range of prompts would further probe LLM capabilities in mental health support. The possibilities are endless. Using larger models with around 400 billion parameters may provide the best results on this dataset; because of our limited resources we were not able to test that, but it is a promising avenue to explore in the future.

Our exploration into the field of mental health counseling with LLMs has shown a path filled with empathy, understanding, and potential. As we continue to utilize these technologies' power, we must remain committed to fostering holistic well-being and support for all.
