Causal language plays a critical role in scientific communication, as it shapes public understanding, informs policy, and impacts healthcare decisions. When causal statements are ambiguous or misleading, they can lead to confusion and misinterpretation of research findings. This study explores how large language models (LLMs) can improve causal language in academic writing. It employs a two-step approach: first, distinguishing non-causal from causal statements, and second, classifying causal sentences as correlational, conditional causal, or direct causal. The models were fine-tuned on a blended dataset of general-purpose (news, web) and scientific (social science, biomedical) human-labeled sentences. The BERT-based classifier achieved a macro F1-score of 0.94 for detecting causal versus non-causal sentences, while SciBERT attained 0.83 in distinguishing correlational, conditional causal, and direct causal statements. To explore how these classifiers can be applied in practice, a tool was developed to analyze scientific papers and texts, offering personalized warnings and highlighting potential inconsistencies in causal reasoning. By providing researchers with a (visual) overview of causal strength and alignment with study design, the tool supports clearer, more precise communication of research findings. This study demonstrates how LLMs can enhance the clarity and precision of causal language in academic writing, offering a scalable approach to improving scientific communication.
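As a rough illustration of the two-step approach, the sketch below chains two Hugging Face text-classification pipelines. The checkpoint paths and label strings are placeholder assumptions, not the fine-tuned models produced in this study.

```python
# Minimal sketch of the two-step classification, assuming fine-tuned BERT and
# SciBERT checkpoints are available locally (paths and label names are
# placeholders, not the actual released models).
from transformers import pipeline

# Step 1: causal vs. non-causal detection (fine-tuned BERT)
causal_detector = pipeline("text-classification", model="models/bert-causal-detection")
# Step 2: correlational / conditional causal / direct causal (fine-tuned SciBERT)
strength_classifier = pipeline("text-classification", model="models/scibert-causal-strength")

def classify_sentence(sentence: str) -> str:
    """Return 'non-causal' or one of the three causal-strength labels."""
    if causal_detector(sentence)[0]["label"] == "non-causal":
        return "non-causal"
    return strength_classifier(sentence)[0]["label"]

print(classify_sentence("Higher coffee intake was associated with lower mortality."))
```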
Final tool functionalities
Summary
Look at the classifications per section and get a summary of the study design along with writing tips
Align claims
Toggle sections in or out of your view to check whether the claims made in the abstract still match the strength of the claims in the conclusion
Lenient vs. strict classifications
You can decide whether to see all of the model's classifications or only the ones it is highly confident about (see the sketch after this feature list)
Explanations
Let the model explain its decision and ask a follow-up question
Warnings
Actionable tips based on what the model found
Try the demo version here
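As a rough illustration of the lenient vs. strict toggle, the sketch below filters predictions by the classifier's confidence score. The 0.9 threshold is an assumed value, not the tool's actual setting.

```python
# Sketch of a lenient/strict toggle: keep every prediction in lenient mode,
# only high-confidence ones in strict mode. The threshold is an assumption.
STRICT_THRESHOLD = 0.9

def filter_predictions(predictions: list[dict], strict: bool = False) -> list[dict]:
    """Each prediction is a dict with 'sentence', 'label', and 'score' keys."""
    if not strict:
        return predictions
    return [p for p in predictions if p["score"] >= STRICT_THRESHOLD]
```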
Context
Causal language is often misused in scientific papers, leading to confusion about the implications of findings
An example of how quickly the media can misinterpret unclear causal language in scientific papers
An example of an unclear causal statement
Three levels of causality
As humans, we can clearly distinguish only three categories of causal relationships, so these are the labels used for fine-tuning: correlational, conditional causal, and direct causal.
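For illustration, the three labels could be represented as follows; the example sentences are invented to show the distinction and are not taken from the training data.

```python
# Illustrative mapping of the three fine-tuning labels to invented example
# sentences (assumptions for clarity, not items from the training data).
CAUSAL_LABELS = {
    "correlational": "Coffee intake is associated with lower mortality.",
    "conditional causal": "Coffee intake may reduce mortality in older adults.",
    "direct causal": "Coffee intake reduces mortality.",
}
```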
Training data
The human-labeled training data was compiled from existing datasets
Dataset usage
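A minimal sketch of how such a blend could be assembled with the Hugging Face `datasets` library; the file names, column layout, and split ratio are assumptions, not the actual sources.

```python
# Sketch: blending general-purpose and scientific sentence datasets into one
# training set. Each CSV is assumed to have 'sentence' and 'label' columns.
from datasets import load_dataset, concatenate_datasets

sources = [
    "news_sentences.csv",            # general-purpose
    "web_sentences.csv",             # general-purpose
    "social_science_sentences.csv",  # scientific
    "biomedical_sentences.csv",      # scientific
]
parts = [load_dataset("csv", data_files=path)["train"] for path in sources]
blended = concatenate_datasets(parts).shuffle(seed=42)
splits = blended.train_test_split(test_size=0.2, seed=42)
```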
Model selection: is bigger always better?
BERT vs. GPT architecture
Evaluation (2 labels)
Confusion matrices (2 labels)
Evaluation metrics comparison, with BERT as the best-performing model
Evaluation (3 labels)
Confusion matrices (3 labels)
Evaluation metrics comparison, with SciBERT as the best-performing model
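A minimal sketch of how the reported confusion matrices and macro F1-scores can be computed with scikit-learn; the label lists below are illustrative placeholders, not the actual held-out test set.

```python
# Sketch: confusion matrix and macro F1-score for the three-label task.
# The y_true / y_pred lists are placeholders, not real evaluation data.
from sklearn.metrics import confusion_matrix, f1_score

labels = ["correlational", "conditional causal", "direct causal"]
y_true = ["direct causal", "correlational", "conditional causal", "correlational"]
y_pred = ["direct causal", "correlational", "direct causal", "correlational"]

print(confusion_matrix(y_true, y_pred, labels=labels))
print("macro F1:", f1_score(y_true, y_pred, labels=labels, average="macro"))
```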
Evaluation of the final models
Learning curves showing the training of the final models
Misclassifications by the final model that are also very hard for a human to classify, showing the complexity of this task
Integration into a Tool
Ten best practices for writing causal language, derived from the literature, are used to provide actionable tips
First, the scientific paper in PDF format is processed by a service called GROBID to create a structured XML file. All headers, figures, and other elements are then recognized and organized by a Python script; the references and the introduction are removed, as they do not need to be classified by the model. Every remaining sentence is first classified as causal or non-causal, and all causal sentences are then classified as correlational, conditional causal, or direct causal. Finally, the methods section is passed to a Llama model to provide a summary of the paper and its study design.
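A minimal sketch of this pipeline, assuming a local GROBID server and the two fine-tuned classifiers from the earlier sketch; the GROBID URL and model paths are placeholders, and the Llama summary step is only indicated by a comment.

```python
# Sketch of the processing pipeline: PDF -> GROBID TEI XML -> sentences ->
# two-step classification. URLs and model paths are placeholder assumptions.
import requests
from xml.etree import ElementTree as ET
from transformers import pipeline

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

causal_detector = pipeline("text-classification", model="models/bert-causal-detection")
strength_classifier = pipeline("text-classification", model="models/scibert-causal-strength")

def pdf_to_sentences(pdf_path: str) -> list[str]:
    """Convert a PDF to TEI XML via GROBID and return a naive sentence split of the body text."""
    with open(pdf_path, "rb") as f:
        tei_xml = requests.post(GROBID_URL, files={"input": f}).text
    body_paragraphs = ET.fromstring(tei_xml).findall(".//tei:body//tei:p", TEI_NS)
    text = " ".join("".join(p.itertext()) for p in body_paragraphs)
    # The real script also drops the introduction and references and keeps
    # track of section headers; this naive split is for illustration only.
    return [s.strip() for s in text.split(". ") if s.strip()]

def classify_paper(pdf_path: str) -> list[dict]:
    """Two-step classification of every body sentence; non-causal sentences are skipped."""
    results = []
    for sentence in pdf_to_sentences(pdf_path):
        if causal_detector(sentence)[0]["label"] == "non-causal":
            continue
        prediction = strength_classifier(sentence)[0]
        results.append({"sentence": sentence,
                        "label": prediction["label"],
                        "score": prediction["score"]})
    # The methods section would additionally be passed to a Llama model to
    # summarize the paper and its study design.
    return results
```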
Tool interface screenshot
The tool was tested by five researchers, who reviewed their own papers for causal language usage