AI in Literature Reviews: Hype or Helpful?

Following on from ISPOR 2025 held in Montreal earlier this year, the use of AI in literature reviews remained a headline topic amongst the posters, panels and presentations at ISPOR Europe. We’re seeing a more mature ecosystem, with literature review platforms and products increasingly claiming the capability of plugging into databases, generating evidence syntheses, and evolving with ongoing updates. Regulators and health technology assessment (HTA) bodies are part of the conversation, with transparency, auditability, and human-in-the-loop oversight becoming baseline expectations for submissions incorporating AI.

GenAI in Literature Reviews: Where We Are and Where We’re Going

Research by Nested Knowledge identified a staggering 138 ISPOR Europe 2025 abstracts mentioning AI or large language models (LLMs), with the dominant application being evidence synthesis (60 studies including systematic literature reviews [SLRs], targeted literature reviews, screening, data extraction and quality assessment).1 Our poster on this topic took the form of an SLR to identify primary research studies that reported time or workload saved from applying AI to a specific aspect of a literature review.2 Most literature reviews using AI applied it to the screening stage (45/56), with only two reviews using AI for extraction, demonstrating that progress in this area wasn’t living up to the noise (as of June 2025). The median workload saved was 65%, the median time saved was 60%, and the median time saved per study was 1.02 minutes. Authors were more likely to be positive or cautiously positive than negative about the potential of AI to help conduct literature reviews. It should, however, be noted that the assumptions made about the time taken by a human reviewer appear unrealistic in some studies, creating outliers: one study reported saving 15.5 hours per risk of bias assessment, when a Costello Medical reviewer with 1–2 years’ experience takes no more than 1–2 hours using the Cochrane-recommended Risk of Bias 2.0 tool. Assumptions about human reviewer time should therefore be critically assessed when interpreting time-saving metrics.
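For readers unfamiliar with these metrics, the underlying arithmetic is simple. A minimal sketch, with definitions and numbers that are ours for illustration rather than taken from the poster:

```python
def workload_saved(total_records: int, records_needing_human_review: int) -> float:
    """Fraction of screening workload removed when AI pre-screens records.

    Illustrative definition: the proportion of records a human reviewer
    no longer needs to assess manually.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 1 - records_needing_human_review / total_records


def time_saved_per_study(total_minutes_saved: float, n_studies: int) -> float:
    """Average minutes saved per study included in the review."""
    return total_minutes_saved / n_studies


# Example: AI excludes 6,500 of 10,000 records, leaving 3,500 for human review
print(workload_saved(10_000, 3_500))  # 0.65, i.e. 65% workload saved
```

As the poster's outlier discussion shows, the output of such formulas is only as credible as the assumed human baseline that feeds into them.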

[Figure: Two graphs showing percent workload saved and time saved]

In an exciting first step for AI in HTA submissions, Pharmacoevidence Pvt. Ltd. and Gilead Sciences presented “Gen AI in Systematic Literature Reviews: The First Case Study on GenAI in a National Institute for Health and Care Excellence (NICE) Submission”.3 AI was implemented as the second reviewer in the title/abstract screening stage of the submission’s SLRs, with a reported two-week reduction in timelines. Decision alignment between the AI and human reviewers ranged from 89.01% (humanistic burden review) to 99.59% (economic evaluation review); any disagreements were resolved by an independent human reviewer. However, the presenters reported that all excluded citations, even those where both reviewers agreed, were re-checked by a second, independent human reviewer to confirm that no relevant studies had been missed. This raises the question: how much time was actually saved?
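Decision alignment of this kind is simply the share of records on which the AI and human reviewer made the same include/exclude call. A minimal sketch (the decisions below are illustrative, not taken from the submission):

```python
def percent_agreement(human: list[str], ai: list[str]) -> float:
    """Share of screening decisions where the AI and human reviewer agree."""
    if len(human) != len(ai) or not human:
        raise ValueError("decision lists must be non-empty and equal length")
    matches = sum(h == a for h, a in zip(human, ai))
    return 100 * matches / len(human)


# Illustrative title/abstract decisions ("include"/"exclude")
human = ["include", "exclude", "exclude", "include", "exclude"]
ai = ["include", "exclude", "include", "include", "exclude"]
print(f"{percent_agreement(human, ai):.2f}%")  # 80.00%
```

For a metric that also corrects for chance agreement, Cohen's kappa is the usual alternative, particularly when the exclude rate is very high.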

HTA Body Concerns and Our Response: A Practical Playbook

Alongside the reporting of the first NICE submission using AI, a representative from NICE presented their five key concerns surrounding the current status of AI use in SLRs.3 We report these below, along with practical mitigations that we’re embedding into Costello Medical workflows based on our research in this area.

Context Bias
Description: Occurs when LLMs are biased towards the prompter, often when the prompt template is too complex and the LLM learns the ‘voice’ of the user.
Potential mitigations: Independently verify prompts.
What are we doing? Our AI-enabled literature review workflows use a 100% human-in-the-loop approach to validate AI outputs. Our prompt development and testing research also includes multiple mitigations against bias, including the use of a benchmark compiled by two independent reviewers and the testing of prompts on benchmarks they weren’t developed on to assess generalisability. The results of this research are published in a peer-reviewed journal, to enable external critique.4
What comes next? We plan to develop and publish a test data set, which can be used by external bodies to verify reported data accuracy metrics.

Benchmark Bias
Description: Occurs when the benchmark standard itself has errors, e.g. when AI outputs are compared against a human ‘gold standard’ that contains errors due to mislabelling or subjectivity.
Potential mitigations: Independently verify prompts.
What are we doing? As above.

Explainability
Description: LLMs are ‘black boxes’; results should be not only plausible but also verifiable and justifiable.
Potential mitigations: Request output justifications from the LLM, with exact quotes.
What are we doing? Our AI-enabled literature review workflows include highlighting by the AI model in the evidence base to track the sources of AI-generated outputs and enable quick human verification.

Hallucinations
Description: When LLMs ‘make up’ plausible but incorrect results.
Potential mitigations: Prompt engineering using retrieval-augmented generation (RAG) and self-consistency. RAG combines a retriever, which gathers information from a knowledge base, with an LLM that writes the answer based on this information. Self-consistency increases reliability by generating multiple outputs and selecting the most likely final answer from these.
What are we doing? Our AI-enabled literature review workflows use a 100% human-in-the-loop approach to guard against hallucination.
What comes next? We plan to implement agentic AI approaches in our end-to-end literature review platform, e.g. a conflict resolution/review system with two LLMs taking the role of systematic reviewers and a third LLM as an adjudicator with different instructions to the task execution agents, or crowdsourcing (multiple LLM agents evaluate the same data sources and produce outputs; a central aggregator chooses the most common output).

Algorithmic Bias
Description: Occurs when a model favours one subset of data over another, for example particular medical domains.
Potential mitigations: Present performance across multiple domains.
What are we doing? We transparently report our data extraction accuracy metrics, including the data domain and volume of data points they are based on and the transferability of the prompts across different disease areas, in our manuscript ‘Harnessing LLMs for Efficient Data Extraction in SLRs’.4
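The self-consistency and crowdsourcing ideas above share a single aggregation mechanism: sample several answers and keep the most common one. A minimal sketch with a stubbed model, where `sample_llm` is a placeholder callable rather than any real platform's API:

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_llm: Callable[[str], str], prompt: str, k: int = 5) -> str:
    """Run the same prompt k times and return the most common answer.

    This is the aggregation step shared by self-consistency (one model,
    repeated sampling) and crowdsourcing (several agents, one aggregator).
    """
    answers = [sample_llm(prompt) for _ in range(k)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Stubbed "model" that is noisy but right more often than not
responses = iter(["exclude", "include", "exclude", "exclude", "include"])
stub = lambda prompt: next(responses)
print(self_consistent_answer(stub, "Screen this abstract", k=5))  # exclude
```

The adjudicator pattern differs only in replacing the majority vote with a third model that is given both candidate answers and asked to resolve the conflict.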

Navigating a Crowded Bot Landscape: How to Assess Suitability of AI-Enabled Platforms for Your Literature Review Needs

Perusing the exhibitor booths in the main hallway at ISPOR Europe, we were immediately struck by the increasingly crowded landscape of AI-enabled literature review platforms. At least five new platforms have popped up since ISPOR 2025 (Montreal) in May; this rapid proliferation of tools leads to procurement uncertainty and the challenge of assessing variable performance between platforms.

Wednesday’s Issue Panel ‘Battle of the Bots: Navigating the Landscape of AI-Enabled SLR Platforms’ comprised representatives from Nested Knowledge, MadeAI and DistillerSR, who were understandably keen to use the forum to promote their platforms.5 Tough questions from the audience helped delve into where AI is currently least successful in assisting the literature review process. The panellists agreed that AI generates the least successful outputs in review stages which are more subjective (deriving information from publications, conducting critical appraisals and making conclusions). It was also suggested that quantitative data extraction will always be difficult to achieve accurately with AI due to the heterogeneity of reporting. Data extraction with AI is still considered experimental by most platforms and not typically applied to regulatory submissions.

One audience member highlighted the lack of external validation studies on these platforms, which would help compare and contrast their accuracy. This gap is understandable, given that the use of AI in literature reviews is still relatively new (and in some cases experimental) and given the inevitable delays associated with publication in a peer-reviewed journal; until such studies are forthcoming, however, the onus is on the customer to determine the accuracy of these commercial products for their needs.

With this in mind, how should HEOR professionals assess the suitability of AI-enabled literature review platforms and determine whether the accuracy meets their needs?

  • Query the reported accuracy metrics – what type and volume of data points are they based on? Most platforms report “80–90% accuracy”, but it is often unclear what this is based on as there are currently no reporting guidelines for AI accuracy metrics. On closer investigation, metrics are usually based on the extraction of a small number of ‘study level characteristics’ (study design, geographic location, number of participants), not the large volumes of detailed characteristics and outcomes needed to support a meta-analysis or HTA submission. Therefore, it is challenging for the customer to make an informed decision about whether and where to use AI in literature reviews for regulatory or reimbursement submissions without doing their own further investigation.
  • Look for a company that is transparent and upfront about where AI is less accurate, and ask for the metrics on those areas. Can they advise on which aspects of data extraction AI performs less well at, e.g. arm-level extraction or safety data extraction, and what are their published accuracy metrics for those aspects?
  • Put it to the test on a review that reflects all the complexities of a real-life SLR. Compare the AI-generated results with those of a human reviewer – do the results live up to the claims?
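The "put it to the test" step can be scored field by field against a human reviewer's extractions. A minimal sketch (the field names and values are illustrative, not from any real benchmark):

```python
from collections import defaultdict


def per_field_accuracy(records: list[tuple[str, str, str]]) -> dict[str, float]:
    """Agreement rate per extracted field.

    Each record is (field_name, ai_value, human_gold_value); values are
    compared as normalised strings. Field names here are illustrative.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for field, ai_value, gold in records:
        totals[field] += 1
        hits[field] += ai_value.strip().lower() == gold.strip().lower()
    return {field: hits[field] / totals[field] for field in totals}


records = [
    ("study_design", "RCT", "RCT"),
    ("study_design", "cohort", "RCT"),
    ("n_participants", "120", "120"),
    ("n_participants", "118", "120"),
]
print(per_field_accuracy(records))  # {'study_design': 0.5, 'n_participants': 0.5}
```

Reporting accuracy per field, rather than as one headline number, makes it immediately clear whether a platform's "80–90% accuracy" extends beyond study-level characteristics to the outcome data that a meta-analysis actually depends on.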

Our SLR App

Costello Medical have conducted hundreds of literature reviews, and we are proud of our reputation for high quality. While optimistic about the potential for AI in literature reviews – to the extent that we have invested heavily in our own research and publications – we are also thoughtful about its application, as we are unwilling to sacrifice quality for efficiency when high quality is essential (for example, in SLRs to inform reimbursement submissions). For such projects, our approach to the use of AI is therefore purposeful, trusted and accountable: we rigorously assess AI systems before use and maintain a 100% human-in-the-loop approach, creating transparent processes that can be fully explained. We have been using our own, proprietary literature review platform for over seven years, into which we are now incorporating optional AI assistance, in a way that is aligned with our values and external guidance from bodies such as RAISE and Cochrane.

We are actively looking for external partners to pilot our app early in 2026. If you are interested, please get in touch to discuss further.

For more details on our research on AI in literature reviews to date, please visit the links below:

  • Use of AI to summarise PICOS during abstract review (presented at ISPOR Europe 2024)
  • Prompt development for data extraction from clinical (published as a manuscript in Cochrane Evidence Synthesis and Methods) and economic (presented at ISPOR 2025, Montreal) publications using AI
  • Prompt development for conducting quality assessments of economic evaluations (presented at ISPOR Tokyo 2025)
  • Evaluation of a traditional (validated search filter) vs machine learning (classifier) approach for identifying RCTs in an SLR (abstract co-authored with the Head of Evidence Pipeline and Data Curation at Cochrane and presented at the International Congress on Peer Review and Scientific Publication 2025)

The Road Ahead: Balancing Speed, Rigour, and Regulation

What seems clear is that AI in literature reviews is heading in a positive direction, with the increasing use of AI in literature review screening, and the first acceptance of this for a NICE submission, albeit with almost the same level of human oversight as in a traditional SLR. AI in literature reviews is here to stay, but success hinges on balancing speed with rigour and regulatory alignment.

At Costello Medical, we are embedding AI where it meaningfully reduces time and preserves or enhances quality, while maintaining transparent documentation and a robust human-in-the-loop. This means living governance that evolves with new tools, ongoing validation across domains, and a culture of careful scrutiny, especially when our work is subject to review by regulatory or HTA bodies.

We’re cautiously optimistic, but there is still some way to go before AI makes its full impact in the complex and detail-heavy world of HEOR, with its strict quality requirements for regulatory and reimbursement purposes.

References

  1. Patel K. Nested Knowledge. AI-Related Presentations and Posters: ISPOR EU 2025.
  2. Bobrowska A, Lunn L, Chan K, Murton M. SA50. How Much Time Does Artificial Intelligence Really Save in Evidence Synthesis? A Systematic Literature Review. Presented at ISPOR Europe Congress, Glasgow, UK. 2025.
  3. Issue Panel 102: Gen AI in Systematic Literature Reviews: First Case Study in Gen AI in a NICE Submission. Presented at ISPOR Europe Congress, Glasgow, UK. 2025.
  4. Murton M, Boulton E, Cross S, et al. Harnessing Large‐Language Models for Efficient Data Extraction in Systematic Reviews: The Role of Prompt Engineering. Cochrane Evidence Synthesis and Methods 2025;3:e70058.
  5. Issue Panel 136: Battle of the Bots: Navigating the Landscape of AI-Enabled Systematic Literature Review Platforms. Presented at ISPOR Europe Congress, Glasgow, UK. 2025.

If you would like any further information on the summary presented above, please get in touch, or visit our Evidence Development page. Liz Lunn (Account Coordination Manager) contributed to this article on behalf of Costello Medical. The views/opinions expressed are their own and do not necessarily reflect those of Costello Medical’s clients/affiliated partners.
