AI in Literature Reviews: Hype or Helpful?

Following on from ISPOR 2025 held in Montreal earlier this year, the use of AI in literature reviews remained a headline topic amongst the posters, panels and presentations at ISPOR Europe. We’re seeing a more mature ecosystem, with literature review platforms and products increasingly claiming the capability of plugging into databases, generating evidence syntheses, and evolving with ongoing updates. Regulators and health technology assessment (HTA) bodies are part of the conversation, with transparency, auditability, and human-in-the-loop oversight becoming baseline expectations for submissions incorporating AI.

GenAI in Literature Reviews: Where We Are and Where We’re Going

Research by Nested Knowledge identified a staggering 138 ISPOR Europe 2025 abstracts mentioning AI or large language models (LLMs), with the dominant application being evidence synthesis (60 studies including systematic literature reviews [SLRs], targeted literature reviews, screening, data extraction and quality assessment).1 Our poster on this topic took the form of an SLR to identify primary research studies that reported time or workload saved from applying AI to a specific aspect of a literature review.2 Most literature reviews using AI applied it to the screening stage (45/56), with only two reviews using AI for extraction, demonstrating that progress in this area wasn’t living up to the noise (as of June 2025). The median workload saved was 65%, the median time saved was 60%, and the median time saved per study was 1.02 minutes. Authors were more likely to be positive or cautiously positive than negative about the potential of AI to help conduct literature reviews. It should, however, be noted that the assumptions made about the time taken by a human reviewer appear unrealistic in some studies, creating outliers: one study reported saving 15.5 hours per risk of bias assessment, when a Costello Medical reviewer with 1–2 years’ experience takes no more than 1–2 hours using the Cochrane-recommended Risk of Bias 2.0 tool. Assumptions about human reviewer time should therefore be critically assessed when interpreting time-saving metrics.
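For readers unfamiliar with these metrics, the underlying arithmetic is simple. A minimal sketch, with definitions and numbers that are ours for illustration rather than taken from the poster:

```python
def workload_saved(total_records: int, records_needing_human_review: int) -> float:
    """Fraction of screening workload removed when AI pre-screens records.

    Illustrative definition: the proportion of records a human reviewer
    no longer needs to assess manually.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 1 - records_needing_human_review / total_records


def time_saved_per_study(total_minutes_saved: float, n_studies: int) -> float:
    """Average minutes saved per study included in the review."""
    return total_minutes_saved / n_studies


# Example: AI excludes 6,500 of 10,000 records, leaving 3,500 for human review
print(workload_saved(10_000, 3_500))  # 0.65, i.e. 65% workload saved
```

As the poster's outlier discussion shows, the output of such formulas is only as credible as the assumed human baseline that feeds into them.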

[Figure: Two graphs showing percent workload saved and time saved]

In an exciting first step for AI in HTA submissions, Pharmacoevidence Pvt. Ltd. and Gilead Sciences presented “Gen AI in Systematic Literature Reviews: The First Case Study on GenAI in a National Institute for Health and Care Excellence (NICE) Submission”.3 AI was implemented as the second reviewer in the title/abstract screening stage of the submission’s SLRs, with a reported two-week reduction in timelines. Decision alignment between the AI and human reviewers ranged from 89.01% (humanistic burden review) to 99.59% (economic evaluation review); any disagreements were resolved by an independent human reviewer. However, the presenters reported that all excluded citations, even those where both reviewers agreed, were re-checked by a second, independent human reviewer to confirm that no relevant studies had been missed. This raises the question: how much time was actually saved?
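Decision alignment of this kind is simply the share of records on which the AI and human reviewer made the same include/exclude call. A minimal sketch (the decisions below are illustrative, not taken from the submission):

```python
def percent_agreement(human: list[str], ai: list[str]) -> float:
    """Share of screening decisions where the AI and human reviewer agree."""
    if len(human) != len(ai) or not human:
        raise ValueError("decision lists must be non-empty and equal length")
    matches = sum(h == a for h, a in zip(human, ai))
    return 100 * matches / len(human)


# Illustrative title/abstract decisions ("include"/"exclude")
human = ["include", "exclude", "exclude", "include", "exclude"]
ai = ["include", "exclude", "include", "include", "exclude"]
print(f"{percent_agreement(human, ai):.2f}%")  # 80.00%
```

For a metric that also corrects for chance agreement, Cohen's kappa is the usual alternative, particularly when the exclude rate is very high.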

HTA Body Concerns and Our Response: A Practical Playbook

Alongside the reporting of the first NICE submission using AI, a representative from NICE presented their five key concerns surrounding the current status of AI use in SLRs.3 We report these below, along with practical mitigations that we’re embedding into Costello Medical workflows based on our research in this area.

Context Bias
Description: Occurs when LLMs are biased towards the prompter, often when the prompt template is too complex and the LLM learns the ‘voice’ of the user.
Potential mitigations: Independently verify prompts.
What are we doing? Our AI-enabled literature review workflows use a 100% human-in-the-loop approach to validate AI outputs. Our prompt development and testing research also includes multiple mitigations against bias, including the use of a benchmark compiled by two independent reviewers and the testing of prompts on benchmarks they weren’t developed on to assess generalisability. The results of this research are published in a peer-reviewed journal, to enable external critique.4
What comes next? We plan to develop and publish a test data set, which can be used by external bodies to verify reported data accuracy metrics.

Benchmark Bias
Description: Occurs when the benchmark standard itself has errors, e.g. when AI outputs are compared against a human ‘gold standard’ that contains errors due to mislabelling or subjectivity.
Potential mitigations: Independently verify prompts.
What are we doing? As above.

Explainability
Description: LLMs are ‘black boxes’; results should be not only plausible but also verifiable and justifiable.
Potential mitigations: Request output justifications from the LLM, with exact quotes.
What are we doing? Our AI-enabled literature review workflows include highlighting by the AI model in the evidence base to track the sources of AI-generated outputs and enable quick human verification.

Hallucinations
Description: When LLMs ‘make up’ plausible but incorrect results.
Potential mitigations: Prompt engineering using retrieval-augmented generation (RAG) and self-consistency. RAG combines a retriever, which gathers information from a knowledge base, with an LLM that writes the answer based on this information. Self-consistency increases reliability by generating multiple outputs and selecting the most likely final answer from these.
What are we doing? Our AI-enabled literature review workflows use a 100% human-in-the-loop approach to guard against hallucination.
What comes next? We plan to implement agentic AI approaches in our end-to-end literature review platform, e.g. a conflict resolution/review system with two LLMs taking the role of systematic reviewers and a third LLM as an adjudicator with different instructions to the task execution agents, or crowdsourcing (multiple LLM agents evaluate the same data sources and produce outputs; a central aggregator chooses the most common output).

Algorithmic Bias
Description: Occurs when a model favours one subset of data over another, for example particular medical domains.
Potential mitigations: Present performance across multiple domains.
What are we doing? We transparently report our data extraction accuracy metrics, including the data domain and volume of data points they are based on and the transferability of the prompts across different disease areas, in our manuscript ‘Harnessing LLMs for Efficient Data Extraction in SLRs’.4
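The self-consistency and crowdsourcing ideas above share a single aggregation mechanism: sample several answers and keep the most common one. A minimal sketch with a stubbed model, where `sample_llm` is a placeholder callable rather than any real platform's API:

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_llm: Callable[[str], str], prompt: str, k: int = 5) -> str:
    """Run the same prompt k times and return the most common answer.

    This is the aggregation step shared by self-consistency (one model,
    repeated sampling) and crowdsourcing (several agents, one aggregator).
    """
    answers = [sample_llm(prompt) for _ in range(k)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Stubbed "model" that is noisy but right more often than not
responses = iter(["exclude", "include", "exclude", "exclude", "include"])
stub = lambda prompt: next(responses)
print(self_consistent_answer(stub, "Screen this abstract", k=5))  # exclude
```

The adjudicator pattern differs only in replacing the majority vote with a third model that is given both candidate answers and asked to resolve the conflict.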

Navigating a Crowded Bot Landscape: How to Assess Suitability of AI-Enabled Platforms for Your Literature Review Needs

Perusing the exhibitor booths in the main hallway at ISPOR Europe, we were immediately struck by the increasingly crowded landscape of AI-enabled literature review platforms. At least five new platforms have popped up since ISPOR 2025 (Montreal) in May; this rapid proliferation of tools leads to procurement uncertainty and the challenge of assessing variable performance between platforms.

Wednesday’s Issue Panel ‘Battle of the Bots: Navigating the Landscape of AI-Enabled SLR Platforms’ comprised representatives from Nested Knowledge, MadeAI and DistillerSR, who were understandably keen to use the forum to promote their platforms.5 Tough questions from the audience helped delve into where AI is currently least successful in assisting the literature review process. The panellists agreed that AI generates the least successful outputs in review stages which are more subjective (deriving information from publications, conducting critical appraisals and making conclusions). It was also suggested that quantitative data extraction will always be difficult to achieve accurately with AI due to the heterogeneity of reporting. Data extraction with AI is still considered experimental by most platforms and not typically applied to regulatory submissions.

One audience member highlighted the lack of external validation studies on these platforms, which would help compare and contrast their accuracy. This gap is understandable, given that the use of AI in literature reviews is still relatively new (and in some cases experimental) and given the inevitable delays associated with publication in a peer-reviewed journal; until such studies are forthcoming, however, the onus is on the customer to determine the accuracy of these commercial products for their needs.

With this in mind, how should HEOR professionals assess the suitability of AI-enabled literature review platforms and determine whether the accuracy meets their needs?

  • Query the reported accuracy metrics – what type and volume of data points are they based on? Most platforms report “80–90% accuracy”, but it is often unclear what this is based on as there are currently no reporting guidelines for AI accuracy metrics. On closer investigation, metrics are usually based on the extraction of a small number of ‘study level characteristics’ (study design, geographic location, number of participants), not the large volumes of detailed characteristics and outcomes needed to support a meta-analysis or HTA submission. Therefore, it is challenging for the customer to make an informed decision about whether and where to use AI in literature reviews for regulatory or reimbursement submissions without doing their own further investigation.
  • Look for a company that is transparent and upfront about where AI is less accurate, and ask for the metrics on those areas. Can they advise on which aspects of data extraction AI performs less well at, e.g. arm-level extraction or safety data extraction, and what are their published accuracy metrics for those aspects?
  • Put it to the test on a review that reflects all the complexities of a real-life SLR. Compare the AI-generated results with those of a human reviewer – do the results live up to the claims?
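The "put it to the test" step can be scored field by field against a human reviewer's extractions. A minimal sketch (the field names and values are illustrative, not from any real benchmark):

```python
from collections import defaultdict


def per_field_accuracy(records: list[tuple[str, str, str]]) -> dict[str, float]:
    """Agreement rate per extracted field.

    Each record is (field_name, ai_value, human_gold_value); values are
    compared as normalised strings. Field names here are illustrative.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for field, ai_value, gold in records:
        totals[field] += 1
        hits[field] += ai_value.strip().lower() == gold.strip().lower()
    return {field: hits[field] / totals[field] for field in totals}


records = [
    ("study_design", "RCT", "RCT"),
    ("study_design", "cohort", "RCT"),
    ("n_participants", "120", "120"),
    ("n_participants", "118", "120"),
]
print(per_field_accuracy(records))  # {'study_design': 0.5, 'n_participants': 0.5}
```

Reporting accuracy per field, rather than as one headline number, makes it immediately clear whether a platform's "80–90% accuracy" extends beyond study-level characteristics to the outcome data that a meta-analysis actually depends on.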

Our SLR App

Costello Medical have conducted hundreds of literature reviews, and we are proud of our reputation for high quality. While optimistic about the potential for AI in literature reviews – to the extent that we have invested heavily in our own research and publications – we are also thoughtful about its application, as we are unwilling to sacrifice quality for efficiency when high quality is essential (for example, in SLRs to inform reimbursement submissions). For such projects, our approach to the use of AI is therefore purposeful, trusted and accountable: we rigorously assess AI systems before use and maintain a 100% human-in-the-loop approach, creating transparent processes that can be fully explained. We have been using our own, proprietary literature review platform for over seven years, into which we are now incorporating optional AI assistance, in a way that is aligned with our values and external guidance from bodies such as RAISE and Cochrane.

We are actively looking for external partners to pilot our app early in 2026. If you are interested, please get in touch to discuss further.

For more details on our research on AI in literature reviews to date, please visit the links below:

  • Use of AI to summarise PICOS during abstract review (presented at ISPOR Europe 2024)
  • Prompt development for data extraction from clinical (published as a manuscript in Cochrane Evidence Synthesis and Methods) and economic (presented at ISPOR 2025, Montreal) publications using AI
  • Prompt development for conducting quality assessments of economic evaluations (presented at ISPOR Tokyo 2025)
  • Evaluation of a traditional (validated search filter) vs machine learning (classifier) approach for identifying RCTs in an SLR (abstract co-authored with the Head of Evidence Pipeline and Data Curation at Cochrane and presented at the International Congress on Peer Review and Scientific Publication 2025)

The Road Ahead: Balancing Speed, Rigour, and Regulation

What seems clear is that AI in literature reviews is heading in a positive direction, with the increasing use of AI in literature review screening, and the first acceptance of this for a NICE submission, albeit with almost the same level of human oversight as in a traditional SLR. AI in literature reviews is here to stay, but success hinges on balancing speed with rigour and regulatory alignment.

At Costello Medical, we are embedding AI where it meaningfully reduces time and preserves or enhances quality, while maintaining transparent documentation and a robust human-in-the-loop. This means living governance that evolves with new tools, ongoing validation across domains, and a culture of careful scrutiny, especially when our work is subject to review by regulatory or HTA bodies.

We’re cautiously optimistic, but there is still some way to go before AI makes its full impact in the complex and detail-heavy world of HEOR, with its strict quality requirements for regulatory and reimbursement purposes.

References

  1. Patel K. Nested Knowledge. AI-Related Presentations and Posters: ISPOR EU 2025.
  2. Bobrowska A, Lunn L, Chan K, Murton M. SA50. How Much Time Does Artificial Intelligence Really Save in Evidence Synthesis? A Systematic Literature Review. Presented at ISPOR Europe Congress, Glasgow, UK. 2025.
  3. Issue Panel 102: Gen AI in Systematic Literature Reviews: First Case Study in Gen AI in a NICE Submission. Presented at ISPOR Europe Congress, Glasgow, UK. 2025.
  4. Murton M, Boulton E, Cross S, et al. Harnessing Large‐Language Models for Efficient Data Extraction in Systematic Reviews: The Role of Prompt Engineering. Cochrane Evidence Synthesis and Methods 2025;3:e70058.
  5. Issue Panel 136: Battle of the Bots: Navigating the Landscape of AI-Enabled Systematic Literature Review Platforms. Presented at ISPOR Europe Congress, Glasgow, UK. 2025.

If you would like any further information on the summary presented above, please get in touch, or visit our Evidence Development page. Liz Lunn (Account Coordination Manager) contributed to this article on behalf of Costello Medical. The views/opinions expressed are their own and do not necessarily reflect those of Costello Medical’s clients/affiliated partners.
