Following on from ISPOR 2025 held in Montreal earlier this year, the use of AI in literature reviews remained a headline topic amongst the posters, panels and presentations at ISPOR Europe. We’re seeing a more mature ecosystem, with literature review platforms and products increasingly claiming the capability of plugging into databases, generating evidence syntheses, and evolving with ongoing updates. Regulators and health technology assessment (HTA) bodies are part of the conversation, with transparency, auditability, and human-in-the-loop oversight becoming baseline expectations for submissions incorporating AI.
Research by Nested Knowledge identified a staggering 138 ISPOR Europe 2025 abstracts mentioning AI or large language models (LLMs), with the dominant application being evidence synthesis (60 studies covering systematic literature reviews [SLRs], targeted literature reviews, screening, data extraction and quality assessment).1 Our poster on this topic took the form of an SLR identifying primary research studies that reported time or workload saved from applying AI to a specific aspect of a literature review.2 Most literature reviews using AI applied it to the screening stage (45/56), with only two reviews using AI for extractions, demonstrating that progress in this area wasn’t living up to the noise (as of June 2025). The median workload saved was 65%, the median time saved was 60%, and the median time saved per study was 1.02 minutes. Authors were more likely to be positive or cautiously positive than negative about the potential of AI to help conduct literature reviews. It should, however, be noted that assumptions about the time taken by a human reviewer appear unrealistic in some studies, creating outliers: for example, one study reported saving 15.5 hours per risk of bias assessment, whereas the average for the Cochrane-recommended Risk of Bias 2.0 tool is no more than 1–2 hours for a Costello Medical reviewer with 1–2 years’ experience. Assumptions about human reviewer time should therefore be critically assessed when interpreting time-saving metrics.
In an exciting first step for AI in regulatory submissions, Pharmacoevidence Pvt. Ltd. and Gilead Sciences presented “Gen AI in Systematic Literature Reviews: The First Case Study on GenAI in a National Institute for Health and Care Excellence (NICE) Submission”.3 AI was implemented as the second reviewer in the title/abstract screening stage of the submission’s SLRs, with a reported two-week reduction in timelines. Decision alignment between the AI and human reviewers ranged from 89.01% (humanistic burden review) to 99.59% (economic evaluation review); any disagreements between the human and AI reviewers were resolved by an independent human reviewer. However, the presenters reported that all excluded citations, even those where both reviewers agreed, were re-checked by a second, independent human reviewer to confirm that no relevant studies were missed. This raises the question: how much time was actually saved?
Alongside the reporting of the first NICE submission using AI, a representative from NICE presented their five key concerns surrounding the current status of AI use in SLRs.3 We report these below, along with practical mitigations that we’re embedding into Costello Medical workflows based on our research in this area.
| Risk/Bias | Description | Potential Mitigations | Costello Medical Approach |
|---|---|---|---|
| Context Bias | Occurs when LLMs are biased towards the prompter, often occurring when the prompt template is too complex and the LLM learns the ‘voice’ of the user | Independently verify prompts | What are we doing? Our AI-enabled literature review workflows use a 100% human-in-the-loop approach to validate AI outputs. Our prompt development and testing research also includes multiple mitigations against bias, including the use of a benchmark compiled by two independent reviewers, and testing prompts on benchmarks they weren’t developed on to assess generalisability. The results of this research are published in a peer-reviewed journal, to enable external critique.4<br>What comes next? We plan to develop and publish a test data set, which can be used by external bodies to verify reported data accuracy metrics |
| Benchmark Bias | Occurs when the benchmark standard itself has errors e.g. if you are comparing the AI outputs to a human ‘gold-standard’ which has errors due to mislabelling or subjectivity | Independently verify prompts | As above |
| Explainability | LLMs are ‘black boxes’, the results should not only be plausible but also verifiable and justifiable | Request output justifications from the LLM, with exact quotes | What are we doing? Our AI-enabled literature review workflows include highlighting by the AI model in the evidence base to track the sources of AI-generated outputs and enable quick human verification |
| Hallucinations | When LLMs ‘make up’ plausible but incorrect results | Prompt engineering using retrieval-augmented generation (RAG) and self-consistency. RAG combines a retriever, which gathers information from a knowledge base, with an LLM that writes the answer based on this information. Self-consistency increases reliability by generating multiple outputs and selecting the most likely final answer from these | What are we doing? Our AI-enabled literature review workflows use a 100% human-in-the-loop approach to guard against hallucination.<br>What comes next? We plan to implement agentic AI approaches in our end-to-end literature review platform, e.g. the use of a conflict resolution/review system with two LLMs taking the role of systematic reviewers and a third LLM as an adjudicator with different instructions to the task execution agents, or crowdsourcing (multiple LLM agents evaluate the same data sources and produce outputs; a central aggregator chooses the most common output) |
| Algorithmic Bias | Occurs when a model favours one subset of data over another, for example particular medical domains | Present performance across multiple domains | What are we doing? We transparently report our data extraction accuracy metrics, including the data domain and volume of data points they are based on and the transferability of the prompts across different disease areas, in our manuscript ‘Harnessing LLMs for Efficient Data Extraction in SLRs’4 |
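The self-consistency and crowdsourcing mitigations in the table both reduce to aggregating several independent model runs and keeping the most common answer. A minimal sketch of that aggregation step is below; the screening decisions are hypothetical, and a real workflow would collect them from repeated LLM API calls rather than hard-coding them.

```python
from collections import Counter

def self_consistent_answer(outputs: list[str]) -> str:
    """Pick the most common answer from several independent model runs.

    This is the aggregation step shared by self-consistency (one model,
    multiple samples) and crowdsourcing (multiple agents, one central
    aggregator). Ties resolve to the first answer encountered.
    """
    if not outputs:
        raise ValueError("at least one output is required")
    # Normalise case/whitespace so 'Include' and 'include' count together
    normalised = [o.strip().lower() for o in outputs]
    return Counter(normalised).most_common(1)[0][0]

# Three hypothetical title/abstract screening decisions from independent runs
decisions = ["Include", "include", "exclude"]
print(self_consistent_answer(decisions))  # -> include
```

Note that majority voting only guards against inconsistent errors; if every run hallucinates the same answer, the vote will confidently return it, which is one reason the human-in-the-loop check remains in place.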
Perusing the exhibitor booths in the main hallway at ISPOR Europe, the increasingly crowded landscape of AI-enabled literature review platforms was immediately clear. At least five new platforms have appeared since ISPOR 2025 (Montreal) in May; this rapid proliferation of tools creates procurement uncertainty and the challenge of assessing variable performance between platforms.
Wednesday’s Issue Panel ‘Battle of the Bots: Navigating the Landscape of AI-Enabled SLR Platforms’ comprised representatives from Nested Knowledge, MadeAI and DistillerSR, who were understandably keen to use the forum to promote their platforms.5 Tough questions from the audience helped delve into where AI is currently least successful in assisting the literature review process. The panellists agreed that AI generates the least successful outputs in review stages which are more subjective (deriving information from publications, conducting critical appraisals and making conclusions). It was also suggested that quantitative data extraction will always be difficult to achieve accurately with AI due to the heterogeneity of reporting. Data extraction with AI is still considered experimental by most platforms and not typically applied to regulatory submissions.
One audience member highlighted the lack of external validation studies on these platforms, which would help compare and contrast their accuracy. This gap is understandable, given that the use of AI in literature reviews is still relatively new (and in some cases experimental) and that peer-reviewed publication involves inevitable delays; until such studies are forthcoming, however, the onus is on the customer to determine whether the accuracy of these commercial products meets their needs.
With this in mind, how should HEOR professionals assess the suitability of AI-enabled literature review platforms and determine whether the accuracy meets their needs?
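One practical starting point, assuming you can run a candidate platform on a sample of citations that a human team has already screened, is to compare the AI's decisions against that human ‘gold standard’. The function and decision labels below are purely illustrative, not a description of any specific platform's or Costello Medical's validation process:

```python
def screening_metrics(ai: list[str], human: list[str]) -> dict[str, float]:
    """Compare AI screening decisions against a human 'gold standard'.

    For SLRs, recall on includes is typically the critical metric:
    a missed relevant study is costlier than an unnecessary
    full-text review.
    """
    if len(ai) != len(human):
        raise ValueError("decision lists must be the same length")
    tp = sum(a == h == "include" for a, h in zip(ai, human))
    tn = sum(a == h == "exclude" for a, h in zip(ai, human))
    fn = sum(a == "exclude" and h == "include" for a, h in zip(ai, human))
    fp = sum(a == "include" and h == "exclude" for a, h in zip(ai, human))
    return {
        "agreement": (tp + tn) / len(ai),                   # overall alignment
        "recall": tp / (tp + fn) if tp + fn else 1.0,       # sensitivity on includes
        "specificity": tn / (tn + fp) if tn + fp else 1.0,  # correct exclusions
    }

# Hypothetical decisions on four citations
metrics = screening_metrics(
    ai=["include", "exclude", "include", "exclude"],
    human=["include", "include", "include", "exclude"],
)
print(metrics)  # a recall below 100% flags a missed relevant study
```

Headline ‘agreement’ figures can look reassuring while hiding missed includes, which is why reporting recall separately (and remembering that the human benchmark itself may contain errors, per the Benchmark Bias row above) matters when judging whether a platform is fit for a given submission.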
Costello Medical have conducted hundreds of literature reviews, and we are proud of our reputation for high quality. While optimistic about the potential for AI in literature reviews – to the extent that we have invested heavily in our own research and publications – we are also thoughtful about its application, as we are unwilling to sacrifice quality for efficiency when high quality is essential (for example, in SLRs to inform reimbursement submissions). For such projects, our approach to the use of AI is therefore purposeful, trusted and accountable: we rigorously assess AI systems before use and maintain a 100% human-in-the-loop approach, creating transparent processes that can be fully explained. We have been using our own proprietary literature review platform for over seven years and are now incorporating optional AI assistance into it, in a way that aligns with our values and with external guidance from bodies such as RAISE and Cochrane.
We are actively looking for external partners to pilot our app early in 2026. If you are interested, please get in touch to discuss further.
For more details on our research on AI in literature reviews to date, please visit the links below:
What seems clear is that AI in literature reviews is heading in a positive direction, with the increasing use of AI in literature review screening, and the first acceptance of this for a NICE submission, albeit with almost the same level of human oversight as in a traditional SLR. AI in literature reviews is here to stay, but success hinges on balancing speed with rigour and regulatory alignment.
At Costello Medical, we are embedding AI where it meaningfully reduces time and preserves or enhances quality, while maintaining transparent documentation and a robust human-in-the-loop. This means living governance that evolves with new tools, ongoing validation across domains, and a culture of careful scrutiny, especially when our work is subject to review by regulatory or HTA bodies.
We’re cautiously optimistic, but we are still a long way away from AI making meaningful impacts in the complex and detail-heavy world of HEOR, with its strict quality requirements for regulatory or reimbursement purposes.
References
If you would like any further information on the summary presented above, please get in touch, or visit our Evidence Development page. Liz Lunn (Account Coordination Manager) contributed to this article on behalf of Costello Medical. The views/opinions expressed are their own and do not necessarily reflect those of Costello Medical’s clients/affiliated partners.