The use of large language models (LLMs) to augment processes across literature searches, abstract and full-text screening, and the extraction and summarisation of data has demonstrated the potential to increase efficiency and accuracy in literature reviews, transforming them from the resource-intensive research projects they currently are into a living source of all current evidence, summarised and synthesised at the click of a button. Literature reviews have therefore been described as the ‘low-hanging fruit’ of artificial intelligence (AI) applications in health economics and outcomes research (HEOR). Is this description justified?
Following on from ISPOR Europe 2024 in Barcelona, where research demonstrated early success in automating abstract screening, the use of AI in literature reviews again proved to be a hot topic at ISPOR 2025. The presentations and panels delved further into increasingly sophisticated AI techniques, with a greater focus on automated data extraction. However, the challenges discussed last year around hallucinations, legal considerations and the need for benchmarking remained a critical focus in ensuring the trustworthiness and regulatory acceptability of AI tools in this domain.
There were multiple research presentations focusing on the use of AI in data extraction, demonstrating progress in this field.
At least one piece of research focused on the use of AI in search strategy development. An LLM-based reasoning agent was evaluated for building Boolean search strings using chain-of-thought reasoning and an iterative agentic workflow, whereby a ‘Generator’ LLM suggested search terms and a ‘Critic’ LLM evaluated the search results and suggested refinements.2 Results were validated against 10 Cochrane SLRs, achieving a recall of 76.8%. This was deemed an acceptable level of recall, but given that almost a quarter of relevant records were not identified, it raises the question of whether such performance could really be considered acceptable for literature reviews used to inform regulatory and/or health technology assessment (HTA) submissions.
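For context, recall here is simply the proportion of known-relevant records (e.g., the studies included in the benchmark Cochrane SLRs) that the generated search retrieves. A minimal sketch, using made-up record IDs rather than the study’s actual data:

```python
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Proportion of known-relevant records that the search retrieved.
    'relevant' would be the studies included in a benchmark SLR."""
    return len(retrieved & relevant) / len(relevant)

# Illustrative only: 3 of 4 relevant records retrieved -> 75% recall.
# At the reported 76.8%, roughly a quarter of relevant records are missed.
print(recall({"rec1", "rec2", "rec3", "rec9"},
             {"rec1", "rec2", "rec3", "rec4"}))  # 0.75
```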
The acceptability of using AI in literature reviews for submissions to key decision-making bodies, such as those submitted as part of HTA, was discussed on a broader level. A scoping review aimed to identify guidelines and recommendations for using AI in literature reviews, summarising the following:2
Hallucinations, where models generate plausible but incorrect data, remain a substantial concern. These can arise from poor prompting, from a failure to apply mitigation techniques (such as quality control by another LLM), or from limitations in the models themselves. Safeguards could include asking the LLM to report its level of confidence in each output, or to state explicitly when it does not know the answer. Sending the same prompt multiple times and taking the modal output is another way to achieve something akin to LLM quality control, although this multiplies the computing power used per prompt by the number of repeats. Even then, problems can arise: many models converge on the same answer because they are often trained on the same data. For example, Claude Sonnet and ChatGPT have been found to make identical mistakes on the same abstracts even though they are different models.4 So, even when options exist to mitigate concerns, there are still nuances to be considered.
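As a minimal sketch of that repeat-and-vote idea (sometimes called self-consistency), the wrapper below sends the same prompt several times and keeps the modal answer; `query_llm` is a hypothetical stand-in for whichever LLM API is in use, not a real client call:

```python
from collections import Counter

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM API; replace with
    your provider's client (e.g. for an abstract-screening prompt)."""
    raise NotImplementedError

def modal_answer(prompt: str, n_runs: int = 5) -> tuple[str, float]:
    """Send the same prompt n_runs times and return the most common
    (modal) answer plus the share of runs that agreed with it.
    Note: compute cost scales linearly with n_runs, and agreement is
    no guarantee of correctness if the model errs consistently."""
    answers = [query_llm(prompt).strip() for _ in range(n_runs)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_runs

# Example usage: flag outputs with low agreement for human review.
# answer, agreement = modal_answer("Does this abstract report an RCT? ...")
# if agreement < 0.8:
#     print("Low agreement - route to human reviewer")
```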
Rigorous performance validation, benchmarking (e.g., via the ELEVATE-AI and TRIPOD-LLM initiatives), and transparent documentation are critical to ensure trustworthy AI outputs.5 At the moment there are no established benchmarks for which LLM to use for which purpose, as models are being updated all the time.4 This raises a key challenge when using AI, both in literature reviews and in the wider HEOR space: how to automate the review and benchmarking of LLM outputs, since doing so would set up a cycle of continuous improvement.4
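To make the benchmarking challenge concrete, one simple pattern is to score LLM-extracted fields against a human-curated gold standard. The field names and exact-match criterion below are illustrative assumptions, not any established benchmark:

```python
def benchmark_extraction(llm_output: dict[str, str],
                         gold_standard: dict[str, str]) -> dict[str, float]:
    """Score extracted fields against a human-curated gold standard.
    Exact (case-insensitive) matching is a simplification; real
    benchmarks would also need normalisation and fuzzy matching."""
    matches = sum(
        llm_output.get(field, "").strip().lower() == value.strip().lower()
        for field, value in gold_standard.items()
    )
    return {"fields": len(gold_standard),
            "accuracy": matches / len(gold_standard)}

# One hypothetical record: the LLM gets the endpoint wrong (2/3 correct).
gold = {"sample_size": "120", "primary_endpoint": "PFS", "design": "RCT"}
out = {"sample_size": "120", "primary_endpoint": "OS", "design": "RCT"}
print(benchmark_extraction(out, gold))  # accuracy = 2/3
```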
Moreover, legal considerations regarding copyright for paywalled content and proprietary datasets further complicate deployment. Careful management of intellectual property rights is required as, in some cases at least, the onus is still on the user rather than the platform owner to determine the copyright implications for each publication accessed.
The community recognises that, while AI could substantially enhance the efficiency of literature screening and data extraction, full automation cannot replace the need for human oversight. This poses a number of challenges. Human resource requirements are shifting from generating work to reviewing work, and reviewing an AI-generated output is a very different skill from generating work from scratch.6 This is where a second LLM could come in, evaluating the outputs and pointing out the areas requiring human review6 (although, as previously discussed, there are inherent challenges in using LLMs to review their own work). Additionally, the human review of records and extraction of data carries a research benefit: reviewers learn about the research question and the body of evidence, which often informs and improves the review outputs.7 To overlook this is to oversimplify the role humans play in literature reviews, and possibly to exaggerate the benefits to be gained from using AI.
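A hedged sketch of that triage step: a second, ‘critic’ model scores each extraction and routes low-confidence items to a human expert. Here `critic_score` is a hypothetical placeholder for a prompt to a different model family (to reduce the correlated-error risk noted earlier):

```python
from typing import Iterator

def critic_score(extraction: str, source_text: str) -> float:
    """Hypothetical placeholder: a second LLM rates (0-1) how well an
    extracted value is supported by its source text."""
    raise NotImplementedError

def triage(items: list[tuple[str, str]],
           threshold: float = 0.7) -> Iterator[tuple[str, str]]:
    """Route low-scoring extractions to a human reviewer; even
    'auto-accepted' items should still be spot-checked by an expert."""
    for value, source in items:
        if critic_score(value, source) < threshold:
            yield value, "human review required"
        else:
            yield value, "auto-accepted (spot-check)"
```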
AI’s role in literature reviews is poised to grow, with ongoing developments aimed at addressing current limitations. Embracing structured prompting techniques, robust validation and audit trails will be essential to harness AI effectively in this domain. For organisations, the key takeaway is to balance technological innovation with adherence to validation standards, ensuring that automation complements human expertise. While AI is certainly set to make great strides in this area, describing literature reviews as the ‘low-hanging fruit’ application of AI in HEOR perhaps oversimplifies the work that still needs to be done: developing suitable AI tools, rigorously testing them, and keeping not just a human but an expert human in the loop, so that systematic reviews can still be relied upon as the gold standard of evidence generation in HEOR.
References
The Literature Reviews team at Costello Medical are actively working on developing and implementing AI tools to increase efficiency in literature reviews. If you would be interested in collaborating with us to test these on your literature review projects, or if you would like any further information on the themes presented above, please get in touch, or visit our Literature Reviews page to find out how our expertise can benefit you. Liz Lunn (Account Coordination Manager) created this article on behalf of Costello Medical. The views/opinions expressed are her own and do not necessarily reflect those of Costello Medical’s clients/affiliated partners.