Literature Reviews – The Promised ‘Low-Hanging Fruit’ Application of AI in HEOR?

The use of large language models (LLMs) to augment literature searches, abstract and full-text screening, and the extraction and summarisation of data has demonstrated the potential to increase the efficiency and accuracy of literature reviews, transforming them from the resource-intensive research projects they currently are into a living source of current evidence, summarised and synthesised at the click of a button. Literature reviews have therefore been described as the ‘low-hanging fruit’ of artificial intelligence (AI) applications in health economics and outcomes research (HEOR). Is this description justified?

Following on from ISPOR Europe 2024 in Barcelona, where research demonstrated early success in automating abstract screening, the use of AI in literature reviews again proved to be a hot topic at ISPOR 2025. The presentations and panels delved further into increasingly sophisticated AI techniques, with a greater focus on automated data extraction. However, the challenges discussed last year around hallucinations, legal considerations and the need for benchmarking remained a critical focus in ensuring the trustworthiness and regulatory acceptability of AI tools in this domain.

Current Status and Applications

There were multiple research presentations focusing on the use of AI in data extraction, demonstrating progress in this field:

  • AGHealth.ai, a tool that screens, extracts and synthesises biomedical literature using genAI, was demonstrated.1 Results were presented for the accuracy of genAI data extractions compared with a human reviewer across 61 records, with no mistakes made in the genAI extractions. However, extraction was performed for only one outcome of interest (complete response), so the results are not directly comparable to the breadth of data required in most systematic literature reviews (SLRs).
  • Claude 3.7 Sonnet was used to read full articles, including PDFs with graphs and complex tables, to automate data extraction across oncology literature.2 Almost 120,000 data points were extracted, with the rate of true positives exceeding 90% across study characteristics, intervention characteristics, participant characteristics and outcomes. The reproducibility of these results in other disease areas was not reported in this presentation.
  • Our own innovation and literature review experts presented a research poster on prompt development and testing for data extraction in literature reviews using an AI model.3 Prompts were developed to extract data from economic publications reporting economic models, utility data and cost/resource use data in one disease area, and were then tested on another disease area to determine their transferability across indications. F1 scores (the harmonic mean of precision and recall; a worked example follows this list) of greater than 0.7 were achieved across all study types, and these scores were generally maintained or exceeded (range 0.68-0.97) when the same prompts were tested on a small number of records (n=4) in another disease area, indicating that, at least for economic papers, prompts for data extraction could be reused across disease areas with minimal updates. However, testing was only performed in a small number of disease areas; further testing is needed to confirm this hypothesis and provide confidence in the results.
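To make the metric concrete, the following is a minimal sketch of how an F1 score might be computed for data extraction, treating each extracted data point as an item to be matched against a human ‘gold standard’. The field names and figures are illustrative, not taken from the poster.

```python
# Minimal sketch: scoring AI data extractions against a human 'gold standard'.
# The data point labels below are illustrative only.

def precision_recall_f1(ai_items: set, gold_items: set) -> tuple:
    """Treat each extracted data point as an item; compare the two sets."""
    true_positives = len(ai_items & gold_items)
    precision = true_positives / len(ai_items) if ai_items else 0.0
    recall = true_positives / len(gold_items) if gold_items else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the AI missed one utility value and invented one cost figure.
gold = {"utility_0.78", "utility_0.65", "cost_GBP_12400"}
ai = {"utility_0.78", "cost_GBP_12400", "cost_GBP_9999"}

p, r, f1 = precision_recall_f1(ai, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# precision=0.67 recall=0.67 F1=0.67
```

An F1 of 1.0 would mean the AI extraction matched the human extraction exactly; scores above the 0.7 level reported in the poster correspond to a substantial, but imperfect, overlap.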

At least one piece of research focused on the use of AI in search strategy development. An LLM-based reasoning agent was evaluated for building Boolean search strings using chain-of-thought reasoning and an iterative agentic workflow, whereby a ‘Generator’ LLM suggested search terms and a ‘Critic’ LLM evaluated the search results and proposed revisions.2 Results were validated against 10 Cochrane SLRs, achieving a recall of 76.8%. This was deemed an acceptable level of recall but, given that almost a quarter of relevant records were not identified, it raises the question of whether this could really be considered acceptable for literature reviews used to inform regulatory and/or health technology assessment (HTA) submissions.
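The session did not detail the agent’s prompts or stopping rule, but the Generator/Critic loop can be sketched roughly as below. Here `llm` and `run_search` stand in for an LLM client and a bibliographic database API supplied by the caller, and the ‘OK’ sign-off convention is an assumption for illustration.

```python
# Hypothetical sketch of the Generator/Critic workflow described above.
# The real agent's prompts, stopping rule and search backend were not
# reported; 'llm' and 'run_search' are stand-ins supplied by the caller.

def refine_search_string(llm, run_search, question, max_rounds=5):
    """Generator proposes a Boolean string; Critic reviews sample results."""
    query = llm(f"Propose a Boolean search string for this review question:\n{question}")
    for _ in range(max_rounds):
        sample = run_search(query)[:20]  # e.g. titles of the first 20 hits
        verdict = llm(
            "You are the Critic. If this search string retrieves the right "
            "records, reply exactly 'OK'. Otherwise reply with a revised "
            f"Boolean string only.\nString: {query}\nSample hits: {sample}"
        )
        if verdict.strip() == "OK":  # assumed convention for the sign-off
            break
        query = verdict.strip()      # the Critic's revision becomes the new string
    return query
```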

The acceptability of using AI in literature reviews for submissions to key decision-making bodies, such as those submitted as part of HTA, was discussed on a broader level. A scoping review aimed to identify guidelines and recommendations for using AI in literature reviews, summarising the following:2

  • The Responsible AI in Evidence SynthEsis (RAISE) first draft recommendations require AI use to be reported transparently in evidence synthesis manuscripts, with researchers remaining ultimately responsible for the evidence synthesis and for ensuring ethical, legal and regulatory standards are adhered to when using AI. In 2025, the recommendations will be revised to include more detail and how-to guidance, and to create consensus-based guidelines for responsible AI use in evidence synthesis.
  • A new joint AI Methods Group between the Cochrane Collaboration, the Campbell Collaboration, JBI and the Collaboration for Environmental Evidence (CEE) has been established to focus on AI and automation in evidence synthesis. One of the first actions of this group was to provisionally endorse the next version of the RAISE recommendations and guidance for use.
  • Cochrane’s focus is on responsible use of AI in systematic reviews: encouraging studies within reviews and research into AI tools, public sharing of this research, and transparency in reporting when AI is used in writing.
  • A publication from the ISPOR Working Group on Generative AI highlights that AI can augment almost every stage of an SLR, but not as an autonomous replacement for humans.
  • The National Institute for Health and Care Excellence (NICE) and Canada’s Drug Agency (CDA-AMC) are aligned in that LLMs may support most literature review stages; the key area where this methodology is less established is automated data extraction.
  • Gaps were identified in addressing how to adapt AI to the less structured nature of pragmatic literature reviews, where reference sets are generally smaller and more heterogeneous.

Challenges and Future Direction

Hallucinations, where models generate plausible but incorrect data, remain a substantial concern. These can arise from poor prompting, from a failure to apply mitigating techniques (such as quality control by another LLM), or from current limitations in the models themselves. Guards against this include asking the LLM to state its level of confidence in each output, or to say explicitly when it does not know the answer. Sending the same prompt multiple times and taking the modal output is another way to achieve something akin to quality control, although this multiplies the computing cost of each prompt by the number of repetitions. Even then, problems can remain: many models converge on the same answer because they are often trained on similar data. For example, Claude Sonnet and ChatGPT have been found to make identical mistakes on the same abstracts despite being different models.4 So, even when options exist to mitigate concerns, there are still nuances to be considered.
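As a rough illustration of the modal-output approach, the sketch below repeats a prompt and keeps the most common answer; `llm` stands in for any chat-completion client, and the vote count is a cost/stability trade-off rather than a recommended value.

```python
# Minimal sketch of 'modal output' quality control: send the same prompt
# several times and keep the most frequent answer. Low agreement across
# votes can itself be used to flag items for human review.
from collections import Counter

def modal_answer(llm, prompt: str, n_votes: int = 5) -> tuple:
    """Return the most frequent answer and its vote share across n_votes calls."""
    answers = [llm(prompt).strip() for _ in range(n_votes)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_votes

# Caveat, per the discussion above: a 5/5 consensus is not proof of
# correctness; models trained on similar data can agree on the same
# wrong answer.
```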

Rigorous performance validation, benchmarking (e.g., via the ELEVATE-AI and TRIPOD-LLM initiatives) and transparent documentation are critical to ensuring trustworthy AI outputs.5 At present, there are no established benchmarks for which LLM to use for which purpose, as models are being updated all the time.4 This raises a key challenge when using AI, both in literature reviews and in the wider HEOR space: how to automate the review and benchmarking of an LLM’s outputs, so that each new model or prompt version feeds a cycle of continuous improvement.4
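One way such a cycle might be operationalised, sketched under assumptions below, is a regression-style harness that re-scores each new model or prompt version against a fixed, human-validated gold set; the function names, data structures and threshold are illustrative, not drawn from any of the initiatives cited above.

```python
# Hypothetical regression harness: re-score every new model/prompt version
# against a fixed, human-validated gold set so performance drift is caught
# automatically. 'extract' is a stand-in for the AI extraction pipeline.

def benchmark(extract, gold_set: dict, threshold: float = 0.7) -> bool:
    """gold_set maps record IDs to sets of human-extracted data points."""
    scores = []
    for record_id, gold_items in gold_set.items():
        ai_items = extract(record_id)               # AI extraction, as a set
        tp = len(ai_items & gold_items)
        precision = tp / len(ai_items) if ai_items else 0.0
        recall = tp / len(gold_items) if gold_items else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    mean_f1 = sum(scores) / len(scores)
    print(f"mean F1 = {mean_f1:.2f} over {len(scores)} records")
    return mean_f1 >= threshold  # gate new versions on non-regression
```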

Moreover, legal considerations regarding copyright for paywalled content and proprietary datasets further complicate deployment. Careful management of intellectual property rights is required: in some cases at least, the onus remains on the user rather than the platform owner to determine the copyright implications of each publication accessed.

The community recognises that, while AI could substantially enhance the efficiency of literature screening and data extraction, full automation cannot replace the need for human oversight. This poses a number of challenges. The human resource burden is shifting from generating work to reviewing it, and reviewing an AI-generated output is a very different skill from generating work from scratch.6 This could be where a second LLM that evaluates the outputs and flags the areas requiring human review comes in6 (although, as discussed above, there are inherent challenges in using LLMs to review their own work). Additionally, the human review of records and extraction of data carries its own research benefit: reviewers learn about the research question and the body of evidence, which often informs and improves the review outputs.7 To overlook this is to oversimplify the role played by humans in literature reviews, and possibly to exaggerate the benefits to be gained from using AI.
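A second-LLM triage step of this kind might look something like the sketch below; the verifier prompt and the ‘REVIEW’ flag convention are assumptions for illustration, not a description of any tool presented at the congress.

```python
# Hypothetical sketch of a second-LLM triage step: a verifier model checks
# each extraction against its quoted source text and flags doubtful items,
# so that human review effort is targeted where it matters most.

def triage_for_human_review(verifier_llm, extractions: list) -> list:
    """Return the subset of extractions a human should check first."""
    flagged = []
    for item in extractions:
        verdict = verifier_llm(
            "Check this extracted data point against the quoted source text. "
            "Reply 'OK' if it is supported, otherwise 'REVIEW' with a reason.\n"
            f"Data point: {item['value']}\nSource text: {item['source_quote']}"
        )
        if verdict.strip().startswith("REVIEW"):
            flagged.append((item, verdict))
    return flagged  # unflagged items would still warrant human spot-checks
```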

Implications

AI’s role in literature reviews is poised to grow, with ongoing developments aimed at addressing current limitations. Embracing structured prompting techniques, robust validation and audit trails will be essential to harnessing AI effectively in this domain. For organisations, the key takeaway is to balance technological innovation with adherence to validation standards, ensuring that automation complements human expertise. While AI is certainly poised to make great strides in this area, describing literature reviews as the ‘low-hanging fruit’ application of AI in HEOR perhaps oversimplifies the work that still needs to be done: developing suitable AI tools, testing them rigorously, and keeping not only a human, but an expert human, in the loop, so that systematic reviews can still be relied upon as the gold standard of evidence generation in HEOR.

References

  1. Educational Symposium 030: Driving Evidence-Based Medicine Forward with Generative AI (GenAI). Presented at ISPOR International Congress, Montreal, Canada. 2025.
  2. Research Podium 066: AI-Assisted Literature Reviews: Requirements and Advances. Presented at ISPOR International Congress, Montreal, Canada. 2025.
  3. Lunn L, Cross S, Kumar S, Boulton E, Khan A, Magri G, Slater D, Tiwari S, Murton M. MSR84. Data Extraction in Literature Reviews Using an Artificial Intelligence Model: Prompt Development and Testing. Presented at ISPOR International Congress, Montreal, Canada. 2025.
  5. Introduction to Applied Generative AI Short Course. Presented at ISPOR International Congress, Montreal, Canada. 2025.
  5. Issue Panel 038: Accelerating the Adoption of Generative AI in HEOR: Lessons from Early Adopters. Presented at ISPOR International Congress, Montreal, Canada. 2025.
  6. Breakout Session 015: From General to HEOR-Specific: Transforming LLMs into Reliable Research Tools. Presented at ISPOR International Congress, Montreal, Canada. 2025.
  7. Issue Panel 070: AI Agents and Guardrails in HEOR: The Ultimate Solution to GenAI Shortcomings or Just Another Overhyped Tool? Presented at ISPOR International Congress, Montreal, Canada. 2025.

The Literature Reviews team at Costello Medical are actively working on developing and implementing AI tools to increase efficiency in literature reviews. If you would be interested in collaborating with us to test these on your literature review projects, or if you would like any further information on the themes presented above, please get in touch, or visit our Literature Reviews page to find out how our expertise can benefit you. Liz Lunn (Account Coordination Manager) created this article on behalf of Costello Medical. The views/opinions expressed are her own and do not necessarily reflect those of Costello Medical’s clients/affiliated partners.
