Trust Me I’m a (Ph)Doctor: Ensuring Integrity and Reliability in AI-Augmented HEOR Work

The buzz around the use of AI in health economics and outcomes research (HEOR) and real-world evidence (RWE) has continued this year after considerable interest at ISPOR International 2024 and ISPOR Europe 2023. Potential uses of AI span numerous areas, including identifying PICO criteria (a key topic ahead of the implementation of the Joint Clinical Assessment [JCA] process), conducting systematic literature reviews (SLRs) and network meta-analyses, and developing economic evaluations and Health Technology Assessment (HTA) dossiers. The potential of AI as a tool to help Health Technology Developers meet very tight JCA dossier development timelines was highlighted in several sessions. In a dedicated session on this topic, Ipek Özer Stillman illustrated several examples of where AI can be leveraged to improve efficiency.1

The most frequently highlighted applications of generative AI (genAI) in HEOR at ISPOR Europe 2024 were integration into the SLR process and health economic model development process to enhance efficiency.

Accelerating Insights: GenAI in the SLR Process

SLRs were referred to as the ‘low-hanging fruit’ for AI integration throughout the conference. Various AI applications have been explored to enhance SLR efficiency, particularly in the literature screening stage:

  • A notable method, “prioritised screening,” involves advancing the most relevant articles to the start of the abstract screening queue, enabling their early identification. Many AI-assisted screening platforms now provide this feature. One study, conducted by a team from Costello Medical, examined GPT-3.5-turbo for creating PICO summaries to aid human reviewers, achieving up to a 50% efficiency improvement with positive feedback from most reviewers.2 In contrast, Cichewicz et al. (2024) assessed whether a GPT-4-driven tagging approach, originally intended for data extraction, could assist abstract screening. The authors considered this approach to be currently unsuitable as an independent screening tool due to low recall and accuracy.3
  • The adoption of genAI as a secondary reviewer for abstract screening was also a key area of exploration. Several studies highlighted significant workload reductions and high sensitivity with AI-assisted SLR systems, though specificity varied. O’Donovan et al. (2024) reported a 92.1% concordance between AI and human reviewers.4 Generally, AI tools demonstrated higher sensitivity and lower specificity than human reviewers, highlighting their potential role as a safety net to ensure no relevant studies are overlooked when screening alongside a human review team.5,6
  • There is also increasing exploration of the use of AI in full-text screening and data extraction.
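The sensitivity, specificity and concordance figures quoted above can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies: it assumes the human reviewers' decisions are treated as the reference standard, and the function name, data structure and example decisions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    sensitivity: float  # share of human-included records the AI also included
    specificity: float  # share of human-excluded records the AI also excluded
    concordance: float  # share of all records where AI and human agree


def compare_screening(human: list[bool], ai: list[bool]) -> ScreeningMetrics:
    """Compare AI include/exclude decisions with human decisions (True = include).

    Assumption: the human decision is the reference standard.
    """
    if len(human) != len(ai) or not human:
        raise ValueError("decision lists must be non-empty and of equal length")
    pairs = list(zip(human, ai))
    tp = sum(h and a for h, a in pairs)              # both include
    tn = sum((not h) and (not a) for h, a in pairs)  # both exclude
    fp = sum((not h) and a for h, a in pairs)        # AI includes, human excludes
    fn = sum(h and (not a) for h, a in pairs)        # AI excludes, human includes
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human include and one human exclude")
    return ScreeningMetrics(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        concordance=(tp + tn) / len(pairs),
    )


# Made-up decisions for ten abstracts (illustrative only, not study data).
human_decisions = [True, True, True, False, False, False, False, False, False, False]
ai_decisions = [True, True, True, True, True, False, False, False, False, False]

metrics = compare_screening(human_decisions, ai_decisions)
print(metrics)
```

In this made-up example the AI misses none of the human-included abstracts but over-includes two others, which mirrors the "high sensitivity, lower specificity" pattern described above.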

Performance of the evaluated tools often varied with context, indicating a need for refinement. AI tools were also more adept at recognising clearer study designs such as randomised controlled trials than more ambiguous observational studies. Many studies utilised retrospective datasets, necessitating further research to confirm these findings in prospective SLRs. As AI technology progresses, ongoing refinement and awareness of these limitations are essential for its reliable integration into SLRs.

From Algorithms to Outcomes: Adventures in AI-Augmented Health Economic Modelling

The ability of genAI to facilitate tasks like translating natural language into programming languages and enhancing model development is recognised as a potential force multiplier for health economists. The use of genAI in health economic modelling is not currently mainstream. However, it was clear at the conference that AI is already being explored for tasks such as automatically updating inputs within a cost-effectiveness model to localise it to a specific market,7 supporting model scoping, and identifying the most appropriate survival extrapolations to use in partitioned survival models.8,9

The use of AI in health economic modelling and HEOR was the subject of two ISPOR short courses for the first time this year. The short courses demonstrated several use cases for AI in health economic modelling. However, it was clear that some upskilling may be required before efficiency gains can be realised, given the need for familiarity with AI techniques like prompt engineering. AI tools also do not interact with the Microsoft Excel format in the same way that a human user would, so it may be more appropriate to use AI to support script-based modelling in R or Python. This may limit the applicability of AI in health economic modelling in the short term, as health economic models are currently built predominantly in Microsoft Excel due to limited acceptance of R-based models by HTA bodies. Conversely, the advent of AI could help HTA bodies become more accepting of models in R or other languages, as AI will allow for faster creation and review of code.

However, barriers remain to effective use of AI in model development. For example, one session looked at using AI to build health economic models and found that genAI invented fake references, hallucinated inputs, and created model health states that did not appropriately link to other health states.10 Many HTA bodies remain tentative about fully accepting AI-generated models as a result.
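One of the failure modes above, health states that do not link properly to other health states, is the kind of error a simple automated check can surface for a human reviewer. The sketch below is an illustration only, not the method used in the cited session. It assumes a Markov-style cohort model described by a matrix of per-cycle transition probabilities, with the cohort starting in the first listed state; the function name and the example states and values are hypothetical.

```python
def check_transition_matrix(states, matrix, tol=1e-9):
    """Return a list of human-readable problems; an empty list means none were found.

    Assumptions: matrix[i][j] is the per-cycle probability of moving from
    states[i] to states[j], and the cohort starts in states[0].
    """
    n = len(states)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        return ["matrix must be square with one row and column per health state"]

    problems = []
    for name, row in zip(states, matrix):
        if any(p < 0 or p > 1 for p in row):
            problems.append(f"{name}: probabilities must lie between 0 and 1")
        if abs(sum(row) - 1.0) > tol:
            problems.append(f"{name}: outgoing probabilities sum to {sum(row):.3f}, not 1")

    # Walk the transition graph from the starting state to find
    # health states that the cohort can never enter.
    reachable, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j, p in enumerate(matrix[i]):
            if p > 0 and j not in reachable:
                reachable.add(j)
                frontier.append(j)
    for j, name in enumerate(states):
        if j not in reachable:
            problems.append(f"{name}: not reachable from {states[0]}")
    return problems


# Hypothetical three-state model (values are illustrative, not from any study).
states = ["Progression-free", "Progressed", "Dead"]
well_formed = [
    [0.85, 0.10, 0.05],
    [0.00, 0.80, 0.20],
    [0.00, 0.00, 1.00],
]
# A generated model in which nobody can ever progress and one row does not sum to 1.
badly_linked = [
    [0.90, 0.00, 0.05],
    [0.00, 0.80, 0.20],
    [0.00, 0.00, 1.00],
]

print(check_transition_matrix(states, well_formed))
for problem in check_transition_matrix(states, badly_linked):
    print(problem)
```

A check like this only catches structural errors. It cannot tell whether an input value or a cited source is genuine, so it complements human review rather than replacing it.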

AI & HEOR: Balancing Speed, Trust, and Innovation in Healthcare Research

There was consensus that a 100% human-in-the-loop approach remains essential to ensure reliability and trustworthiness of AI outputs in SLRs and health economic modelling at present, in line with the principles of Responsible AI. This raises questions around the actual efficiency gains associated with the use of AI in the short term given the need for extensive validation. Multiple sessions highlighted that while genAI tools do offer potential speed advantages, further progress is required in key areas before AI can be used confidently in HEOR:1,5,6

  • AI hallucinations, outputs that may seem plausible but are factually incorrect, were a particular focal point at the conference. Several strategies to minimise hallucinations were discussed, including retrieval-augmented generation (RAG)-based techniques (an advanced approach which consists of a retrieval phase, where AI sifts through specific repositories of information, and a generative phase, where AI assimilates retrieved content with its pre-existing knowledge base).
  • Compared with traditional ‘one shot’ prompting techniques, a RAG-based approach yields responses that are more accurate and richer in context. One session introduced a RAG-based approach to citation generation which maintained 100% factual accuracy in initial trials, though its wider generalisability remains under evaluation.5 This indicates a promising direction for addressing factual inaccuracies through targeted methods.
  • Model bias from training on skewed datasets and the inherent ‘black box’ nature of AI systems remain significant limitations to the use of AI in HEOR. The need for transparency in AI methodologies and ensuring reproducibility of results was highlighted across several sessions.
  • Introduction of HEOR-specific guidance on appropriate processes and documentation for the use of AI may offer a solution to this challenge. There is currently very little in the way of specific, repeatable processes to follow, and the field instead relies on individuals involved in utilising genAI transparently reporting their processes and verifying outcomes rigorously.
  • An ISPOR task force is in the process of developing guidance for AI in HEOR, and the ISPOR Working Group on Generative AI highlighted the importance of aligning existing AI frameworks with HEOR-specific needs, advocating for evaluation frameworks that focus on acceptability thresholds tailored to individual use cases.5 For example, developing guidance on specific cut-off values for screening sensitivity and specificity to be demonstrated by AI models used in literature reviews before screening independently could facilitate the use of AI in SLRs with minimal human supervision.5
  • The high cost of implementing AI for some applications raises questions about financial viability, especially when weighing the benefits against expenditure. There is also a need for upskilling initiatives to equip HEOR professionals to leverage AI technologies effectively and responsibly, so that concerns around bias and data privacy can be suitably addressed.
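The two phases of RAG described in the list above can be sketched in a few lines of Python. This is a toy illustration, not the approach presented at the conference. Retrieval here is simple word overlap (real systems typically use embedding-based search), the generative phase is reduced to building the prompt that would be sent to a language model, and the corpus, document IDs and function names are all hypothetical. Instructing the model to answer only from the retrieved sources is one possible design choice for limiting hallucinations, assumed here for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Retrieval phase: return the IDs of the k passages sharing most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in corpus.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Highest overlap first; ties are broken alphabetically by ID for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def build_prompt(question: str, corpus: dict[str, str], k: int = 2) -> str:
    """Generative phase (input only): ground the model in the retrieved passages."""
    ids = retrieve(question, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    return (
        "Answer using only the sources below and cite their IDs. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical repository of passages (made-up content for illustration).
corpus = {
    "doc-a": "The trial reported median overall survival of 14 months in the treatment arm.",
    "doc-b": "Utility values were collected with the EQ-5D questionnaire at each visit.",
    "doc-c": "Drug acquisition costs were taken from the national price list.",
}
question = "What was the median overall survival in the trial?"

prompt = build_prompt(question, corpus)
print(prompt)
```

Because every passage in the prompt carries its source ID, a human reviewer can trace each cited statement back to the repository, which supports the transparency and reproducibility concerns raised above.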

 

Overall, it was clear that the use of genAI in HEOR is only going to increase over time. However, discussions at ISPOR underscored the importance of a balanced approach—leveraging AI’s strengths where they can truly make a difference to efficiency, while maintaining rigorous oversight and evaluation to safeguard the integrity and quality of outputs in healthcare research.

This is well aligned with our approach to integration of AI here at Costello Medical: we believe that maintaining a 100% human-in-the-loop approach is essential during this phase of AI integration and our focus remains on accuracy, quality and reliability while adopting innovations where they can really make a difference to efficiency. We use enterprise-grade security for all AI tools to protect data privacy, and test all technical innovations rigorously prior to wider use. We are currently embedding sector-specific solutions into key workflows across the company where we believe they can truly make a difference:

Literature Reviews

  • Trials of genAI integration into our in-house article screening app (while maintaining oversight by two human reviewers) demonstrated significantly enhanced speed and efficiency in abstract screening.
  • Investigating the application of genAI in streamlining the resource-intensive data extraction stage of SLRs.
  • As we continue to build trust in AI, non-systematic reviews that are more exploratory or targeted in nature could provide valuable opportunities to trial both established and innovative AI approaches, whereas it may be more appropriate to retain a 100% human-in-the-loop approach for SLRs for HTA appraisals.

Economic Evaluations

  • GenAI is used to assist some programming workflows and to summarise existing published material; however, usage is ad hoc rather than a formalised process.
  • There is clear potential to expand on this, and we are investigating:
    • Application of AI in scoping model structures based on precedent, rather than just summarising prior publications.
    • Potential for genAI to write R-based code for building components of economic models.

Other Applications

  • Developing appropriate prompts for use of genAI in dossier writing.
  • GenAI for insight gathering and HTA strategy.
  • GenAI for conference coverage and insights.
  • GenAI to support plain language summary generation.

References

  1. Issue Panel 127: Can Generative AI Aid Readiness for Joint Clinical Assessment (JCA)? Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  2. Rawal A, Ashworth L, Luedke H, Tiwari S, Thomas C, Murton M. Developing and Testing AI-Generated PICOS Summaries to Aid in Literature Reviews. MSR107. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  3. Cichewicz A, Pande A, Casañas i Comabella C, Mittal L, Slim M. Feasibility of GPT-4-based Content Extraction to Identify Eligible Titles and Abstracts in a Systematic Literature Review. MSR136. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  4. O’Donovan P, Metcalf T, Heron L, Yakob L. The Potential Use of Artificial Intelligence in Streamlining the Literature Review Process to Support Timely Evidence Generation for JCA Submissions. HTA136. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  5. Issue Panel 102: Can We Trust AI Output? A Trustworthy AI Perspective for HEOR and RWE. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  6. Session 224: HEOR in the Era of Generative AI: Navigating the New Frontiers. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  7. Rawlinson W, Teitsson S, Reason T, Malcolm B, Gimblett A, Klijn S. Assessing the Generalizability of Automating Adaptation of Excel-Based Cost-Effectiveness Models Using Generative AI. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  8. Podium Session P28: Development of De Novo Health Economic Models Using Generative AI. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  9. Podium Session P24: Innovations in Automated Survival Curve Selection and Reporting of Survival Analyses Through Generative AI. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.
  10. Podium Session 142: Generative AI: The Next Frontier in Health Economic Model Conceptualization. Presented at ISPOR Europe Congress, Barcelona, Spain. 2024.

If you would like any further information on the themes presented above, please do not hesitate to contact Helen Bewicke-Copley, Consultant, Ellie Atkinson, Senior Analyst, Emma Worthington, Senior Analyst, or Thomas Kloska, Senior Health Economist. Helen Bewicke-Copley, Ellie Atkinson, Emma Worthington and Thomas Kloska are employees at Costello Medical. The views/opinions expressed are their own and do not necessarily reflect those of Costello Medical’s clients/affiliated partners.
