What does the Emergence of AI really mean for Evaluation and Evaluators?

Since the launch of ChatGPT in late 2022 and the explosion of similar platforms, it has become evident that Artificial Intelligence carries significant potential to reshape industries and change our daily working practices. From more useful customer service chatbots and fraud detection in finance to creating content for entertainment and self-driving cars, the possibilities for leveraging AI seem endless. But what does this mean for us as evaluators?

For the past few months, we’ve been thinking about the emerging challenges and opportunities that the rapid development of AI poses for evaluation. We’ve been discussing the various ways in which AI is likely to impact our work, recognising that it is now not a matter of ‘if’ (that boat has sailed), but ‘how’. We’ve been asking ourselves how AI can help us, and how it might hinder us. We have also been asking how we can utilise this emerging technology in an effective and beneficial way while maintaining the inherently nuanced and ‘human’ nature of development evaluation and consultancy.

Our emerging understanding of capabilities and limitations

The rapid ascent of AI tools and systems – particularly generative large language models (LLMs) such as ChatGPT – urges us to rethink our approaches and explore their opportunities and challenges. As we have begun to explore the possibilities of AI in the evaluation space, we have had to ask ourselves as an organisation how we can harness these tools to strengthen our work while preserving the critical human touch that defines our practice.

So, what is AI’s role in our work now, and what will it become? Eventually, we imagine AI/LLMs will be as ever-present as smartphones, so we have started to define some of their capabilities and limitations within the evaluation sector as we see them. Some of the capabilities that we have identified thus far include:

  • Efficient summarisation and analysis of both quantitative and qualitative data.
  • Considerable potential for rapid and accurate evidence retrieval and management.
  • Automation of burdensome administrative processes.

This list of capabilities is likely to expand as we continue to investigate the opportunities presented by emerging AI tools and systems.

At the same time, we have been careful to identify the limitations and risks associated with the uptake of AI tools and systems. Some of these that we have identified at this early stage include:

  • A lack of nuance in understanding cultural and socio-political contexts.
  • The absence of empathy, which we believe is key to effective and ethical evaluative work.
  • Potential bias and inaccuracy when used for data summarisation and/or analysis, highlighting the importance of maintaining a ‘human in the loop’ when these tools are deployed.
  • Varying levels of transparency and explainability; it is often not possible to explain how results were generated when using commercially available LLMs.
  • The need for constant vigilance to ensure data security; some commercially available LLMs are trained on the data that is provided to them for analysis, and in such cases it is crucial to ensure they are not exposed to personal and sensitive data.

Using AI in evaluation does not occur in a vacuum, and we are conscious of the ethical, technical and regulatory impacts (among others) it might have on our work. Our use of AI will be subject to national and international regulations, and we will need to be up to date on evolving data protection laws and their implications for cross-border evaluations. Following the emerging regulatory frameworks for AI, particularly in the UK and the EU, will be key. We will also need to keep abreast of standards for AI use defined by relevant professional bodies, including the United Nations Evaluation Group.

To address some of these issues, we have established an AI ‘working group’, made up of some of our colleagues who feel passionately about the technology and its potential.  So far, this has resulted in the development of IOD PARC’s principles-based AI policy which outlines a set of behavioural expectations for staff and associates to ensure that our use of AI is fair, transparent, ethical and safe.

Conversations are still ongoing and are crucial to shaping our next steps as an organisation. While they continue, however, it’s important to engage in the conversations happening outside of our own space, as these are what will ultimately shape the work we do. To this end, we recently organised a webinar to explore the various challenges, benefits and risks associated with the use of AI technologies in evaluation. The webinar was attended by our staff, associates, and clients, and featured presentations from the evaluation units at WFP and UNICEF, as well as from independent evaluators, academics, and a specialist provider of AI-powered data analytics for evaluation.

 

Using AI in evaluation: a recent example

IOD PARC has recently completed an evaluation that involved the development of a custom-built Natural Language Processing (NLP) model that assessed the extent and quality of cross-cutting issue integration (gender, human rights, climate change etc.) in the digital archives of a major bilateral donor. NLP is a machine-learning approach to text analysis which can be used to perform, at scale, tasks that are laborious for humans, such as identifying language, analysing sentiment, translating documents, finding terms that commonly appear together, and extracting keywords, content, form, and meaning from text passages.
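To make a couple of the tasks listed above more concrete, here is a minimal illustrative sketch in Python, using only the standard library and toy sentences of our own invention rather than anything from the evaluation itself, of extracting frequent keywords and finding terms that commonly appear together:

```python
from collections import Counter
from itertools import combinations
import re

# Two toy sentences standing in for document passages (illustrative only).
docs = [
    "Gender equality and climate change are cross-cutting priorities.",
    "The programme mainstreams gender equality and human rights.",
]

keyword_counts = Counter()
cooccurrence = Counter()
for doc in docs:
    tokens = [t for t in re.findall(r"[a-z]+", doc.lower()) if len(t) > 3]
    keyword_counts.update(tokens)
    # Count pairs of distinct terms appearing in the same passage.
    cooccurrence.update(combinations(sorted(set(tokens)), 2))

print(keyword_counts.most_common(3))  # most frequent keywords
print(cooccurrence.most_common(3))    # term pairs that commonly appear together
```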

For the evaluation, we developed an NLP model that automatically classified almost 90,000 project documents, based on an automated appraisal of their content with respect to cross-cutting issues. This involved first manually assessing a more limited sample of documents demonstrating varying approaches to, and qualities of, cross-cutting issue integration. These assessments were used to derive a set of logical, linguistic rules that the model would then apply to classify the remaining documents. The logical rules were translated into machine-readable rules using extensive subject-specific keyword lists combined with NLP techniques. Within this rules-based approach, we maintained a focus on model explainability.
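To give a flavour of what such machine-readable rules can look like, the sketch below shows a deliberately simplified, hypothetical keyword-and-threshold classifier for a single cross-cutting issue. The keyword lists, category labels and thresholds are illustrative assumptions only; they are not the rules used in the actual evaluation.

```python
import re

# Hypothetical keyword lists for a single cross-cutting issue ('gender');
# the real evaluation used extensive subject-specific keyword lists per issue.
GENDER_KEYWORDS = {
    "strong": ["gender equality", "women's empowerment", "gender-responsive"],
    "weak": ["women", "girls", "gender"],
}

def classify_gender_integration(text: str) -> str:
    """Classify a passage by how substantively it treats gender (illustrative rules only)."""
    lowered = text.lower()
    strong_hits = sum(lowered.count(kw) for kw in GENDER_KEYWORDS["strong"])
    weak_hits = sum(len(re.findall(rf"\b{re.escape(kw)}\b", lowered))
                    for kw in GENDER_KEYWORDS["weak"])
    # Illustrative thresholds: substantive phrases -> 'integrated';
    # incidental mentions only -> 'mentioned'; otherwise 'absent'.
    if strong_hits >= 2:
        return "integrated"
    if strong_hits == 1 or weak_hits >= 3:
        return "mentioned"
    return "absent"

print(classify_gender_integration(
    "The project promotes gender equality and women's empowerment among girls."
))
```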

At periodic points during implementation, the keyword lists and manual rules were augmented with automatic rules suggested by the model, based on common language and structure appearing in the passages that led to each categorisation. Human interpreters verified these automatic suggestions. The automatic rules served two functions: improving the precision and accuracy of the model, and providing a sense-check that the model was classifying documents correctly.
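The sketch below illustrates, again in deliberately simplified and hypothetical form, how candidate terms might be surfaced automatically from already-classified passages for a human reviewer to accept or reject; the stop-word list, function name and ranking-by-frequency approach are assumptions for illustration, not the method used in the evaluation.

```python
from collections import Counter
import re

# Hypothetical, minimal stop-word list for illustration only.
STOPWORDS = {"the", "and", "of", "to", "in", "a", "for", "is", "on", "with", "are"}

def suggest_candidate_keywords(passages, existing_keywords, top_n=10):
    """Rank terms that recur in already-classified passages but are not yet in the manual keyword lists."""
    counts = Counter()
    for passage in passages:
        tokens = re.findall(r"[a-z][a-z\-']+", passage.lower())
        counts.update(t for t in tokens
                      if t not in STOPWORDS and t not in existing_keywords)
    # A human interpreter reviews each candidate before it is added to the rules.
    return [term for term, _ in counts.most_common(top_n)]

# Example: surface candidate terms from passages already classified as 'integrated'.
candidates = suggest_candidate_keywords(
    ["The intervention strengthened gender-responsive budgeting and social inclusion."],
    existing_keywords={"gender", "gender-responsive"},
)
print(candidates)
```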

The results of the model provided valuable insights into how the organisation’s project documentation approached cross-cutting issues. These insights contributed independently to the evaluation’s overall findings, and also served as a useful point of triangulation for results that emerged through other evidence streams.