Why Evaluation, Not Practice, Makes Perfect AI

A change in perspective may be the key to getting the greatest value from AI tools.

If you’ve interacted with artificial intelligence (AI), you’ve probably received outputs that were just what you were looking for – and you may have had the opposite occur as well. Think of a virtual assistant that doesn’t provide the information you need the first (or second) time you ask. So, considering that an AI model can deliver both desirable and undesirable outputs, is it possible to attain perfect AI?

Erik Vogt, vice president, enterprise solutions at Appen, shares his insights.

Is perfect AI attainable?

Vogt: Perfection is not an option – it’s not even a meaningful concept. Speaking of which, there isn’t a way of defining perfect human intelligence either. As we think about intelligence of any kind, it’s worth looking very closely at what one is trying to achieve, and what tools and processes are being used. For example, a pocket calculator is pretty close to perfect mathematical intelligence for the functions it is programmed to process. What is much harder is to formulate the question – and that is not something AI is designed to do.

Since generative AI is what we are talking about so much these days, we might consider what it means to have a perfect answer. I’ve run a lot of queries through ChatGPT and got a sense of what it can do; the capabilities are impressive, but the limitations are evident. ChatGPT can answer questions really well when it has ample data on the topic, but it can’t answer what it doesn’t know, and it clearly doesn’t know everything. It can also make mistakes and confidently assert that nonsense is an established fact. In that sense, ChatGPT maybe replicates human intelligence a little too well (although for very different reasons!). So, the real question isn’t whether “perfect” AI is possible, but rather how we can get better at asking questions so that AI can give us insights and answers we can use.

With a narrower definition of what you want AI to do, and especially with the potential to tune the solution to a specific use case through better training, better prompt generation, or better context support, AI is already able to add value. We have already adapted AI, broadly speaking, to many narrow fields, such as the game of chess or fraud detection, but there is an unbelievable number of current and potential use cases for AI, and they all have different value properties and challenges. I seem to see interesting new use cases every week, and we’ve only begun to see the potential.

To make specific AI deployment goals a reality, you must make sure you understand the question you are asking of the model, what the consequences of failure are, and where the value of a better answer can be realized. You must also implement processes that ensure the data used to train the model reflects the use case for which the model is being developed. For models that interact with people or whose output has a commercial impact on people, addressing bias and considering privacy law will most likely be paramount. If your model is intended to operate a vehicle, several AI systems with different functions are involved: understanding the surroundings, navigation, understanding driver behavior including verbal input and output, and much more. Each of these systems uses a different kind of intelligence and has different problems to solve, data to manage, and consequences of failure. And of course, regardless of how intelligent a model can be, it still requires a correct formulation of the right question.

Explain the testing process via the human in the loop during training.

Vogt: Human in the loop is a branch of AI development that combines machine learning capabilities with human oversight to create stronger machine learning models. The process involves presenting humans with low-confidence input so that they can turn it into higher-confidence data. It would surprise many to know just how vast an army of humans is involved in providing high-quality search functionality. AI aside, search relevance is about aligning the output of intuitive queries to high-confidence results, all while the pool of input data keeps growing, along with contamination and all sorts of other garbage. Given the changing data and search needs, net performance would decline without both the development of improved models and the careful application of human-driven data collection, annotation, and validation. Through continuous human insight and input, the people behind AI become just as important to its success as the machines and data powering it.

Recent research has also underscored the increased emphasis AI workers are putting on the human-in-the-loop approach. In fact, 97 percent of respondents to Appen’s State of AI:Automotive survey noted that human-in-the-loop evaluation is important for accurate model performance, further confirming that practitioners are prioritizing more collaboration and oversight across the AI lifecycle.
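The routing step Vogt describes, where low-confidence items go to people and come back as higher-confidence data, can be sketched roughly as follows. This is a minimal illustration, not Appen's process: the `Prediction` class, the 0.8 threshold, and the choice to record human-reviewed labels with a confidence of 1.0 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's estimated probability for `label`, in [0, 1]


def route_predictions(predictions, threshold=0.8):
    """Split predictions into auto-accepted items and items needing human review.

    The threshold is a hypothetical value; in practice it would be tuned
    against the cost of errors for the specific use case.
    """
    accepted, needs_review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            needs_review.append(p)
    return accepted, needs_review


def apply_human_labels(needs_review, human_labels):
    """Replace low-confidence model labels with labels supplied by reviewers.

    `human_labels` maps item_id -> label. Items no reviewer has labeled yet
    are left out of the result. Treating a human label as confidence 1.0 is
    a simplifying assumption; real annotators disagree and make mistakes.
    """
    resolved = []
    for p in needs_review:
        if p.item_id in human_labels:
            resolved.append(Prediction(p.item_id, human_labels[p.item_id], 1.0))
    return resolved


if __name__ == "__main__":
    preds = [
        Prediction("q1", "relevant", 0.95),
        Prediction("q2", "relevant", 0.55),
        Prediction("q3", "irrelevant", 0.80),
    ]
    accepted, review_queue = route_predictions(preds)
    resolved = apply_human_labels(review_queue, {"q2": "irrelevant"})
    print([p.item_id for p in accepted], [p.item_id for p in review_queue])
    print(resolved)
```

The reviewed items can then be added back to the training or test set, which is where the "stronger model" part of the loop comes from.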

Is additional training or testing necessary after implementation? If so, what’s the best way to manage it?

Vogt: Yes, as any data scientist will tell you, consistent and ongoing testing and training are critical for the performance of any but the most contained AI solutions. A few key areas deserve attention. One is determining the metrics you will use to evaluate whether the model is performing, usually starting with accuracy, precision, recall, and F1 score. Most models interact with the real world and the real world is changing, so even if the model performed sufficiently to begin with, new input may not continue to be handled the same way. In addition, user expectations change, new products or services may be added, and in many cases, there are good reasons to expand the utility of a deployed model by expanding the base of “skills” the AI can handle. It’s also important to understand the model’s known biases and to keep monitoring for new ones. In some cases, new government regulations can also change the expectations of fairness, privacy, and levels of precision. Like software, one should consider any deployed model as being in constant development until it sunsets.
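For readers who want to see the four starting metrics concretely, here is a small sketch that computes them for a binary classifier from lists of true and predicted labels. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this example; libraries such as scikit-learn offer equivalent, more complete implementations.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    # Share of all predictions that were correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Of the items predicted positive, how many really were positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Of the items that really were positive, how many the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

Which metric matters most depends on the consequences of failure: a fraud model may favor recall, while a model that triggers costly manual reviews may favor precision.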

How can AI keep up when the data it interacts with updates?

Vogt: Robust data is the lifeblood of any AI deployment, both in terms of the initial training data and the veracity of test data. By providing AI algorithms with robust and up-to-date datasets, AI practitioners can help maintain and hopefully improve the performance of deployed models. Put another way, not measuring the performance of your model routinely reduces the confidence of the output and increases the risk of, and possibly the consequences of, failure – however that is defined. Given increasing user expectations for accuracy, increased competition, tighter regulatory environments, and rapidly changing input data, one might say that not updating your data is appropriate only for models you are already planning to sunset soon.

Again, human oversight is vital, both from the standpoint of improving the confidence of training data and for validating model performance.
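Routine measurement of a deployed model can be as simple as scoring each new batch of human-validated data and comparing it to the accuracy recorded at launch. The sketch below assumes such batches exist; the function names, the batch format, and the 0.05 tolerance are hypothetical choices for illustration, not a recommended standard.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the human-validated labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def check_for_degradation(baseline_accuracy, batches, max_drop=0.05):
    """Score each batch and flag those that fall too far below the baseline.

    `batches` is a list of (name, y_true, y_pred) tuples, for example one
    per week of freshly validated production data. A batch is flagged when
    its accuracy is more than `max_drop` below `baseline_accuracy`.
    """
    report = []
    for name, y_true, y_pred in batches:
        acc = accuracy(y_true, y_pred)
        degraded = (baseline_accuracy - acc) > max_drop
        report.append((name, acc, degraded))
    return report


if __name__ == "__main__":
    truth = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    week1 = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # one error
    week2 = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # three errors
    batches = [("week1", truth, week1), ("week2", truth, week2)]
    for name, acc, degraded in check_for_degradation(0.9, batches):
        status = "RETRAIN?" if degraded else "ok"
        print(f"{name}: accuracy={acc:.2f} {status}")
```

A flagged batch is a prompt for people to look at what changed in the data, not an automatic verdict on the model.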

Can you provide any additional information on how AI can provide the greatest value?

Vogt: I was listening to a radio show the other day about how AI was being handled in schools. Some schools are trying to ban the use of AI; others are trying to understand how humans can still learn with AI ready to answer any question that it’s presented with. Some are considering ways to incorporate it into classrooms and to help prepare students for a future of working with, and for some, helping to train and manage, AI systems. The theme that seems to come up again and again for me is that this is a great opportunity for us to focus on how developers add value by asking better questions.

In a world in which answers abound, knowing what to ask and how to ask it will become the most valuable skill of the future. Knowing the right questions helps us find the limitations of AI models, understand appropriate use cases, and anticipate risks and opportunities. As with any technology we develop, understanding why we’re building it, who creates value with it and how, and how it will be used, misused, abused, and misunderstood will be among the most valuable skills for developers. Whenever one writes code, whether with the assistance of AI or not, one needs to know what it’s intended to do and how to evaluate whether it met its intended purpose.

Mike Monocello

The former owner of a software development company, with more than a decade of experience writing for B2B IT solution providers, Mike is co-founder of DevPro Journal.
