Back to Articles

Deterministic scans of AI model implementations

March 15, 2025Andrew Marble

Introduction

Test-based evaluation of AI systems forms only a part of the quality assurance process. We discuss other offerings that add on to evals, and mention an open-source static analysis tool we have built to scan code implementing LLM workflows for issues that could lead to security and performance problems.

We previously discussed the gaps in using "evals" or test-based performance measurement for systems that use large language models*. Part of the problem is that evals look at behavior and don't go to root cause. Passing evals does not mean there are no underlying issues, it just means particular symptoms are not present.

Understanding AI Model Complexity

AI models, although technically deterministic (with caveats), are intractably complex and so we cannot predict how they will respond to every input, we can only test. But, when we evaluate systems and find deficiencies, these are very often traceable to some deterministic root cause and not random fluctuation. For example, hallucination in RAG is likely to arise from incomplete or ambiguous information included as context, rather than appearing out of the blue. On the other hand, some behaviors, like security concerns due to prompt injection, are an inevitable consequence of LLM architecture. These can only be eliminated by sanitization of inputs or outputs; behavioral evaluation only checks how easy the exploit is for some known set of attacks, not whether it's possible.

Architecture Analysis Tools

The point is that system architecture is important, and evals don't test architecture directly, just behavior. There are two families of tools we've seen that emphasize architecture:

1. Execution Trace Analysis

First are those that log and analyze traces of AI program execution:

  • Arize maintains an open source observability framework that records execution traces during evals and/or production runs. These are particularly useful when building with more complicated frameworks like LangChain where data flows through many steps before the actual LLM call. The traces let you examine and debug this flow.

  • Invariant Labs has a trace analyzer that scans agent execution traces for evidence of bugs or security violations. They support user defined policies to scan for specific issues and built-in security scans.

The tracing approach is good because it gets at both the behavior and architecture to give a full view of why the system behaves as it does. Additionally, the scanning capabilities of such tools can support runtime scanning, similar to software like Guardrails.

The downside is that these scans rely either on test or runtime inputs and are in that sense reactive - the only way to find an issue is when an input that triggers one comes along. Thus the Rumsfeld problem (we don't know what we don't know) and its cousin (we may know jailbreaks are possible, we just didn't find one) still exist.

2. Direct Code Analysis

The second option for scanning is direct code analysis:

  • Protect AI's Model Scan looks at the model code itself for security exploits. However, these focus on supply chain attacks compromising model execution rather than behavior vulnerabilities of LLMs (like prompt injection).

  • Agentic Radar by splxAI scans code (as of writing it focuses on agent frameworks from LangChain and CrewAI) to look for potential security issues. The advantage here is that the code scan addresses the root cause rather than behavior and so gives more confidence about coverage of vulnerabilities because it doesn't rely on testing specific prompts. It can also be applied simply during development, run offline. The tool provides workflow visualization, tool identification, and vulnerability mapping to frameworks like OWASP Top 10 LLM Applications.

  • Kereva-Scanner is our recently released framework and application agnostic code scanner that looks for both security and performance issues in LLM code. It currently supports python and uses abstract syntax tree analysis to look at both prompts and code flow to scan for different vulnerabilities. The scanner is open-source, with pre-built scans, and custom rules.

Looking Forward

For traditional software, unit tests, trace-debugging, and static analysis tools all form a toolkit to check for application security and quality. The same evolution is happening with LLM software. Speculation is that evals have had the emphasis because of how LLMs evolved out of academic AI where model validation was the core activity in testing performance. But as LLMs become the foundation of commercial software, there is going to be increased emphasis on other testing, and static scanning has a role to play that is currently under-appreciated.