Back to Articles

Code Quality in LLM Applications: Beyond Model Performance

April 1, 2025Andrew Marble

Introduction

Anthropic defines LLM workflows and agents as1:
  • Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
  • Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

Any non-trivial LLM application falls into one of these categories; that is, they combine LLMs with code and data flow to do some useful task. Despite being but one of the components, the LLM gets outsized attention when evaluating LLM apps. Code quality plays an under-appreciated role in their security and performance.

We are building Kereva2, an open source code scanning tool for LLM apps that allows policies or standards related to code to be enforced, in order to mitigate security or performance issues that stem from the code, rather than the LLM itself. The purpose of this article is to discuss the kinds of policies that are important, and how they impact system performance.

Policy Categories

Policies will depend on the application and the organization. But the considerations can be grouped according to the prompts, data, and outputs, as described below.

Prompts

We can consider prompts to be the "instruction" text passed to the LLM. In application code, they are often included as static text, with placeholders for data that get added at run-time. For example, a simplified RAG (question answering based on context retrieved from a database) prompt might look like:

Please answer the question below based on the provided context:
<question>
{question}
</question>
<context>
{context}
</context>

We have an instruction (answer the question) and then placeholders consisting of XML tags enclosing variables in curly braces. When the application is run, a question would be sourced from the user, and relevant context from a database, and these combined into the text sent to the LLM.

OpenAI and Anthropic publish a model specification3 and prompting guidelines4 respectively that provide information on how prompts should be construed for their models. Policies can be designed to confirm that prompts are written in accordance with these guidelines. For example:

  • OpenAI mentions that plaintext in quotation marks, YAML, JSON, XML, or untrusted_text blocks are all treated as un-trusted.
  • Anthropic states that inputs should be enclosed in XML tags.

These naturally give rise to policies we can scan for.

APIs often take care of prompt formatting such as adding tags to denote user and system prompts, but some local model frameworks like HuggingFace allow this formatting to be applied programmatically. Again, checking that the correct formatting is being applied can be enforced in a code scan.

Prompts can also negatively impact model behavior. For example, a recent study identified prompt characteristics that could result in bias5 in LLM output. Prompts that are vague or under-specified can also lead to over-reliance on the model's "judgement" and not reflect user intent. For example, asking a LLM to pick the "best" candidate for a job, without specifying criteria, exposes the selection to any biases or preconceptions built into the model.

Organizations may also have specific prompting practices related to system prompts, included information, rules for the LLM to follow, etc. that can be enforced.

Data

LLM apps will often combine user input with other data such as search results in prompts. This data may be trusted (for example from a pre-vetted database, or from users with the same privilege level) or untrusted (such as web data, documents from un-trusted users, etc.). An exploit discovered in Slack's AI features last year consisted of a user without privileges (for example not part of a private channel) uploading a document to a public channel that contained a prompt injection attack6. When a privileged user viewed the malicious document, the hijacked LLM would summarize privileged information and then exfiltrate it, encoded in base64 through hyperlink rendered in markdown to look benign. The prompt itself is often not trusted in public facing systems, where prompt injection could lead to embarrassing behavior or sharing of unwanted information.

Steps to prevent such attacks can include sanitization of inputs. Organizations may want to enforce specific sanitization requirements, such as using packages like Guardrails7. Compliance with a sanitization policy can be checked at the code level.

Data flow also has a major impact on system performance characteristics, such as answer correctness or hallucination. When an LLM is using contextual data to make a response, it's important that this data is complete, relevant, and unambiguous. While not all of these aspects stem from code quality, some standards can be enforced through code scans. Examples include:

  • The "chunking" strategy – how longer data is divided up to be fed to a prompt
  • The content presented to the LLM (such as including context like document and section titles)

Output

After the LLM provides a response, the handling of this output affects security and performance. Enforcing output structure and limits with in-built API functionality or tools like Outlines8 can be an important first step. For example, using enums (lists of allowable values), types, and range limits to constrain allowable output can improve both security and performance.

When LLM outputs are used to take further programmatic steps, agency is also important. OWASP defines excessive agency as a "vulnerability that enables damaging actions to be performed in response to unexpected, ambiguous or manipulated outputs from an LLM…"9 This can happen when function calls or code execution are based on LLM outputs. Organizations may want to ensure that:

  • No LLM output is directly executed, such as through Python's exec command or a shell
  • Potentially dangerous functions that can be called on LLM output are either scanned for or limited to a known list
  • An LLM agent that executes shell commands as part of its output could be prevented from using dd or rm or allowed only to use ls or grep if these were deemed harmless

Output rendering is also important10. Some attacks use markdown rendering to conceal harmful hyperlinks. Displays that render "social previews" – those links that contain a little picture and some text that appear when you post a url in Slack or Messenger – will automatically send a request to an outside website that can be a data exfiltration vector. Organizations may want to limit how output is processed or rendered to only "safe" methods, possibly disallowing hyperlinks altogether.

Finally, output sanitization, for example checking for commercially inappropriate outputs, along similar lines to input sanitization, may also be an appropriate policy. Especially concerning are completely unsanitized paths from an untrusted input to an output (either presented to a user or prompting an action). With current LLMs, all of which are susceptible to jailbreaks, such paths can allow an attacker to manipulate the output as they want.

Conclusion

As LLMs have improved over the past few years, their performance has received the most attention in relation to validating the behavior of LLM apps. As we've seen, the code quality is also extremely important. Organizations building or using LLM apps should be aware of the relationship between code features and application performance and security, and work to define and enforce policies to ensure their apps are built responsibly.

The considerations given here are just examples. It's worth reviewing organizational security and performance requirements, understanding the specific features of the code that may have an impact, and enforcing appropriate policies.