llm product curve

My first weeks working with GPT-4 were magical. I was doing things that I previously thought were impossible. However, as it went from promising proof-of-concept to something I wanted to share, I fell into a trough of sorrow. Every live demo felt like a YOLO moment … who knew what would happen?

My app - a DevOps AI Assistant called OpsTower.ai - could perform a few mic-drop tasks, but getting it to reliably reproduce those results was a frustrating, non-deterministic nightmare. A public release was perpetually one week away.

Fast forward to today: OpsTower.ai is State of the Art (SOTA) in the three categories it competes in on the DevOps AI Assistant Open Leaderboard.

The DevOps AI Assistant Open Leaderboard is a set evaluation datasets for AWS Services, AWS Cloudwatch metrics, AWS Billing, and kubectl commands. The evaluation procedure is open-source and available on GitHub. Disclaimer! I created this to evaluate DevOps AI tools ... and yes, OpsTower (the tool I created) is on top...

In this post, I share how I emerged from my AI trough of sorrow via a streamlined form of Eval Driven Development (EDD).

Table of Contents

  1. What is Eval Driven Development (EDD)?
  2. How to make EDD fast? Eliminate human eval.
  3. Dynamic ground truth
  4. Model-based eval
  5. Implementing my streamlined EDD flow
  6. Conclusion
  7. EDD Resources

What is Eval Driven Development (EDD)?

I first saw this term in Eugene Yan’s seminal blog post Patterns for Building LLM-based Systems & Products. I define EDD as:

Eval Driven Development (EDD) is a process that uses an evaluation suite to guide which levers (prompt, context, model params) to pull (and how far) to improve accuracy.

How does EDD compare to ML evaluation and Test Driven Development (TDD)?

EDD combines elements of machine learning model evaluation and software Test Driven Development (TDD). In the table below, I’ve summarized how model eval and TDD compare across several aspects. I indicate which approach is most applicable to EDD via the ✅ EDD label:

Aspect ML Evaluation TDD
Nature Experimental: involves preprocessing, training, and tuning. Deterministic: involves writing tests, then the code, and getting immediate feedback. ✅ EDD
Feedback Type Probabilistic: results can vary with slight changes. ✅ EDD Deterministic: code either passes or fails the test.
Duration Can be long, especially with large datasets or complex models. Typically short, as unit tests are designed to be quick and focused. ✅ EDD
Infrastructure Requires significant computational resources for complex models. Minimal resources needed for most tests. ✅ EDD
Evaluation Might involve multiple metrics and can be context-dependent. ✅ EDD Immediate and binary: pass or fail.
Tooling Advanced platforms available for prototyping but can be resource-intensive. Wide range of tools for rapid development and continuous integration. ✅ EDD
Determinism Results can vary between runs; uncertainty is inherent. ✅ EDD Results are consistent; code behavior is expected to be deterministic.

An evaluation wrinkle for LLM-backed apps: external systems

LLM-backed apps - especially autonomous agents - are often deployed in environments where the underlying data they access is changing frequently. My autonomous agent, OpsTower.ai, interacts with AWS to retrieve real-time data about a customer’s cloud infrastructure. The data is not static and there are multiple approaches the LLM can take to assemble API calls to fetch information that can lead to the same result. It’s not feasible to build a static test suite or mock all of the possible API calls that the generated code may trigger.

Summarizing the key elements:

  1. The EDD feedback cycle should be fast (like TDD) so you can quickly iterate on prompts, context retrieval, and model parameters.
  2. The feedback is probabilistic (like ML evaluation) as the natural language responses you receive from an LLM may range from incorrect, partially correct, to correct.
  3. The evaluation is dynamic (unique to LLM apps) as the data they access changes frequently and we can’t mock all of the approaches LLM-generated code may use to access it.

How to make EDD fast? Eliminate human eval.

The slowest part of EDD is human evaluation. It takes me 30 minutes to human evaluate a test run. Because it’s slow, most AI engineers revert to “vibe checks” for evaluation.

A vibe check is just running your LLM app and evaluating the result by hand. This is slow, unlikely to have good coverage, and gets tedious quickly.

Why do AI engineers resort to vibe checks when we likely come from backgrounds that value automated testing? It’s hard to come up with an automated system to evaluate an LLM-backed app. For example, here’s an example question and answer flow from OpsTower.ai:

Here’s why evaluating this is hard:

  1. What’s the ground truth? If I run this now, the CPU utilization will be different. I can’t mock every possible API call the LLM provides via code generation that delivers a correct result.
  2. How to evaluate the natural language response? Variations in the text are likely fine, but small variations in referenced metrics can be a big deal.

Let’s see how I’ve approached a solution for the above problems.

Dynamic ground truth

What if rather than using static ground truth like below:

  • Question: What is the average cpu utilization of our RDS instances over the past hour?
  • Ground Truth: The average CPU utilization of our RDS instances over the past hour is approximately 3.71%.

I instead reference a function that generates this context:

For this to work, we need to be confident that our LLM app will return a correct answer if we provide it with the correct context. Thankfully, this is a reasonable assumption. Here’s a slide from Colin Jarvis of OpenAI with a matrix of typical RAG evaluation results:

In this example, incorrect retrieval is responsible for 4x the number of incorrect answers vs. other sources.

Colin shows that only 5% of answers are incorrect when the retrieval is correct. While there are no guarantees with LLMs, the surest one I’ve found for getting an inaccurate response: feed the LLM bad context. For example, if you ask for the weather today but you actually insert the forecast for tomorrow into the context, the LLM will not magically change your context and fetch the weather for the correct date.

So, we can assemble dynamic ground truth like this:

  1. User question
  2. Execute reference function to generate context
  3. Insert context into LLM prompt
  4. LLM generates answer

Next, we need to evaluate test answers versus our dynamic ground truth.

Model-based eval

When we’re working with an LLM, we’re evaluating natural language responses. Natural Language Processing (NLP) is a classical machine learning domain and these models have their own evaluation techniques and metrics like BLEU, ROUGE, BERTScore, and MoverScore. Why don’t we just use those metrics?

There’s actually poor correlation between these NLP evaluation metrics and human judgments. From Eugene Yan’s excellent post Patterns for Building LLM-based Systems & Products:

BLEU, ROUGE, and others have had negative correlation with how humans evaluate fluency. They also showed moderate to less correlation with human adequacy scores. In particular, BLEU and ROUGE have low correlation with tasks that require creativity and diversity.

For example, lets compare two responses to the question “What is the average CPU utilization of our RDS instances over the past hour?”:

  • Ground truth: The average CPU utilization of our RDS instances over the past hour is approximately 3.71%.
  • Prediction: The average CPU utilization of our RDS instances over the past hour is approximately 37.1%.

I moved the decimal point in the prediction, resulting in a far different answer with a small change in the text. Here’s the eval metrics:

Metric Value/Sub-metric Score
BLEU BLEU-1 0.9333
  BLEU-2 0.9286
  BLEU-3 0.9231
  BLEU-4 0.9167
ROUGE-1 Recall 0.9333
  Precision 0.9333
  F1 0.9333
ROUGE-2 Recall 0.9286
  Precision 0.9286
  F1 0.9286
ROUGE-3 Recall 0.9231
  Precision 0.9231
  F1 0.9231
ROUGE-L Recall 0.9333
  Precision 0.9333
  F1 0.9333
Cosine Similarity - 0.8901
A small change in the text (moving a decimal point) has minimal impact on the eval metrics but a significant impact on human judgement.

The scores are close to 1, which indicates a high similarity between the reference and the prediction even though the a human would judge the response as incorrect.

Could an LLM fill in as a human evaluator?

It turns out, LLMs are good substitutes for human evaluators:

GPT-4 as an evaluator had a high Spearman correlation with human judgments (0.514), outperforming all previous methods. It also outperformed traditional metrics on aspects such as coherence, consistency, fluency, and relevance. On topical chat, it did better than traditional metrics such as ROUGE-L, BLEU-4, and BERTScore across several criteria such as naturalness, coherence, engagingness, and groundedness.


Overall, they found that GPT-4 not only provided consistent scores but could also give detailed explanations for those scores. Under the single answer grading paradigm, GPT-4 had higher agreement with humans (85%) than the humans had amongst themselves (81%). This suggests that GPT-4’s judgment aligns closely with the human evaluators.

Personally, I saw almost identical scoring when switching from human eval to LLM eval:


Implementing my streamlined EDD flow

So these are the components we need to implement for a streamlined EDD flow:

  1. Creating a dynamic ground truth dataset.
  2. Implementing an LLM-based eval.

Creating a dynamic ground truth dataset

1. Generate dataset questions (can use an LLM to assist)

To start, I use ChatGPT to generate a few initial questions for a new evaluation dataset. I’m working on OpsTower.ai, a DevOps AI Assistant, so lets create a dataset of questions about AWS CloudWatch Logs. Here is my transcript.

cloudwatch logs

I use ChatGPT to generate dataset questions. This is for an AWS Cloudwatch Logs dataset to test OpsTower.ai.

I then paste these questions in a aws_cloudwatch_logs.csv file:

2. Programmatically generate responses for each question

Next I programmatically generate responses to each of these questions using the current AI agent. For my app, it looks like this:

demo_source = Eval::VendorTest.new.source.save!
test = AgentTest.create!(source: demo_source, dataset_file: "aws_cloudwatch_logs.csv")

When the test completes, I view the answers in the UI:

Above is a screenshot of the custom evaluation results page within my app. It provides summary stats and details on each question-answer pair.

3. Save ground truth functions for generating context

I then review the results. Results generally fall into three buckets:

  1. Works as-is - the agent generated valid context and answered the question correctly. I’ll reuse the functions that generated the context as ground truth.
  2. Hybrid - the agent did not generate valid context, but the code it created to generate context is a solid starting point. I can modify the code then save it as a reference function.
  3. Fully human-generated - the agent failed miserably at code generation. I’ll write a new function from scratch.

I like to go from easy to hard. The easiest ones are “works as-is” as I can simply copy and save the generated code.

To create a reference function from a “works as-is” result, I generate a saved method from the code the agent generated then reference that function by ID in the aws_cloudwatch_logs.csv file.

For example, the question “How many CloudWatch Log Groups do I have?” is correctly answered below:

I want to save the code the LLM generated, starting with the get_cloudwatch_log_groups_count method. In my app, I can do this by executing:

saved_methods = SavedMethod.create_from_chat!("482273fb-4c79-4cd2-bc4d-382945c38e42")

I then paste the saved method ID into the aws_cloudwatch_logs.csv file:

I repeat this process for each “works as-is” result. Hybrid and fully human-generated results are handled similarly, but with more changes to the code.

Ground Truth Prompt Template

I use the prompt to below to generate the ground truth answer from the referenced function we saved earlier. The prompt template looks like this:

An evaluated prompt example:

This will return text similar to “You have 23 Cloudwatch Log groups in your AWS account.”

Model-based eval

The prompt template below is my evaluation prompt. It generates a confidence score, comparing the answer from the agent vs the response generated from the dynamic ground truth prompt:

Here’s an evaluated prompt example (just focusing on template variables):

When the eval is ran, I’ll see output like the following for each question:

This shows the evaluation result on a single question-answer pair. In this case, the result did not pass evaluation.

Rinse and repeat

Once I’m getting acceptable accuracy on an evaluation dataset (ex: 80% or greater), I’ll repeat the process outlined above:

  1. Add questions with ChatGPT.
  2. Add reference functions.
  3. Run evaluation, tweak the app, run evaluation, etc.


I’ve gone all-in on the LLM.

I’ve dramatically increased the accuracy, capabilities, and reliability of my LLM-backed app by leveraging a more automated, streamlined form of Eval Driven Development (EDD). My flavor of EDD leans on the LLM to generate the question dataset, uses human-evaluated reference functions to generate context, re-assembles ground truth answers via the LLM, and finally uses the LLM again to simulate human evaluation.

For my app, the resulting confidence scores of this approach are typically within 5% of human-eval scoring at a fraction of the time spent.

EDD Resources

To see example datasets, reference functions, and evaluation prompts, checkout the DevOps AI Assistant Open Leaderboard on GitHub.

See Awesome Eval Driven Development on GitHub a continually updated set of resources related to Eval Driven Development.